Fast bitwise AND buffers in C
`#include <stdint.h>
#include <malloc.h>
#define BUFSIZE 0x20000
void main(int argc, char ** argv) {
uint64_t * buf1;
uint64_t * buf2;
buf1 = (uint64_t *) malloc(BUFSIZE * sizeof(uint64_t));
buf2 = (uint64_t *) malloc(BUFSIZE * sizeof(uint64_t));
// (initialization stuff)
for(int i=0; i</malloc.h></stdint.h>`
I need to do this hundreds of times in a row, so I care about the speed of each iteration. Is there any faster way to do this, like a memcpy() variant with bitwise operations or something to that effect?[/i][/i]
8 Replies
uint64_t *p1;
uint64_t *p2;
...
p1 = buf1;
p2 = buf2;
for (i = 0; i < BUFSIZE; i++)
*p1++ &= *p2++;
I also find that some machines can count down faster than counting up, because they can compare to zero more quickly than comparing to the loop limit, e.g.:
i = BUFSIZE;
while (i--)
*p1++ &= *p2++;
edit: left out the &
Anyway, I did something ugly: I programmatically unrolled the entire loop, taking advantage of the fixed-size buffers. This produced 132K lines of C code, and a 6MB executable. It runs twice as fast!
Nobody said it has to be pretty.
@Guspaz:
Aren't there some vector versions of this that might speed it up? PAND or something? I think that works on 64-bit chunks in vectors, dunno if the compiler is smart enough to use the MMX instructions for it.
Thanks, this is what I was looking for. The compiler won't do it automatically; looks like it requires using inline assembly to call the MMX instructions.
I bet CUDA or OpenCL could parallelize this like crazy. Too bad Linodes don't have a GPU!
@Stever:
Depending on the compiler optimizations that are enabled, this might be faster:
uint64_t *p1; uint64_t *p2; ... p1 = buf1; p2 = buf2; for (i = 0; i < BUFSIZE; i++) *p1++ &= *p2++;
(stupid comment on BUFSIZE vs 64-bit pointer step removed because I didn't read the code properly…)
Edit: I'm an idiot. Forget I said anything…..
@funkytastic:
… I programmatically unrolled the entire loop …
-funroll-loops?
@Azathoth:
-funroll-loops?
http://gcc.gnu.org/onlinedocs/gcc-4.5.3 … ze-Options">http://gcc.gnu.org/onlinedocs/gcc-4.5.3/gcc/Optimize-Options.html#Optimize-Options
Wow, that's so much better than the kludge I had going. It runs four times as fast as my pre-unrolled behemoth, and doesn't take 30 seconds for gcc to compile! Can't believe I didn't know about that. Thanks.
@pclissold:
funroll-loops.info
How informative.:)