Fast bitwise AND of buffers in C

I have two one-megabyte buffers that I need to get the bitwise AND of. It's fine to overwrite one of the buffers to store the result. Currently I'm doing

  #include <stdint.h>
  #include <stdlib.h>

  #define BUFSIZE 0x20000

  int main(int argc, char **argv) {
      uint64_t *buf1;
      uint64_t *buf2;

      buf1 = (uint64_t *) malloc(BUFSIZE * sizeof(uint64_t));
      buf2 = (uint64_t *) malloc(BUFSIZE * sizeof(uint64_t));

      // (initialization stuff)

      for (int i = 0; i < BUFSIZE; i++)
          buf1[i] &= buf2[i];

      return 0;
  }
I need to do this hundreds of times in a row, so I care about the speed of each iteration. Is there any faster way to do this, like a memcpy() variant with bitwise operations or something to that effect?

8 Replies

Depending on the compiler optimizations that are enabled, this might be faster:

  uint64_t *p1;
  uint64_t *p2;

...

  p1 = buf1;
  p2 = buf2;

  for (i = 0; i < BUFSIZE; i++)
    *p1++ &= *p2++;

I also find that some machines can count down faster than counting up, because they can compare to zero more quickly than comparing to the loop limit, e.g.:

  i = BUFSIZE;
  while (i--)
    *p1++ &= *p2++;

edit: left out the & :oops:

Cool suggestion about counting down to zero. I remember doing that on some FPGA for a class years ago.

Anyway, I did something ugly: I programmatically unrolled the entire loop, taking advantage of the fixed-size buffers. This produced 132K lines of C code, and a 6MB executable. It runs twice as fast!
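
For anyone curious what that looks like in miniature, here's a sketch of the same idea as a partial unroll by 8, rather than the fully generated code (just an illustration, not what my script actually emitted):

  // Partial unroll: 8 AND operations per loop iteration.
  // BUFSIZE (0x20000) is a multiple of 8, so no remainder handling is needed.
  for (int i = 0; i < BUFSIZE; i += 8) {
      buf1[i]     &= buf2[i];
      buf1[i + 1] &= buf2[i + 1];
      buf1[i + 2] &= buf2[i + 2];
      buf1[i + 3] &= buf2[i + 3];
      buf1[i + 4] &= buf2[i + 4];
      buf1[i + 5] &= buf2[i + 5];
      buf1[i + 6] &= buf2[i + 6];
      buf1[i + 7] &= buf2[i + 7];
  }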

Nobody said it has to be pretty. :D

Aren't there some vector versions of this that might speed it up? PAND or something? I think that works on 64-bit chunks in vectors, dunno if the compiler is smart enough to use the MMX instructions for it.

@Guspaz:

Aren't there some vector versions of this that might speed it up? PAND or something? I think that works on 64-bit chunks in vectors, dunno if the compiler is smart enough to use the MMX instructions for it.
Thanks, this is what I was looking for. The compiler won't do it automatically; looks like it requires using inline assembly to call the MMX instructions.
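
For the record, here's a rough sketch of the vector version using the SSE2 intrinsics header instead of hand-written assembly. The function name is made up, and it assumes an x86/x86-64 target where <emmintrin.h> is available; it uses unaligned loads/stores so the malloc'd buffers don't need any special alignment:

  #include <emmintrin.h>  // SSE2 intrinsics: __m128i, _mm_and_si128, ...
  #include <stddef.h>
  #include <stdint.h>

  // AND 'words' 64-bit words of src into dst, two words (128 bits) per step.
  // Assumes 'words' is even, which holds for BUFSIZE = 0x20000.
  static void and_buffers_sse2(uint64_t *dst, const uint64_t *src, size_t words) {
      for (size_t i = 0; i < words; i += 2) {
          __m128i a = _mm_loadu_si128((const __m128i *) (dst + i));
          __m128i b = _mm_loadu_si128((const __m128i *) (src + i));
          _mm_storeu_si128((__m128i *) (dst + i), _mm_and_si128(a, b));  // PAND
      }
  }

Called as and_buffers_sse2(buf1, buf2, BUFSIZE). On x86-64 SSE2 is always available; on 32-bit x86 it needs -msse2.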

I bet CUDA or OpenCL could parallelize this like crazy. Too bad Linodes don't have a GPU!

@Stever:

Depending on the compiler optimizations that are enabled, this might be faster:

  uint64_t *p1;
  uint64_t *p2;

...

  p1 = buf1;
  p2 = buf2;

  for (i = 0; i < BUFSIZE; i++)
    *p1++ &= *p2++;


(stupid comment on BUFSIZE vs 64-bit pointer step removed because I didn't read the code properly…)

Edit: I'm an idiot. Forget I said anything….. :oops:

@funkytastic:

… I programmatically unrolled the entire loop …
-funroll-loops?

http://gcc.gnu.org/onlinedocs/gcc-4.5.3/gcc/Optimize-Options.html#Optimize-Options

@Azathoth:

-funroll-loops?

http://gcc.gnu.org/onlinedocs/gcc-4.5.3/gcc/Optimize-Options.html#Optimize-Options
Wow, that's so much better than the kludge I had going. It runs four times as fast as my pre-unrolled behemoth, and doesn't take 30 seconds for gcc to compile! Can't believe I didn't know about that. Thanks.
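
In case it helps anyone who finds this later, the flag just goes on the normal gcc command line next to the optimization level, e.g. (file name made up):

  gcc -O3 -funroll-loops -o andbufs andbufs.c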

@pclissold:

funroll-loops.info
How informative. :)
