2007-09-08

Magic popcount (popcnt) command

From Frank de Groot blog:
Every serious hacker sooner or later needs the popcount instruction.
This "population count" instruction counts the set bits in a register, and is so useful that the NSA demands that all computers they purchase implement it in hardware.


But this command is not present at x86 architecture. AMD64 documentation (p. 188) supports popcount extension, but I never heard of any processor that implements this feature.

You may wonder why would one need popcount instruction. I needed it for a very simple task. I wanted to change data structure from bitfield to list of bit positions. Basically I needed something like this:

0x1001 -> [0, 12]
0xF000 -> [12,13,14,15]
The simplest solution is to iterate over every bit and check if it's set. More complicated algorithm could count number of set bits and index possible results using this number. I wonder if that approach is faster. It's not so simple to say, because low level performance issues could have great role in this problem.

I wonder what is the fastest possible method for this problem. I bet that popcount instruction is the key to efficient implementation.

I found some interesting popcount implementations in software.
GCC approach, sum bits for every byte:
const UQItype __popcount_tab[] =
{
0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,
1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,4,5,5,6,5,6,6,7,5,6,6,7,6,7,7,8,
};

int __popcountSI2 (UWtype x)
{
UWtype i, ret = 0;
for (i = 0; i < W_TYPE_SIZE; i += 8)
ret += __popcount_tab[(x >> i) & 0xff];
return ret;
}
Method based on subtraction, from GnuLib:

int popcount(unsigned int x){
int pop;
for (pop = 0; x != 0; pop++)
x &= x - 1;
return pop;
}
AMD64 Optimization Guide (p. 137): (they have also mmx version)
unsigned int popcount(unsigned int v) 
{
unsigned int retVal;
__asm {
MOV EAX, [v] ;v
MOV EDX, EAX ;v
SHR EAX, 1 ;v >> 1
AND EAX, 055555555h ;(v >> 1) & 0x55555555
SUB EDX, EAX ;w = v - ((v >> 1) & 0x55555555)
MOV EAX, EDX ;w
SHR EDX, 2 ;w >> 2
AND EAX, 033333333h ;w & 0x33333333
AND EDX, 033333333h ;(w >> 2) & 0x33333333
ADD EAX, EDX ;x = (w & 0x33333333) + ((w >> 2) & 0x33333333)
MOV EDX, EAX ;x
SHR EAX, 4 ;x >> 4
ADD EAX, EDX ;x + (x >> 4)
AND EAX, 00F0F0F0Fh ;y = (x + (x >> 4) & 0x0F0F0F0F)
IMUL EAX, 001010101h ;y * 0x01010101
SHR EAX, 24 ;population count = (y * 0x01010101) >> 24
MOV retVal, EAX ;store result
}
return (retVal);
}
There is also interesting topic aboutpopcount at programming.reddit.com. Yet another source, John-Mark Gurney tips about popcount implementaion.


6 comments:

AMB said...

This was extremely helpful. Helped me both simplify and speed up my pop count for an absurdly large hash table.

Thanks for the awesome article!

Regards,
Aaron M Brown

majek said...

http://www.reddit.com/r/programming/comments/84sht/fast_bit_couting_algorithms/

Eddie said...

Hi! -- Majek's, I am who said "hi" yesterday.

I have learned that newer processors (AMD's Barcelona and Intel's Processors) implement a processor instruction called POPCNT. Furthermore, you can use it in gcc through the builtins __builtin_popcount [and ...countl and ...countll depending on the size of the argument] if you enable -msse4.2 [See here] It works in my Nehalem, but I don't know if it also works in Barcelona ('cos in AMD processors popcnt belongs to sse5).

Also, the book "Beautiful Code" contains a fascinating chapter on this function.

Inspired by it, I programmed a routine for the population count of a sizeable array, in which "popcount" is the basic building block and "csa" implements a "Carry Save Adder" that reduces the number of actual popcounts by 1/3 using simple bit instructions (the explanation of why this works is in the book).

My homage to a fellow blogger whose blog name is "POPCNT"

inline void csa(u64 &high, u64 &low, u64 a, u64 b, u64 c) {
  u64 axb = a ^ b;
  low = axb ^ c;
  high = (a & b) | (axb & c);
}

inline unsigned array_popcount_csa8(u64 *base, int count) {
  int i;
  u64 units, duos, quads, v, w, x, y, z;
  unsigned rv = 0;
  units = duos = quads = 0;
  for(i = 7; i < count; i += 8) {
    // 1/units 2/duos 4/quads F/v, w, x, y, z
    csa(v, w, units, base[i - 7], base[i - 6]);
    // 1/w 2/v, duos 4/quads F/units, x, y, z
    csa(x, units, w, base[i - 5], base[i - 4]);
    // 1/units 2/x, v, duos 4/quads F/w, y, z
      csa(w, y, x, v, duos);
      // 1/units 2/y 4/w, quads F/duos, v, x, z
    csa(duos, v, units, base[i - 3], base[i - 2]);
    // 1/v 2/duos, y 4/w, quads F/units, x, z
    csa(x, units, v, base[i - 1], base[i]);
    // 1/units 2/x, duos, y 4/w, quads F/v, z
      csa(z, v, x, duos, y);
      // 1/units 2/v 4/z, w, quads F/duos, x, y
        csa(x, y, z, w, quads);
        // 1/units 2/v 4/y F/duos, w, z
    duos = v;
    quads = y;
    rv += popcount(x);
  }
  rv = rv * 8 + popcount(quads) * 4 + popcount(duos) * 2 + popcount(units);
  i -= 7;
  if(i < count) rv += array_popcount_csa4(base + i, count - i);
  return rv;
}

Mike Blaszczak said...

I don't think the approach of finding the number of bits set and then looking at the possible number of bits works out for your problem. In a 16 bit word, there are 1728 ways to have 4 bits set, for example. POPCNT will tell you that you have 4 bits set -- now what? You have to figure out which of the possible 1728 answers is the one you want. [1, 2, 3, 4], [1, 2, 3, 5], [1, 2, 3, 6] ... [13, 14, 15, 16]? That's a lot of testing!

POPCNT is interesting, but I don't think it helps you too much with your problem.

I would suggest that your problem is best solved with divide and conquer.


#include "stdio.h"

void FindBitsSetNibble( int nOffset, int n, int* pCountOfBitsSet, int * pBitsSet )
{
if ( n & 1 ) { pBitsSet[(*pCountOfBitsSet)++] = nOffset; }
if ( n & 2 ) { pBitsSet[(*pCountOfBitsSet)++] = nOffset + 1; }
if ( n & 4 ) { pBitsSet[(*pCountOfBitsSet)++] = nOffset + 2; }
if ( n & 8 ) { pBitsSet[(*pCountOfBitsSet)++] = nOffset + 3; }
}

void FindBitsSetWord16( int n, int* pCountOfBitsSet, int * pBitsSet )
{
*pCountOfBitsSet = 0;
FindBitsSetNibble( 0, n & 0x000F, pCountOfBitsSet, pBitsSet );
FindBitsSetNibble( 4, (n & 0x00F0) >> 4, pCountOfBitsSet, pBitsSet );
FindBitsSetNibble( 8, (n & 0x0F00) >> 8, pCountOfBitsSet, pBitsSet );
FindBitsSetNibble( 12, (n & 0xF000) >> 12, pCountOfBitsSet, pBitsSet );
return;
}

void PrintBitsSet( int n )
{
int nBitsSet[16];
int nCountOfBitsSet;

printf( "0x%4.4X -> ", n );
FindBitsSetWord16( n, &nCountOfBitsSet, nBitsSet );
for ( int n = 0; n < nCountOfBitsSet; n++ )
printf( "%s%d", (n == 0) ? "" : ", ", nBitsSet[n] );
printf("\n");
}

int main()
{
PrintBitsSet( 0x1001 );
PrintBitsSet( 0xF000 );

return 0;
}


Even in the inner nibble function, POPCNT doesn't help you because there are too many possibilities. You could, instead of doing conditionals, do a look up into an array of initialized lists, though.

Paul Dent said...

I just saw the popcnt instruction in SSE4 so googled it to see if it meant what I thought it meant. It is a very important instruction. I have seen it on other machine architectures years ago. It was called "collate 1's" then.

It is extremely useful for correlating two words or bitstreams. You XOR them, then "collate 1's". The result is the number of bit disagreements between the bitstreams. Very useful for cryptoanalysis and error correction decoding.

Paul Dent said...

I just saw the popcnt instruction in SSE4 so googled it to see if it meant what I thought it meant. It is a very important instruction. I have seen it on other machine architectures years ago. It was called "collate 1's" then.

It is extremely useful for correlating two words or bitstreams. You XOR them, then "collate 1's". The result is the number of bit disagreements between the bitstreams. Very useful for cryptoanalysis and error correction decoding.