如何有效地洗牌?

I need to shuffle a 16 bit unsigned integer in a way that the even indexes land in the lower byte, and the odd indexes land in the upper byte.

我需要对一个16位无符号整数进行洗牌，使偶数索引落在下字节，奇数索引落在上字节。

input:
fedcba98 76543210 (contiguously numbered)

output:
fdb97531 eca86420 (even and odd separated)

My code looks like this at the moment:

我的代码现在是这样的:

typedef unsigned short u16;

u16 segregate(u16 x)
{
    u16 g = (x & 0x0001);
    u16 h = (x & 0x0004) >> 1;
    u16 i = (x & 0x0010) >> 2;
    u16 j = (x & 0x0040) >> 3;
    u16 k = (x & 0x0100) >> 4;
    u16 l = (x & 0x0400) >> 5;
    u16 m = (x & 0x1000) >> 6;
    u16 n = (x & 0x4000) >> 7;

    u16 o = (x & 0x0002) << 7;
    u16 p = (x & 0x0008) << 6;
    u16 q = (x & 0x0020) << 5;
    u16 r = (x & 0x0080) << 4;
    u16 s = (x & 0x0200) << 3;
    u16 t = (x & 0x0800) << 2;
    u16 u = (x & 0x2000) << 1;
    u16 v = (x & 0x8000);

    return g | h | i | j | k | l | m | n | o | p | q | r | s | t | u | v;
}

I wonder if there is a more elegant solution than simply extracting and shifting each individual bit?

我想知道是否有比简单地提取和移动每个位更优雅的解决方案?

7 个解决方案

#1

There is a very convenient web resource that helps solving many bit permutation problems: Code generator for bit permutations. In this particular case feeding "0 2 4 6 8 10 12 14 1 3 5 7 9 11 13 15" to this page produces pretty fast code.

有一个非常方便的web资源可以帮助解决许多位置换问题:位置换的代码生成器。在这个特殊的例子中，给这个页面输入“0 2 4 6 8 10 12 14 13 5 7 9 11 13 15”可以生成非常快的代码。

Unfortunately this code generator cannot produce 64-bit code (though anyone could download sources and add this option). So if we need to perform 4 permutations in parallel using 64-bit instructions, we have to extend all involved bitmasks to 64 bits manually:

不幸的是，此代码生成器不能生成64位代码(尽管任何人都可以下载源代码并添加此选项)。因此，如果我们需要使用64位指令并行执行4个排列，我们必须手动将所有涉及的位掩码扩展到64位:

uint64_t bit_permute_step(uint64_t x, uint64_t m, unsigned shift) {
  uint64_t t;
  t = ((x >> shift) ^ x) & m;
  x = (x ^ t) ^ (t << shift);
  return x;
}

uint64_t segregate4(uint64_t x)
{ // generated by http://programming.sirrida.de/calcperm.php, extended to 64-bit
  x = bit_permute_step(x, 0x2222222222222222ull, 1);
  x = bit_permute_step(x, 0x0c0c0c0c0c0c0c0cull, 2);
  x = bit_permute_step(x, 0x00f000f000f000f0ull, 4);
  return x;
}

Level of parallelism could be increased even more (8 or 16 permutations at once) with SSE instructions. (And recent versions of gcc can vectorize this code automatically).

通过SSE指令，并行度可以增加更多(一次增加8或16个排列)。(而最近版本的gcc可以自动地对这些代码进行矢量化)。

If parallelism is not required and data cache is not extensively used by other parts of the program, better alternative would be to use lookup table. Various LUT approacehes are already discussed in other answers, still some more could be said here:

如果不需要并行性，并且程序的其他部分不广泛使用数据缓存，那么更好的选择是使用查找表。在其他答案中已经讨论了各种LUT方法，这里还可以说一些:

The first and the last bits of 16-bit word are never permuted, we need to shuffle only bits 1..14. So (if we want to perform the task with single LUT access) it is enough to have a LUT with 16K entries which means 32K of memory.
16位字的第一个和最后一个位永远不会被打乱，我们只需要将位1.. ..所以(如果我们想用单LUT访问来执行任务)拥有16K个条目的LUT就足够了，这意味着32K的内存。
We could combine table lookup and computation approaches. Two lookups in a single 256-byte table could shuffle each source byte separately. After this we only need to exchange two middle 4-bit nibbles. This allows to keep lookup table small, uses only 2 memory accesses, and needs not too much calculations (i.e. balances calculations and memory accesses).
我们可以结合表查找和计算方法。在一个256字节的表中进行两个查找可以分别对每个源字节进行洗牌。之后我们只需要交换两个中位4位的小块。这允许查找表保持小，只使用2个内存访问，并且不需要太多的计算(例如平衡计算和内存访问)。

Here is implementation of second approach:

以下是第二种方法的实施:

#define B10(x)          x+0x00,      x+0x10,      x+0x01,      x+0x11
#define B32(x)      B10(x+0x00), B10(x+0x20), B10(x+0x02), B10(x+0x22)
#define B54(x)      B32(x+0x00), B32(x+0x40), B32(x+0x04), B32(x+0x44)
uint8_t lut[256] = {B54(  0x00), B54(  0x80), B54(  0x08), B54(  0x88)};
#undef B54
#undef B32
#undef B10

uint_fast16_t segregateLUT(uint_fast16_t x)
{
  uint_fast16_t low = lut[x & 0x00ff];
  low |= low << 4;
  uint_fast16_t high = lut[x >> 8] << 4;
  high |= high << 4;
  return (low & 0x0f0f) | (high & 0xf0f0);
}

But fastest approach (if portability is not an issue) is using pext instruction from BMI2 instruction set as noted by Nils Pipenbrinck. With a pair of 64-bit pext we could perform 4 16-bit shuffles in parallel. Since pext instruction is intended exactly for this kind of bit permutations, this approach easily outperforms all others.

但是最快的方法(如果可移植性不是问题的话)是使用Nils Pipenbrinck指出的BMI2指令集的pext指令。使用一对64位pext，我们可以并行执行4个16位的切换。由于pext指令正是为这种位排列而设计的，因此这种方法很容易胜过其他所有方法。

#2

The table approach shown by others is the most portable version and is probably quite fast.

其他人展示的表格方法是最可移植的版本，而且可能非常快。

If you want to take advantage of special instruction sets there are some other options as well. For Intel Haswell and later for example the following approach can be used (requires the BMI2 instruction set extension):

如果您想利用特殊的指令集，还有一些其他的选项。对于Intel Haswell和稍后的例子，可以使用以下方法(需要BMI2指令集扩展):

unsigned segregate_bmi (unsigned arg)
{
  unsigned oddBits  = _pext_u32(arg,0x5555);
  unsigned evenBits = _pext_u32(arg,0xaaaa);
  return (oddBits | (evenBits << 8));
}

#3

You could use a 256-byte table for each byte of your 16-bit number, crafted so that your even/odd condition is satisfied. Hand-code the table entries (or use the algorithm you already have) to create the tables, and then the shuffling will be done at compile time. That would essentially be a translation table concept.

您可以为16位数字的每个字节使用256字节的表，这样可以使您的偶数/奇数条件得到满足。手工编写表项(或使用您已经拥有的算法)以创建表，然后在编译时进行转换。这实质上是一个平移表概念。

#4

You could use a 256-byte table for each byte of your 16-bit number, crafted so that your even/odd condition is satisfied.

您可以对16位数字的每个字节使用256字节的表，这样就可以满足偶数/奇数条件。

Ah yes, lookup tables to the rescue :) You can even do it with a single table and one extra shift:

啊，是的，查找表拯救:)你甚至可以做一个单表和一个额外的班次:

u16 every_other[256] = {
0x00, 0x01, 0x00, 0x01, 0x02, 0x03, 0x02, 0x03, 
0x00, 0x01, 0x00, 0x01, 0x02, 0x03, 0x02, 0x03, 
0x04, 0x05, 0x04, 0x05, 0x06, 0x07, 0x06, 0x07, 
0x04, 0x05, 0x04, 0x05, 0x06, 0x07, 0x06, 0x07, 
0x00, 0x01, 0x00, 0x01, 0x02, 0x03, 0x02, 0x03, 
0x00, 0x01, 0x00, 0x01, 0x02, 0x03, 0x02, 0x03, 
0x04, 0x05, 0x04, 0x05, 0x06, 0x07, 0x06, 0x07, 
0x04, 0x05, 0x04, 0x05, 0x06, 0x07, 0x06, 0x07, 
0x08, 0x09, 0x08, 0x09, 0x0a, 0x0b, 0x0a, 0x0b, 
0x08, 0x09, 0x08, 0x09, 0x0a, 0x0b, 0x0a, 0x0b, 
0x0c, 0x0d, 0x0c, 0x0d, 0x0e, 0x0f, 0x0e, 0x0f, 
0x0c, 0x0d, 0x0c, 0x0d, 0x0e, 0x0f, 0x0e, 0x0f, 
0x08, 0x09, 0x08, 0x09, 0x0a, 0x0b, 0x0a, 0x0b, 
0x08, 0x09, 0x08, 0x09, 0x0a, 0x0b, 0x0a, 0x0b, 
0x0c, 0x0d, 0x0c, 0x0d, 0x0e, 0x0f, 0x0e, 0x0f, 
0x0c, 0x0d, 0x0c, 0x0d, 0x0e, 0x0f, 0x0e, 0x0f, 
0x00, 0x01, 0x00, 0x01, 0x02, 0x03, 0x02, 0x03, 
0x00, 0x01, 0x00, 0x01, 0x02, 0x03, 0x02, 0x03, 
0x04, 0x05, 0x04, 0x05, 0x06, 0x07, 0x06, 0x07, 
0x04, 0x05, 0x04, 0x05, 0x06, 0x07, 0x06, 0x07, 
0x00, 0x01, 0x00, 0x01, 0x02, 0x03, 0x02, 0x03, 
0x00, 0x01, 0x00, 0x01, 0x02, 0x03, 0x02, 0x03, 
0x04, 0x05, 0x04, 0x05, 0x06, 0x07, 0x06, 0x07, 
0x04, 0x05, 0x04, 0x05, 0x06, 0x07, 0x06, 0x07, 
0x08, 0x09, 0x08, 0x09, 0x0a, 0x0b, 0x0a, 0x0b, 
0x08, 0x09, 0x08, 0x09, 0x0a, 0x0b, 0x0a, 0x0b, 
0x0c, 0x0d, 0x0c, 0x0d, 0x0e, 0x0f, 0x0e, 0x0f, 
0x0c, 0x0d, 0x0c, 0x0d, 0x0e, 0x0f, 0x0e, 0x0f, 
0x08, 0x09, 0x08, 0x09, 0x0a, 0x0b, 0x0a, 0x0b, 
0x08, 0x09, 0x08, 0x09, 0x0a, 0x0b, 0x0a, 0x0b, 
0x0c, 0x0d, 0x0c, 0x0d, 0x0e, 0x0f, 0x0e, 0x0f, 
0x0c, 0x0d, 0x0c, 0x0d, 0x0e, 0x0f, 0x0e, 0x0f};

u16 segregate(u16 x)
{
    return every_other[x & 0xff]
         | every_other[(x >> 8)] << 4
         | every_other[(x >> 1) & 0xff] << 8
         | every_other[(x >> 9)] << 12;
}

#5

Tables. But generate them at compile time!

表。但是在编译时生成它们!

namespace details {
  constexpr uint8_t bit( unsigned byte, unsigned n ) {
    return (byte>>n)&1;
  }
  constexpr uint8_t even_bits(uint8_t byte) {
    return bit(byte, 0) | (bit(byte, 2)<<1) | (bit(byte, 4)<<2) | (bit(byte, 6)<<3);
  }
  constexpr uint8_t odd_bits(uint8_t byte) {
    return even_bits(byte/2);
  }
  template<unsigned...>struct indexes{using type=indexes;};
  template<unsigned Max,unsigned...Is>struct make_indexes:make_indexes<Max-1,Max-1,Is...>{};
  template<unsigned...Is>struct make_indexes<0,Is...>:indexes<Is...>{};
  template<unsigned Max>using make_indexes_t=typename make_indexes<Max>::type;

  template<unsigned...Is>
  constexpr std::array< uint8_t, 256 > even_bit_table( indexes<Is...> ) {
    return { even_bits(Is)... };
  }
  template<unsigned...Is>
  constexpr std::array< uint8_t, 256 > odd_bit_table( indexes<Is...> ) {
    return { odd_bits(Is)... };
  }
  constexpr std::array< uint8_t, 256 > even_bit_table() {
    return even_bit_table( make_indexes_t<256>{} );
  }
  constexpr std::array< uint8_t, 256 > odd_bit_table() {
    return odd_bit_table( make_indexes_t<256>{} );
  }

  static constexpr auto etable = even_bit_table();
  static constexpr auto otable = odd_bit_table();
}

uint8_t constexpr even_bits( uint16_t in ) {
  return details::etable[(uint8_t)in] | ((details::etable[(uint8_t)(in>>8)])<<4);
}
uint8_t constexpr odd_bits( uint16_t in ) {
  return details::otable[(uint8_t)in] | ((details::otable[(uint8_t)(in>>8)])<<4);
}

live example

生活的例子

#6

your answer to the even and odd bits shuffle for 64 bits is not accurate. To extend the 16 bit solution to 64 bit solution, we need not only to extend the masks, but also cover the swapping interval from 1 all the way to 16:

你对偶数和奇数位洗牌的答案是64位不准确。为了将16位的解决方案扩展到64位的解决方案，我们不仅需要扩展掩码，还需要覆盖从1到16的交换间隔:

x = bit_permute_step(x, 0x2222222222222222, 1);
x = bit_permute_step(x, 0x0c0c0c0c0c0c0c0c, 2);
x = bit_permute_step(x, 0x00f000f000f000f0, 4);
**x = bit_permute_step(x, 0x0000ff000000ff00, 8);
x = bit_permute_step(x, 0x00000000ffff0000, 16);**

#7

In favour of being short:

赞成简短:

unsigned short segregate(unsigned short x)
{
    x = (x & 0x9999) | (x >> 1 & 0x2222) | (x << 1 & 0x4444);
    x = (x & 0xC3C3) | (x >> 2 & 0x0C0C) | (x << 2 & 0x3030);
    x = (x & 0xF00F) | (x >> 4 & 0x00F0) | (x << 4 & 0x0F00);
    return x;
}

#1

uint64_t bit_permute_step(uint64_t x, uint64_t m, unsigned shift) {
  uint64_t t;
  t = ((x >> shift) ^ x) & m;
  x = (x ^ t) ^ (t << shift);
  return x;
}

uint64_t segregate4(uint64_t x)
{ // generated by http://programming.sirrida.de/calcperm.php, extended to 64-bit
  x = bit_permute_step(x, 0x2222222222222222ull, 1);
  x = bit_permute_step(x, 0x0c0c0c0c0c0c0c0cull, 2);
  x = bit_permute_step(x, 0x00f000f000f000f0ull, 4);
  return x;
}