What I want to do is take a 64-bit unsigned integer consisting of pairs of bits and create from it a 32-bit integer whose bits are 0 if both bits in the corresponding pair are 0, and 1 otherwise. In other words, convert something that looks like:
01 00 10 11
into something that looks like this
1 0 1 1
The two obvious solutions are either a brute-force loop or a lookup table for each byte (then do eight lookups and combine them into the final result with ORs and bit shifts), but I'm sure there should be an efficient means of bit-twiddling this. I will be doing this for 64-bit integers in C++, but if anyone knows of efficient ways to do this for shorter integers, I'm sure I can figure out how to scale it up.
7 Answers
#1
54
Here is a portable C++ implementation. It seems to work during my brief testing. The deinterleave code is based on this SO question.
uint64_t calc(uint64_t n)
{
    // (odd | even)
    uint64_t x = (n & 0x5555555555555555ull) | ((n & 0xAAAAAAAAAAAAAAAAull) >> 1);
    // deinterleave
    x = (x | (x >> 1)) & 0x3333333333333333ull;
    x = (x | (x >> 2)) & 0x0F0F0F0F0F0F0F0Full;
    x = (x | (x >> 4)) & 0x00FF00FF00FF00FFull;
    x = (x | (x >> 8)) & 0x0000FFFF0000FFFFull;
    x = (x | (x >> 16)) & 0x00000000FFFFFFFFull;
    return x;
}
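As a quick sanity check (my addition, not part of the original answer), the example from the question reproduces the expected result when calc() above is in the same translation unit:

#include <cassert>
#include <cstdint>

int main()
{
    // pairs 01 00 10 11  ->  1 0 1 1
    assert(calc(0b01001011ull) == 0b1011u);
    return 0;
}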
gcc, clang, and msvc all compile this down to about 30 instructions.
From the comments, there is a modification that can be made.
- Change the first line to use a single bitmask operation to select only the "odd" bits.
The possibly (?) improved code is:
uint64_t calc(uint64_t n)
{
    // (odd | even)
    uint64_t x = (n | (n >> 1)) & 0x5555555555555555ull; // single bits
    // ... the rest is the same deinterleave as above
    x = (x | (x >> 1)) & 0x3333333333333333ull;  // bit pairs
    x = (x | (x >> 2)) & 0x0F0F0F0F0F0F0F0Full;  // nibbles
    x = (x | (x >> 4)) & 0x00FF00FF00FF00FFull;  // octets
    x = (x | (x >> 8)) & 0x0000FFFF0000FFFFull;  // halfwords
    x = (x | (x >> 16)) & 0x00000000FFFFFFFFull; // words
    return x;
}
#2
43
Probably the fastest solution for the x86 architecture with the BMI2 instruction set:
#include <stdint.h>
#include <x86intrin.h>
uint32_t calc (uint64_t a)
{
    return _pext_u64(a, 0x5555555555555555ull) |
           _pext_u64(a, 0xaaaaaaaaaaaaaaaaull);
}
This compiles to 5 instructions total.
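If the same binary also has to run on CPUs without BMI2, one option is to select an implementation at runtime. A minimal sketch, assuming GCC or clang on x86-64 (the function names are mine, and the fallback is the portable version from answer #1):

#include <stdint.h>
#include <x86intrin.h>

// portable fallback (answer #1)
static uint32_t calc_portable(uint64_t n)
{
    uint64_t x = (n | (n >> 1)) & 0x5555555555555555ull;
    x = (x | (x >> 1))  & 0x3333333333333333ull;
    x = (x | (x >> 2))  & 0x0F0F0F0F0F0F0F0Full;
    x = (x | (x >> 4))  & 0x00FF00FF00FF00FFull;
    x = (x | (x >> 8))  & 0x0000FFFF0000FFFFull;
    x = (x | (x >> 16)) & 0x00000000FFFFFFFFull;
    return (uint32_t)x;
}

__attribute__((target("bmi2")))   // lets us use _pext_u64 without compiling everything with -mbmi2
static uint32_t calc_bmi2(uint64_t a)
{
    return (uint32_t)(_pext_u64(a, 0x5555555555555555ull) |
                      _pext_u64(a, 0xaaaaaaaaaaaaaaaaull));
}

uint32_t calc_dispatch(uint64_t a)
{
    static const int have_bmi2 = __builtin_cpu_supports("bmi2");  // checked once
    return have_bmi2 ? calc_bmi2(a) : calc_portable(a);
}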
#3
14
If you do not have pext and you still want to do this better than the trivial way, then this extraction can be expressed as a logarithmic number of bit moves (logarithmic in the word length, if you generalize it):
// OR adjacent bits, destroys the odd bits but it doesn't matter
x = (x | (x >> 1)) & rep8(0x55);
// gather the even bits with delta swaps
x = bitmove(x, rep8(0x44), 1); // make pairs
x = bitmove(x, rep8(0x30), 2); // make nibbles
x = bitmove(x, rep4(0x0F00), 4); // make bytes
x = bitmove(x, rep2(0x00FF0000), 8); // make words
res = (uint32_t)(x | (x >> 16)); // final step is simpler
With:
uint64_t bitmove(uint64_t x, uint64_t mask, unsigned step) {
    // move (not just copy) the masked bits down, so stale source bits can't leak into a later step's destination
    return (x & ~mask) | ((x & mask) >> step);
}
repk is just so I could write shorter constants: rep8(0x44) = 0x4444444444444444, etc.
Also, if you do have pext, you can do it with only one of them, which is probably faster and at least shorter:
_pext_u64(x | (x >> 1), rep8(0x55));
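For reference, here is the pext-free version above assembled into something compilable, a sketch in which the rep* helpers, the explicit types, and the function name are my additions:

#include <stdint.h>

// repeat a short pattern across all 64 bits
#define rep8(x) ((uint64_t)(x) * 0x0101010101010101ull)   // rep8(0x44)       = 0x4444444444444444
#define rep4(x) ((uint64_t)(x) * 0x0001000100010001ull)   // rep4(0x0F00)     = 0x0F000F000F000F00
#define rep2(x) ((uint64_t)(x) * 0x0000000100000001ull)   // rep2(0x00FF0000) = 0x00FF000000FF0000

static inline uint64_t bitmove(uint64_t x, uint64_t mask, unsigned step)
{
    return (x & ~mask) | ((x & mask) >> step);  // move (not copy) the masked bits down
}

uint32_t calc_bitmove(uint64_t x)
{
    // OR adjacent bits, destroys the odd bits but it doesn't matter
    x = (x | (x >> 1)) & rep8(0x55);
    // gather the even bits with delta swaps
    x = bitmove(x, rep8(0x44), 1);        // make pairs
    x = bitmove(x, rep8(0x30), 2);        // make nibbles
    x = bitmove(x, rep4(0x0F00), 4);      // make bytes
    x = bitmove(x, rep2(0x00FF0000), 8);  // make 16-bit words
    return (uint32_t)(x | (x >> 16));     // final step is simpler
}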
#4
10
Okay, let's make this more hacky then (might be buggy):
uint64_t x;
uint64_t even_bits = x & 0xAAAAAAAAAAAAAAAAull;
uint64_t odd_bits = x & 0x5555555555555555ull;
Now, my original solution did this:
// wrong
even_bits >> 1;
unsigned int solution = even_bits | odd_bits;
However, as JackAidley pointed out, while this aligns the bits together, it doesn't remove the spaces from the middle!
Thankfully, we can use the very helpful _pext instruction from the BMI2 instruction set.
u64 _pext_u64(u64 a, u64 m): Extract bits from a at the corresponding bit locations specified by mask m to contiguous low bits in dst; the remaining upper bits in dst are set to zero.
solution = _pext_u64(solution, 0x5555555555555555ull); // mask selects the low bit of every pair
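Putting the pieces together, a compilable sketch of this corrected idea (the function name is mine; requires BMI2, e.g. compile with -mbmi2):

#include <stdint.h>
#include <immintrin.h>

uint32_t pair_or_pext(uint64_t x)
{
    uint64_t even_bits = x & 0xAAAAAAAAAAAAAAAAull;    // high bit of each pair
    uint64_t odd_bits  = x & 0x5555555555555555ull;    // low bit of each pair
    uint64_t merged    = (even_bits >> 1) | odd_bits;  // each pair's OR, sitting in the pair's low bit
    return (uint32_t)_pext_u64(merged, 0x5555555555555555ull);  // squeeze out the zero gaps
}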
Alternatively, instead of using & and >> to separate out the bits, you might just use _pext twice on the original number with the two masks (which would split it up into two contiguous 32-bit numbers), and then simply OR the results.
If you don't have access to BMI2, though, I'm pretty sure the gap removal would still involve a loop; a bit simpler one than your original idea, perhaps.
#5
7
Slight improvement over the LUT approach (4 lookups instead of 8):
Compute the bitwise OR of each pair and clear every other bit. Then intertwine the bits of pairs of bytes to yield four bytes. Finally, reorder the bits in the four bytes (mapped onto the quadword) by means of a 256-entry lookup table:
Q= (Q | (Q << 1)) & 0xAAAAAAAAAAAAL; // OR in pairs
Q|= Q >> 9; // Intertwine 4 words into 4 bytes
B0= LUT[B0]; B1= LUT[B2]; B2= LUT[B4]; B3= LUT[B6]; // Rearrange bits in bytes
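A compilable sketch of this idea (my code, not the answerer's): the mask is widened to cover all 64 bits, and the 256-entry table is generated at startup instead of being written out. init_lut() has to be called once before the first use of calc_lut().

#include <stdint.h>

static uint8_t LUT[256];

// Undo the bit order produced by "Q |= Q >> 9": output bit j of each byte
// comes from input bit (2*j + 1) % 9.
static void init_lut(void)
{
    for (int b = 0; b < 256; ++b) {
        uint8_t out = 0;
        for (int j = 0; j < 8; ++j)
            if ((b >> ((2 * j + 1) % 9)) & 1)
                out |= (uint8_t)(1u << j);
        LUT[b] = out;
    }
}

uint32_t calc_lut(uint64_t q)
{
    q = (q | (q << 1)) & 0xAAAAAAAAAAAAAAAAull;  // OR each pair into its high bit
    q |= q >> 9;                                 // fold each 16-bit word's high byte into its low byte
    uint32_t r = 0;
    for (int i = 0; i < 4; ++i)                  // 4 lookups: the low byte of each 16-bit word
        r |= (uint32_t)LUT[(q >> (16 * i)) & 0xFF] << (8 * i);
    return r;
}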
#6
6
The hard part seems to be packing the bits after the ORing. The ORing is done by:
ored = (x | (x>>1)) & 0x5555555555555555;
(assuming large enough ints so we don't have to use suffixes). Then we can pack them in steps, first two-by-two, then four-by-four, and so on:
pack2 = ((ored*3) >> 1) & 0x3333333333333333;
pack4 = ((pack2*5) >> 2) & 0x0F0F0F0F0F0F0F0F;
pack8 = ((pack4*17) >> 4) & 0x00FF00FF00FF00FF;
pack16 = ((pack8*257) >> 8) & 0x0000FFFF0000FFFF;
pack32 = ((pack16*65537) >> 16) & 0xFFFFFFFF;
// (or cast to uint32_t instead of the final & 0xFFF...)
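Wrapped into a function (a sketch; the explicit types, suffixes, and function name are mine):

#include <stdint.h>

uint32_t calc_mul(uint64_t x)
{
    uint64_t ored   = (x | (x >> 1)) & 0x5555555555555555ull;
    uint64_t pack2  = ((ored   * 3)   >> 1) & 0x3333333333333333ull;
    uint64_t pack4  = ((pack2  * 5)   >> 2) & 0x0F0F0F0F0F0F0F0Full;
    uint64_t pack8  = ((pack4  * 17)  >> 4) & 0x00FF00FF00FF00FFull;
    uint64_t pack16 = ((pack8  * 257) >> 8) & 0x0000FFFF0000FFFFull;
    return (uint32_t)((pack16 * 65537) >> 16);
}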
What happens in the packing is that the multiplication combines the data with a shifted copy of itself. In your example, in the first multiplication we would have (I denote zeros that come from the masking in ored as o, and the other 0s, from the original data, as 0):
  o1o0o1o1
x       11
----------
  o1o0o1o1
 o1o0o1o1
----------
 o11001111
   ^^  ^^
  o10oo11o  <- these are the bits we want to keep.
We could have done that by ORing as well:
ored = (ored | (ored>>1)) & 0x3333333333333333;
ored = (ored | (ored>>2)) & 0x0F0F0F0F0F0F0F0F;
ored = (ored | (ored>>4)) & 0x00FF00FF00FF00FF;
ored = (ored | (ored>>8)) & 0x0000FFFF0000FFFF;
ored = (ored | (ored>>16)) & 0xFFFFFFFF;
// ored = ((uint32_t)ored | (uint32_t)(ored>>16)); // helps some compilers make better code, esp. on x86
#7
1
I made some vectorized versions (godbolt link still with some big design-notes comments) and did some benchmarks when this question was new. I was going to spend more time on it, but never got back to it. Posting what I have so I can close this browser tab. >.< Improvements welcome.
I don't have a Haswell I could test on, so I couldn't benchmark the pext version against this. I'm sure it's faster, though, since it's only 4 fast instructions.
*** Sandybridge (i5-2500k, so no hyperthreading)
*** 64bit, gcc 5.2 with -O3 -fno-tree-vectorize results:
TODO: update benchmarks for latest code changes
total cycles, and insn/clock, for the test-loop
This measures only throughput, not latency,
and a bottleneck on one execution port might make a function look worse in a microbench
than it will do when mixed with other code that can keep the other ports busy.
Lower numbers in the first column are better:
these are total cycle counts in Megacycles, and correspond to execution time
but they take frequency scaling / turbo out of the mix.
(We're not cache / memory bound at all, so low core clock = fewer cycles for cache miss doesn't matter).
AVX no AVX
887.519Mc 2.70Ipc 887.758Mc 2.70Ipc use_orbits_shift_right
1140.68Mc 2.45Ipc 1140.47Mc 2.46Ipc use_orbits_mul (old version that right-shifted after each)
718.038Mc 2.79Ipc 716.452Mc 2.79Ipc use_orbits_x86_lea
767.836Mc 2.74Ipc 1027.96Mc 2.53Ipc use_orbits_sse2_shift
619.466Mc 2.90Ipc 816.698Mc 2.69Ipc use_orbits_ssse3_shift
845.988Mc 2.72Ipc 845.537Mc 2.72Ipc use_orbits_ssse3_shift_scalar_mmx (gimped by stupid compiler)
583.239Mc 2.92Ipc 686.792Mc 2.91Ipc use_orbits_ssse3_interleave_scalar
547.386Mc 2.92Ipc 730.259Mc 2.88Ipc use_orbits_ssse3_interleave
The fastest (for throughput in a loop) with AVX is orbits_ssse3_interleave
The fastest (for throughput in a loop) without AVX is orbits_ssse3_interleave_scalar
but orbits_x86_lea comes very close.
AVX for non-destructive 3-operand vector insns helps a lot
Maybe a bit less important on IvB and later, where mov-elimination handles mov uops at register-rename time
// Tables generated with the following commands:
// for i in avx.perf{{2..4},{6..10}};do awk '/cycles / {c=$1; gsub(",", "", c); } /insns per cy/ {print c / 1000000 "Mc " $4"Ipc"}' *"$i"*;done | column -c 50 -x
// Include 0 and 1 for hosts with pextr
// 5 is omitted because it's not written
Almost certainly the best version (with BMI2) is:
#include <stdint.h>
#include <immintrin.h>

#define LOBITS64 0x5555555555555555ull
#define HIBITS64 0xaaaaaaaaaaaaaaaaull

uint32_t orbits_1pext (uint64_t a) {
    // a|a<<1 compiles more efficiently on x86 than a|a>>1, because of LEA for non-destructive left-shift
    return _pext_u64( a | a<<1, HIBITS64);
}
This compiles to:
lea rax, [rdi+rdi]
or rdi, rax
movabs rax, -6148914691236517206
pext rax, rdi, rax
ret
So it's only 4 uops, and the critical-path latency is 5c = 3 (pext) + 1 (or) + 1 (lea) on Intel Haswell. The throughput should be one result per cycle (with no loop overhead or loading/storing). The mov imm for the constant can be hoisted out of a loop, though, since it isn't destroyed, so throughput-wise we only need 3 fused-domain uops per result.
The mov r, imm64 isn't ideal. (A 1-uop broadcast of a 32-bit or 8-bit immediate to a 64-bit reg would be ideal, but there's no such instruction.) Having the constant in data memory is an option, but inlining it in the instruction stream is nice. A 64b constant takes a lot of uop-cache space, which makes the version that does pext with two different masks even worse. Generating one mask from the other with a not could help with that, though: movabs / pext / not / pext / or, but that's still 5 insns compared to the 4 enabled by the lea trick.
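In intrinsics, that movabs / pext / not / pext / or sequence corresponds to something like the following sketch (one mask constant, with its complement generated on the fly; the function name is mine):

#include <stdint.h>
#include <immintrin.h>

uint32_t orbits_2pext_1mask(uint64_t a)
{
    uint64_t m = 0x5555555555555555ull;   // movabs
    return (uint32_t)(_pext_u64(a, m)     // pext: gather the low bit of each pair
                    | _pext_u64(a, ~m));  // not + pext + or: gather the high bits and combine
}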
The best version (with AVX) is:
#include <immintrin.h>
/* Yves Daoust's idea, operating on nibbles instead of bytes:
original:
Q= (Q | (Q << 1)) & 0xAAAAAAAAAAAAL // OR in pairs
Q|= Q >> 9; // Intertwine 4 words into 4 bytes
B0= LUT[B0]; B1= LUT[B2]; B2= LUT[B4]; B3= LUT[B6]; // Rearrange bits in bytes
To operate on nibbles,
Q= (Q | (Q << 1)) & 0xAAAAAAAAAAAAL // OR in pairs, same as before
Q|= Q>>5 // Intertwine 8 nibbles into 8 bytes
// pshufb as a LUT to re-order the bits within each nibble (to undo the interleave)
// right-shift and OR to combine nibbles
// pshufb as a byte-shuffle to put the 4 bytes we want into the low 4
*/
uint32_t orbits_ssse3_interleave(uint64_t scalar_a)
{
    // do some of this in GP regs if not doing two 64b elements in parallel.
    // esp. beneficial for AMD Bulldozer-family, where integer and vector ops don't share execution ports
    // but VEX-encoded SSE saves mov instructions
    __m128i a = _mm_cvtsi64_si128(scalar_a);

    // element size doesn't matter, any bits shifted out of element boundaries would have been masked off anyway.
    __m128i lshift = _mm_slli_epi64(a, 1);
    lshift = _mm_or_si128(lshift, a);
    lshift = _mm_and_si128(lshift, _mm_set1_epi32(0xaaaaaaaaUL));
    // a      = bits: h  g  f  e  d  c  b  a   (same thing in other bytes)
    // lshift =       hg 0  fe 0  dc 0  ba 0
    //        =       s  0  r  0  q  0  p  0

    __m128i rshift = _mm_srli_epi64(lshift, 5);  // again, element size doesn't matter, we're keeping only the low nibbles
    // rshift = s 0 r 0 q 0 p 0  (the last zero ORs with the top bit of the low nibble in the next byte over)
    __m128i nibbles = _mm_or_si128(rshift, lshift);
    nibbles = _mm_and_si128(nibbles, _mm_set1_epi8(0x0f));  // have to zero the high nibbles: the sign bit affects pshufb
    // nibbles  = 0 0 0 0 q s p r
    // pshufb -> 0 0 0 0 s r q p
    const __m128i BITORDER_NIBBLE_LUT = _mm_setr_epi8(  // setr: first arg goes in the low byte, indexed by 0b0000
        0b0000, 0b0100, 0b0001, 0b0101,
        0b1000, 0b1100, 0b1001, 0b1101,
        0b0010, 0b0110, 0b0011, 0b0111,
        0b1010, 0b1110, 0b1011, 0b1111);
    __m128i ord_nibbles = _mm_shuffle_epi8(BITORDER_NIBBLE_LUT, nibbles);

    // want 00 00 00 00 AB CD EF GH from:
    // ord_nibbles = 0A0B0C0D0E0F0G0H
    //                0A0B0C0D0E0F0G0 H (shifted out)
    __m128i merged_nibbles = _mm_or_si128(ord_nibbles, _mm_srli_epi64(ord_nibbles, 4));
    // merged_nibbles = 0A AB BC CD DE EF FG GH.  We want every other byte of this.
    //                   7  6  5  4  3  2  1  0
    // pshufb is the most efficient way.  Mask and then packuswb would work, but uses the shuffle port just like pshufb
    __m128i ord_bytes = _mm_shuffle_epi8(merged_nibbles, _mm_set_epi8(-1,-1,-1,-1, 14,12,10,8,
                                                                      -1,-1,-1,-1,  6, 4, 2,0));
    return _mm_cvtsi128_si32(ord_bytes);  // movd the low32 of the vector
    // _mm_extract_epi32(ord_bytes, 2);   // If operating on two inputs in parallel: SSE4.1 PEXTRD the result from the upper half of the reg.
}
The best version without AVX is a slight modification that only works with one input at a time, only using SIMD for the shuffling. In theory using MMX instead of SSE would make more sense, esp. if we care about first-gen Core2 where 64b pshufb is fast, but 128b pshufb is not single cycle. Anyway, compilers did a bad job with MMX intrinsics. Also, EMMS is slow.
// same as orbits_ssse3_interleave, but doing some of the math in integer regs. (non-vectorized)
// esp. beneficial for AMD Bulldozer-family, where integer and vector ops don't share execution ports
// VEX-encoded SSE saves mov instructions, so full vector is preferable if building with VEX-encoding
// Use MMX for Silvermont/Atom/Merom(Core2): pshufb is slow for xmm, but fast for MMX. Only 64b shuffle unit?
uint32_t orbits_ssse3_interleave_scalar(uint64_t scalar_a)
{
    uint64_t lshift = (scalar_a | scalar_a << 1);
    lshift &= HIBITS64;

    uint64_t rshift = lshift >> 5;
    // rshift = s 0 r 0 q 0 p 0  (the last zero ORs with the top bit of the low nibble in the next byte over)
    uint64_t nibbles_scalar = (rshift | lshift) & 0x0f0f0f0f0f0f0f0fULL;
    // have to zero the high nibbles: the sign bit affects pshufb
    __m128i nibbles = _mm_cvtsi64_si128(nibbles_scalar);
    // nibbles  = 0 0 0 0 q s p r
    // pshufb -> 0 0 0 0 s r q p
    const __m128i BITORDER_NIBBLE_LUT = _mm_setr_epi8(  // setr: first arg goes in the low byte, indexed by 0b0000
        0b0000, 0b0100, 0b0001, 0b0101,
        0b1000, 0b1100, 0b1001, 0b1101,
        0b0010, 0b0110, 0b0011, 0b0111,
        0b1010, 0b1110, 0b1011, 0b1111);
    __m128i ord_nibbles = _mm_shuffle_epi8(BITORDER_NIBBLE_LUT, nibbles);

    // want 00 00 00 00 AB CD EF GH from:
    // ord_nibbles = 0A0B0C0D0E0F0G0H
    //                0A0B0C0D0E0F0G0 H (shifted out)
    __m128i merged_nibbles = _mm_or_si128(ord_nibbles, _mm_srli_epi64(ord_nibbles, 4));
    // merged_nibbles = 0A AB BC CD DE EF FG GH.  We want every other byte of this.
    //                   7  6  5  4  3  2  1  0
    // pshufb is the most efficient way.  Mask and then packuswb would work, but uses the shuffle port just like pshufb
    __m128i ord_bytes = _mm_shuffle_epi8(merged_nibbles, _mm_set_epi8(0,0,0,0, 0,0,0,0, 0,0,0,0, 6,4,2,0));
    return _mm_cvtsi128_si32(ord_bytes);  // movd the low32 of the vector
}
Sorry for the mostly code-dump answer. I didn't feel it was worth spending a huge amount of time discussing things beyond what the comments already do. See http://agner.org/optimize/ for guides to optimizing for specific microarchitectures, and the x86 wiki for other resources.