
Time: 2022-12-05 10:00:05

1) On a 32-bit CPU, is it faster to access an array of 32 boolean values or to access the 32 bits within one word? (Assume we want to check the value of the Nth element and can use either a bit-mask (Nth bit is set) or the integer N as an array index.)

It seems to me that the array would be faster because all common computer architectures natively work at the word level (32 bits, 64 bits, etc., processed in parallel) and accessing the sub-word bits takes extra work.


I know different compilers will represent things differently, but it seems that the underlying hardware architecture would dictate the answer. Or does the answer depend on the language and compiler?


And, 2) Is the speed answer reversed if this array represents a state that I pass between client and server? This question came to mind when reading question "How use bit/bit-operator to control object state?"


P.S. Yes, I could write code to test this myself, but then the SO community wouldn't get to play along!


6 Answers


For question #1: Yes, on most 32-bit platforms, an array of boolean values should be faster, because you will just be loading each 32-bit-aligned value in the array and testing it against 0. If you use a single word, you will have all that work plus the overhead of bit-fiddling.
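To make the two options concrete, here is a minimal sketch in C. The function names test_bit and test_element are made up for illustration, and the array form assumes one full 32-bit word per flag, as described above. Which one is actually faster still depends on the compiler and the hardware.

```c
#include <stdbool.h>
#include <stdint.h>

/* Packed form: flag n lives in bit n of a single 32-bit word.
   Caller must keep n in the range 0..31. */
bool test_bit(uint32_t word, unsigned n) {
    /* Shift an unsigned 1 so that n == 31 is well defined. */
    return (word & (UINT32_C(1) << n)) != 0;
}

/* Array form: flag n is its own 32-bit element, loaded and compared with 0. */
bool test_element(const uint32_t flags[32], unsigned n) {
    return flags[n] != 0;
}
```

The packed version needs a shift and an AND on top of having the word available, while the array version needs an indexed load from memory.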


For question #2: Again, yes, since sending data over a network is significantly slower than operating on data in the CPU and main memory, the overhead of sending even one word will strongly outweigh any performance gain or loss you get by aligning words or bit fiddling.
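If you do keep an array in memory but want to ship a single word between client and server, a conversion along these lines would work. This is only a sketch: pack_flags and unpack_flags are hypothetical helpers, and byte order on the wire is not handled here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Collapse 32 booleans into one word: flags[n] becomes bit n. */
uint32_t pack_flags(const bool flags[32]) {
    uint32_t word = 0;
    for (unsigned n = 0; n < 32; n++) {
        if (flags[n]) {
            word |= UINT32_C(1) << n;
        }
    }
    return word;
}

/* Expand a received word back into 32 booleans. */
void unpack_flags(uint32_t word, bool flags[32]) {
    for (unsigned n = 0; n < 32; n++) {
        flags[n] = ((word >> n) & 1u) != 0;
    }
}
```

The cost of these loops is tiny next to a network round trip, which is the point made above.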



Bear in mind that a theoretically faster solution that doesn't fit into a cache line might be slower than a theoretically slower one that does, depending on a whole host of things. If this is actually something that needs to be fast, as determined by profiling, test both ways and see. If it doesn't, do whatever looks like cleaner code, which is probably the array.



It depends on the compiler and the access patterns and the platform. Raymond Chen has an excellent cost-benefit analysis: http://blogs.msdn.com/oldnewthing/archive/2008/11/26/9143050.aspx .

Even on non-x86 platforms, the use of bits can be prohibitive: at least one PPC platform out there uses microcoded instructions to perform a variable shift, which can do nasty things to other hardware threads.


So it can be a win, but you need to understand the context in which it will be good and bad. (Which is a general thing anyway.)


This is the code generated by 0 != (value & (1 << index)) to test a bit:

00401000  mov         eax,1 
00401005  shl         eax,cl 
00401007  and         eax,1 

And this by values[index] to test a bool[]:

00401000  movzx       eax,byte ptr [ecx+eax]

I can't figure out how to put a loop around it that doesn't get optimized away, so I'll vote bool[].
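One common way to keep such a loop alive is to make its result observable, for example by storing it in a volatile variable. The sketch below assumes that approach. The names sink, count_hits_word, and count_hits_array are invented for illustration. An optimizer may still vectorize or otherwise transform the loop body, so this measures loop throughput rather than the cost of one isolated access.

```c
#include <stdbool.h>
#include <stdint.h>

/* A store to a volatile object is an observable side effect,
   so the compiler cannot discard the computation feeding it. */
static volatile uint32_t sink;

/* Test bits 0..31 of value over and over, counting the set ones. */
uint32_t count_hits_word(uint32_t value, uint32_t iterations) {
    uint32_t hits = 0;
    for (uint32_t i = 0; i < iterations; i++) {
        uint32_t index = i & 31u;  /* cycle through bit positions */
        hits += (value & (UINT32_C(1) << index)) != 0;
    }
    sink = hits;
    return hits;
}

/* Same access pattern against a bool[32]. */
uint32_t count_hits_array(const bool values[32], uint32_t iterations) {
    uint32_t hits = 0;
    for (uint32_t i = 0; i < iterations; i++) {
        hits += values[i & 31u];
    }
    sink = hits;
    return hits;
}
```

To compare the two, call each with inputs that are only known at run time and a large iteration count, and time the calls with whatever clock your platform provides.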


If you are going to check more than one value at a time, doing it in parallel will obviously be faster. If you're only checking one value, it's probably the same.
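Here is a rough illustration of the "more than one value at a time" case, with hypothetical helper names. With a packed word, one AND and one compare cover every flag named in the mask. With an array, each flag needs its own load.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if every bit selected by mask is set in word. */
bool all_set_word(uint32_t word, uint32_t mask) {
    return (word & mask) == mask;
}

/* True if at least one bit selected by mask is set in word. */
bool any_set_word(uint32_t word, uint32_t mask) {
    return (word & mask) != 0;
}

/* Array equivalent of all_set_word: one load and test per index.
   Every entry of indices must be in the range 0..31. */
bool all_set_array(const bool flags[32], const unsigned *indices, size_t count) {
    for (size_t i = 0; i < count; i++) {
        if (!flags[indices[i]]) {
            return false;
        }
    }
    return true;
}
```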


If you need a better answer than that, write some tests and get back to us.



I think a byte array is probably better than a full-word array for simple random access.


It will give better cache locality than using the full word size, and I don't think byte access is any slower on most/all common architectures.
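A quick way to see the locality difference is to compare footprints. The sketch below assumes a 64-byte cache line and storage that starts on a line boundary. Both are assumptions, so check your target. The type and function names are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* 32 flags stored one per 32-bit word: 128 bytes. */
typedef struct { uint32_t flag[32]; } word_flags;

/* 32 flags stored one per byte: 32 bytes. */
typedef struct { uint8_t flag[32]; } byte_flags;

/* Number of cache lines covered by a block of the given size,
   assuming the block starts exactly on a line boundary. */
size_t cache_lines_spanned(size_t bytes, size_t line_size) {
    return (bytes + line_size - 1) / line_size;
}
```

Under those assumptions the word array spans two lines and the byte array fits in one. A single packed word would of course take only 4 bytes.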


