How do I quickly find whether a value is present in a C array?

Date: 2022-08-03 03:47:15

I have an embedded application with a time-critical ISR that needs to iterate through an array of size 256 (preferably 1024, but 256 is the minimum) and check whether a value matches the array's contents. A bool will be set to true if this is the case. The MCU is an NXP LPC4357 (ARM Cortex-M4 core) and the compiler is GCC. I have already combined optimisation level -O2 (-O3 is slower) with placing the function in RAM instead of flash. I also use pointer arithmetic and a for loop that counts down instead of up (checking i != 0 is faster than checking i < 256). All in all, I end up with a duration of 12.5 µs, which has to be reduced drastically to be feasible. This is the (pseudo) code I use now:


uint32_t i;
uint32_t *array_ptr = &theArray[0];
uint32_t compareVal = 0x1234ABCD;
bool validFlag = false;

for (i=256; i!=0; i--)
{
    if (compareVal == *array_ptr++)
    {
         validFlag = true;
         break;
    }
}

What would be the absolute fastest way to do this? Using inline assembly is allowed. Other 'less elegant' tricks also allowed.


14 Answers

#1


100  

In situations where performance is of utmost importance, the C compiler will most likely not produce the fastest code compared to what you can do with hand-tuned assembly language. I tend to take the path of least resistance - for small routines like this, I just write asm code and have a good idea how many cycles it will take to execute. You may be able to fiddle with the C code and get the compiler to generate good output, but you may end up wasting lots of time tuning the output that way. Compilers (especially from Microsoft) have come a long way in the last few years, but they are still not as smart as the compiler between your ears, because you're working on your specific situation and not just a general case. The compiler may not make use of certain instructions (e.g. LDM) that can speed this up, and it's unlikely to be smart enough to unroll the loop. Here's a way to do it which incorporates the 3 ideas I mentioned in my comment: loop unrolling, cache prefetch and making use of the multiple-load (LDM) instruction. The instruction cycle count comes out to about 3 clocks per array element, but this doesn't take into account memory delays.


Theory of operation: ARM's CPU design executes most instructions in one clock cycle, but the instructions are executed in a pipeline. C compilers will try to eliminate the pipeline delays by interleaving other instructions in between. When presented with a tight loop like the original C code, the compiler will have a hard time hiding the delays because the value read from memory must be immediately compared. My code below alternates between 2 sets of 4 registers to significantly reduce the delays of the memory itself and the pipeline fetching the data. In general, when working with large data sets and your code doesn't make use of most or all of the available registers, then you're not getting maximum performance.


; r0 = count, r1 = source ptr, r2 = comparison value

   stmfd sp!,{r4-r11}   ; save non-volatile registers
   mov r3,r0,LSR #3     ; loop count = total count / 8
   pld [r1,#128]
   ldmia r1!,{r4-r7}    ; pre load first set
loop_top:
   pld [r1,#128]
   ldmia r1!,{r8-r11}   ; pre load second set
   cmp r4,r2            ; search for match
   cmpne r5,r2          ; use conditional execution to avoid extra branch instructions
   cmpne r6,r2
   cmpne r7,r2
   beq found_it
   ldmia r1!,{r4-r7}    ; use 2 sets of registers to hide load delays
   cmp r8,r2
   cmpne r9,r2
   cmpne r10,r2
   cmpne r11,r2
   beq found_it
   subs r3,r3,#1        ; decrement loop count
   bne loop_top
   mov r0,#0            ; return value = false (not found)
   ldmia sp!,{r4-r11}   ; restore non-volatile registers
   bx lr                ; return
found_it:
   mov r0,#1            ; return true
   ldmia sp!,{r4-r11}
   bx lr

Update: There are a lot of skeptics in the comments who think that my experience is anecdotal/worthless and require proof. I used GCC 4.8 (from the Android NDK 9C) to generate the following output with optimization -O2 (all optimizations turned on including loop unrolling). I compiled the original C code presented in the question above. Here's what GCC produced:


.L9: cmp r3, r0
     beq .L8
.L3: ldr r2, [r3, #4]!
     cmp r2, r1
     bne .L9
     mov r0, #1
.L2: add sp, sp, #1024
     bx  lr
.L8: mov r0, #0
     b .L2

GCC's output not only doesn't unroll the loop, but also wastes a clock on a stall after the LDR. It requires at least 8 clocks per array element. It does a good job of using the address to know when to exit the loop, but all of the magical things compilers are capable of doing are nowhere to be found in this code. I haven't run the code on the target platform (I don't own one), but anyone experienced in ARM code performance can see that my code is faster.


Update 2: I gave Microsoft's Visual Studio 2013 SP2 a chance to do better with the code. It was able to use NEON instructions to vectorize my array initialization, but the linear value search as written by the OP came out similar to what GCC generated (I renamed the labels to make it more readable):


loop_top:
   ldr  r3,[r1],#4  
   cmp  r3,r2  
   beq  true_exit
   subs r0,r0,#1 
   bne  loop_top
false_exit: xxx
   bx   lr
true_exit: xxx
   bx   lr

As I said, I don't own the OP's exact hardware, but I will be testing the performance on an nVidia Tegra 3 and Tegra 4 of the 3 different versions and post the results here soon.


Update 3: I ran my code and Microsoft's compiled ARM code on a Tegra 3 and Tegra 4 (Surface RT, Surface RT 2). I ran 1000000 iterations of a loop which fails to find a match so that everything is in cache and it's easy to measure.


             My Code       MS Code
Surface RT    297ns         562ns
Surface RT 2  172ns         296ns  

In both cases my code runs almost twice as fast. Most modern ARM CPUs will probably give similar results.


#2


86  

There's a trick for optimizing it (I was asked this on a job-interview once):


  • If the last entry in the array holds the value that you're looking for, then return true
  • Write the value that you're looking for into the last entry in the array
  • Iterate the array until you encounter the value that you're looking for
  • If you've encountered it before the last entry in the array, then return true
  • Return false

bool check(uint32_t theArray[], uint32_t compareVal)
{
    uint32_t i;
    uint32_t x = theArray[SIZE-1];
    if (x == compareVal)
        return true;
    theArray[SIZE-1] = compareVal;
    for (i = 0; theArray[i] != compareVal; i++);
    theArray[SIZE-1] = x;
    return i != SIZE-1;
}

This yields one branch per iteration instead of two branches per iteration.



UPDATE:


If you're allowed to allocate the array to SIZE+1, then you can get rid of the "last entry swapping" part:


bool check(uint32_t theArray[], uint32_t compareVal)
{
    uint32_t i;
    theArray[SIZE] = compareVal;
    for (i = 0; theArray[i] != compareVal; i++);
    return i != SIZE;
}

You can also get rid of the additional arithmetic embedded in theArray[i], using the following instead:


bool check(uint32_t theArray[], uint32_t compareVal)
{
    uint32_t *arrayPtr;
    theArray[SIZE] = compareVal;
    for (arrayPtr = theArray; *arrayPtr != compareVal; arrayPtr++);
    return arrayPtr != theArray+SIZE;
}

If the compiler doesn't already apply this transformation, this version will force it. On the other hand, it might make it harder for the optimizer to unroll the loop, so you will have to verify that in the generated assembly code...


#3


61  

You're asking for help with optimising your algorithm, which may push you to assembler. But your algorithm (a linear search) is not so clever, so you should consider changing your algorithm. E.g.:


Perfect hash function

If your 256 "valid" values are static and known at compile time, then you can use a perfect hash function. You need to find a hash function that maps your input value to a value in the range 0..n, where there are no collisions for all the valid values you care about. That is, no two "valid" values hash to the same output value. When searching for a good hash function, you aim to:


  • Keep the hash function reasonably fast.
  • Minimise n. The smallest you can get is 256 (minimal perfect hash function), but that's probably hard to achieve, depending on the data.

Note for efficient hash functions, n is often a power of 2, which is equivalent to a bitwise mask of low bits (AND operation). Example hash functions:


  • CRC of input bytes, modulo n.
  • ((x << i) ^ (x >> j) ^ (x << k) ^ ...) % n (picking as many i, j, k, ... as needed, with left or right shifts)

Then you make a fixed table of n entries, where the hash maps the input values to an index i into the table. For valid values, table entry i contains the valid value. For all other table entries, ensure that each entry of index i contains some other invalid value which doesn't hash to i.


Then in your interrupt routine, with input x:


  1. Hash x to index i (which is in the range 0..n)
  2. Look up entry i in the table and see if it contains the value x.

This will be much faster than a linear search of 256 or 1024 values.


I've written some Python code to find reasonable hash functions.

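As a toy illustration of the lookup itself (the hash function and table contents below are made-up placeholders; a real perfect hash for your 256 values has to be found offline, e.g. with a search script like the one mentioned above):

```c
#include <stdint.h>
#include <stdbool.h>

#define TABLE_SIZE 8  /* toy n (a power of two); real code would use 256..1024 */

/* Placeholder hash; for a true perfect hash the shift constants are searched
 * offline until none of the valid values collide. */
static uint32_t hash(uint32_t x)
{
    return ((x >> 3) ^ x) & (TABLE_SIZE - 1);
}

/* One valid value per slot: 0x40->0, 0x08->1, 0x11->3, 0x23->7 under hash().
 * Unused slots stay 0, which hashes to slot 0, so they can never match. */
static const uint32_t table[TABLE_SIZE] = {
    [0] = 0x40, [1] = 0x08, [3] = 0x11, [7] = 0x23,
};

/* The whole ISR-side check: one hash, one load, one compare. */
bool is_valid(uint32_t x)
{
    return table[hash(x)] == x;
}
```

The point is that the per-interrupt cost is constant regardless of whether the table holds 256 or 1024 values; all the hard work happens offline.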

Binary search

If you sort your array of 256 "valid" values, then you can do a binary search, rather than a linear search. That means you should be able to search 256-entry table in only 8 steps (log2(256)), or a 1024-entry table in 10 steps. Again, this will be much faster than a linear search of 256 or 1024 values.

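A plain iterative version of that search (nothing platform-specific assumed here):

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* Binary search over a sorted array: at most log2(len)+1 comparisons,
 * i.e. about 8 for 256 entries and 10 for 1024. */
bool contains(const uint32_t *sorted, size_t len, uint32_t key)
{
    size_t lo = 0, hi = len;            /* half-open range [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (sorted[mid] < key)
            lo = mid + 1;               /* key is in the upper half */
        else if (sorted[mid] > key)
            hi = mid;                   /* key is in the lower half */
        else
            return true;                /* exact match */
    }
    return false;
}
```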

#4


59  

Keep the table in sorted order, and use Bentley's unrolled binary search:


bool check(const uint32_t a[1024], uint32_t key)
{
    uint32_t i = 0;
    if (key >= a[i+512]) i += 512;
    if (key >= a[i+256]) i += 256;
    if (key >= a[i+128]) i += 128;
    if (key >= a[i+ 64]) i +=  64;
    if (key >= a[i+ 32]) i +=  32;
    if (key >= a[i+ 16]) i +=  16;
    if (key >= a[i+  8]) i +=   8;
    if (key >= a[i+  4]) i +=   4;
    if (key >= a[i+  2]) i +=   2;
    if (key >= a[i+  1]) i +=   1;
    return (key == a[i]);
}

The point is,


  • if you know how big the table is, then you know how many iterations there will be, so you can fully unroll it.
  • Then, there's no point testing for the == case on each iteration because, except on the last iteration, the probability of that case is too low to justify spending time testing for it.**
  • Finally, by expanding the table to a power of 2, you add at most one comparison, and at most a factor of two storage.

** If you're not used to thinking in terms of probabilities, every decision point has an entropy, which is the average information you learn by executing it. For the >= tests, the probability of each branch is about 0.5, and -log2(0.5) is 1, so that means if you take one branch you learn 1 bit, and if you take the other branch you learn one bit, and the average is just the sum of what you learn on each branch times the probability of that branch. So 1*0.5 + 1*0.5 = 1, so the entropy of the >= test is 1. Since you have 10 bits to learn, it takes 10 branches. That's why it's fast!


On the other hand, what if your first test is if (key == a[i+512])? The probability of being true is 1/1024, while the probability of false is 1023/1024. So if it's true you learn all 10 bits! But if it's false you learn -log2(1023/1024) = .00141 bits, practically nothing! So the average amount you learn from that test is 10/1024 + .00141*1023/1024 = .0098 + .00141 = .0112 bits. About one hundredth of a bit. That test is not carrying its weight!


#5


16  

If the set of constants in your table is known in advance, you can use perfect hashing to ensure that only one access is made to the table. Perfect hashing determines a hash function that maps every interesting key to a unique slot (that table isn't always dense, but you can decide how un-dense a table you can afford, with less dense tables typically leading to simpler hashing functions).


Usually, the perfect hash function for the specific set of keys is relatively easy to compute; you don't want that to be long and complicated because that competes for time perhaps better spent doing multiple probes.


Perfect hashing is a "1-probe max" scheme. One can generalize the idea, with the thought that one should trade simplicity of computing the hash code with the time it takes to make k probes. After all, the goal is "least total time to look up", not fewest probes or simplest hash function. However, I've never seen anybody build a k-probes-max hashing algorithm. I suspect one can do it, but that's likely research.


One other thought: if your processor is extremely fast, the one probe to memory from a perfect hash probably dominates the execution time. If the processor is not very fast, then k > 1 probes might be practical.


#6


13  

Use a hash set. It will give O(1) lookup time.


The following code assumes that you can reserve value 0 as an 'empty' value, i.e. not occurring in actual data. The solution can be expanded for a situation where this is not the case.


#define HASH(x) ((((x) >> 16) ^ (x)) & 1023)
#define HASH_LEN 1024
uint32_t my_hash[HASH_LEN];

int lookup(uint32_t value)
{
    int i = HASH(value);
    while (my_hash[i] != 0 && my_hash[i] != value) i = (i + 1) % HASH_LEN;
    return i;
}

void store(uint32_t value)
{
    int i = lookup(value);
    if (my_hash[i] == 0)
       my_hash[i] = value;
}

bool contains(uint32_t value)
{
    return (my_hash[lookup(value)] == value);
}

In this example implementation, the lookup time will typically be very low, but at the worst case can be up to the number of entries stored. For a realtime application, you can consider also an implementation using binary trees, which will have a more predictable lookup time.


#7


9  

In this case, it might be worthwhile investigating bloom filters. They're capable of quickly establishing that a value is not present, which is a good thing since most of the 2^32 possible values are not in that 1024 element array. However, there are some false positives that will need an extra check.


Since your table is apparently static, you can determine which false positives exist for your Bloom filter and put those in a perfect hash.

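A minimal sketch of the idea (the filter size and the two hash mixes below are arbitrary placeholders; a real filter picks the bit count and number of hashes from the desired false-positive rate):

```c
#include <stdint.h>
#include <stdbool.h>

#define FILTER_BITS 2048  /* assumption: sized for roughly 1024 entries */

static uint32_t filter[FILTER_BITS / 32];

/* Two cheap hash mixes (placeholder constants, not tuned). */
static uint32_t h1(uint32_t x) { return (x * 2654435761u) % FILTER_BITS; }
static uint32_t h2(uint32_t x) { return ((x >> 15) ^ x) % FILTER_BITS; }

static void set_bit(uint32_t b)  { filter[b / 32] |= 1u << (b % 32); }
static bool get_bit(uint32_t b)  { return (filter[b / 32] >> (b % 32)) & 1u; }

/* Build time: add each valid value to the filter. */
void bloom_add(uint32_t v) { set_bit(h1(v)); set_bit(h2(v)); }

/* false -> definitely absent; true -> probably present, verify with a real search */
bool bloom_maybe(uint32_t v) { return get_bit(h1(v)) && get_bit(h2(v)); }
```

The fast path rejects most of the 2^32 non-members with two bit tests; only the rare "maybe" answers fall through to the slower exact check.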

#8


8  

Assuming your processor runs at 204 MHz which seems to be the maximum for the LPC4357, and also assuming your timing result reflects the average case (half of the array traversed), we get:


  • CPU frequency: 204 MHz
  • Cycle period: 4.9 ns
  • Duration in cycles: 12.5 µs / 4.9 ns = 2551 cycles
  • Cycles per iteration: 2551 / 128 = 19.9

So, your search loop spends around 20 cycles per iteration. That doesn't sound awful, but I guess that in order to make it faster you need to look at the assembly.


I would recommend dropping the index and using a pointer comparison instead, and making all the pointers const.


bool arrayContains(const uint32_t *array, size_t length)
{
  const uint32_t * const end = array + length;
  while(array != end)
  {
    if(*array++ == 0x1234ABCD)
      return true;
  }
  return false;
}

That's at least worth testing.


#9


6  

Other people have suggested reorganizing your table, adding a sentinel value at the end, or sorting it in order to provide a binary search.


You state "I also use pointer arithmetic and a for loop, which does down-counting instead of up (checking if i != 0 is faster than checking if i < 256)."


My first advice is: get rid of the pointer arithmetic and the downcounting. Stuff like


for (i=0; i<256; i++)
{
    if (compareVal == the_array[i])
    {
       [...]
    }
}

tends to be idiomatic to the compiler. The loop is idiomatic, and the indexing of an array over a loop variable is idiomatic. Juggling with pointer arithmetic and pointers will tend to obfuscate the idioms to the compiler and make it generate code related to what you wrote rather than what the compiler writer decided to be the best course for the general task.


For example, the above code might be compiled into a loop running from -256 or -255 to zero, indexing off &the_array[256]. Possibly stuff that is not even expressible in valid C but matches the architecture of the machine you are generating for.


So don't microoptimize. You are just throwing spanners into the works of your optimizer. If you want to be clever, work on the data structures and algorithms but don't microoptimize their expression. It will just come back to bite you, if not on the current compiler/architecture, then on the next.


In particular using pointer arithmetic instead of arrays and indexes is poison for the compiler being fully aware of alignments, storage locations, aliasing considerations and other stuff, and for doing optimizations like strength reduction in the way best suited to the machine architecture.


#10


3  

Vectorization can be used here, as it often is in implementations of memchr. Use the following algorithm:


  1. Create a mask of your query repeating, equal in length to your OS'es bit count (64-bit, 32-bit, etc.). On a 64-bit system you would repeat the 32-bit query twice.

  2. Process the list as a list of multiple pieces of data at once, simply by casting the list to a list of a larger data type and pulling values out. For each chunk, XOR it with the mask, then XOR with 0b0111...1, then add 1, then & with a mask of 0b1000...0 repeating. If the result is 0, there is definitely not a match. Otherwise, there may (usually with very high probability) be a match, so search the chunk normally.

Example implementation: https://sourceware.org/cgi-bin/cvsweb.cgi/src/newlib/libc/string/memchr.c?rev=1.3&content-type=text/x-cvsweb-markup&cvsroot=src

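A portable sketch of the same idea for 32-bit values on 64-bit words (this uses the exact zero-lane test from the well-known "haszero" bit trick rather than the add-1 variant described above, so a hit never needs re-checking):

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if either 32-bit lane of v is zero (exact, no false positives). */
static bool has_zero_lane(uint64_t v)
{
    return ((v - 0x0000000100000001ULL) & ~v & 0x8000000080000000ULL) != 0;
}

bool contains_swar(const uint32_t *array, size_t count, uint32_t key)
{
    uint64_t mask = ((uint64_t)key << 32) | key;  /* key repeated twice */
    size_t i = 0;
    for (; i + 2 <= count; i += 2) {              /* two elements per step */
        uint64_t chunk;
        memcpy(&chunk, &array[i], sizeof chunk);  /* alignment-safe load */
        if (has_zero_lane(chunk ^ mask))          /* a zero lane == a match */
            return true;
    }
    return (i < count) && (array[i] == key);      /* odd trailing element */
}
```

This halves the number of loop iterations on a 64-bit machine; on the 32-bit Cortex-M4 in the question the same effect comes from LDM, as in the accepted answer.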

#11


3  

If you can accommodate the domain of your values with the amount of memory that's available to your application, then, the fastest solution would be to represent your array as an array of bits:


bool theArray[MAX_VALUE]; // of which 1024 values are true, the rest false
uint32_t compareVal = 0x1234ABCD;
bool validFlag = theArray[compareVal];

EDIT

编辑

I'm astounded by the number of critics. The title of this thread is "How do I quickly find whether a value is present in a C array?" for which I will stand by my answer because it answers precisely that. I could argue that this has the most speed efficient hash function (since address === value). I've read the comments and I'm aware of the obvious caveats. Undoubtedly those caveats limit the range of problems this can be used to solve, but, for those problems that it does solve, it solves very efficiently.


Rather than reject this answer outright, consider it as the optimal starting point for which you can evolve by using hash functions to achieve a better balance between speed and performance.

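For domains too large to afford one bool per value, the same idea can be packed one bit per value, cutting the memory 8- to 32-fold at the cost of a shift and a mask (a sketch, assuming here that the values fit in 16 bits):

```c
#include <stdint.h>
#include <stdbool.h>

#define MAX_VALUE 65536  /* assumption: the values fit in 16 bits */

static uint32_t bitmap[MAX_VALUE / 32];  /* 8 KB instead of 64 KB of bool */

/* Build time: mark each valid value. */
static void mark_valid(uint32_t v)
{
    bitmap[v >> 5] |= 1u << (v & 31);
}

/* Lookup: one load, one shift, one mask. */
static bool is_valid(uint32_t v)
{
    return (bitmap[v >> 5] >> (v & 31)) & 1u;
}
```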

#12


0  

I'm sorry if my answer was already given - I'm just a lazy reader. Feel free to downvote then ))


1) You could remove the counter 'i' altogether - just compare pointers, i.e.


for (ptr = &the_array[0]; ptr < the_array+1024; ptr++)
{
    if (compareVal == *ptr)
    {
       break;
    }
}
// ...compare ptr and the_array+1024 here - you do not need validFlag at all

All that won't give any significant improvement though; such optimization could probably be achieved by the compiler itself.


2) As already mentioned in other answers, almost all modern CPUs are RISC-based, for example ARM. Even modern Intel X86 CPUs use RISC cores inside, as far as I know (translating from X86 on the fly). A major optimization for RISC is pipeline optimization (for Intel and other CPUs as well), minimizing code jumps. One type of such optimization (probably a major one) is "cycle rollback", i.e. loop unrolling. It's incredibly stupid and efficient; even the Intel compiler can do that, AFAIK. It looks like:


if (compareVal == the_array[0]) { validFlag = true; goto end_of_compare; }
if (compareVal == the_array[1]) { validFlag = true; goto end_of_compare; }
...and so on...
end_of_compare:

This way the optimization is that the pipeline is not broken in the worst case (when compareVal is absent from the array), so it is as fast as possible (of course not counting algorithm optimizations such as hash tables, sorted arrays and so on, mentioned in other answers, which may give better results depending on array size; the cycle-rollback approach can be applied there as well, by the way). I'm writing here only about what I think wasn't covered by others.


The second part of this optimization is that each array item is accessed by a direct address (calculated at compile time; make sure you use a static array), so no additional ADD operation is needed to compute the pointer from the array's base address. This may not have a significant effect, since AFAIK the ARM architecture has special features to speed up array addressing. But anyway, it's always better to know that you did all the best directly in C code, right?


Cycle rollback may look awkward due to the waste of ROM (yep, you did right placing it in the fast part of RAM, if your board supports this feature), but actually it's a fair price to pay for speed, being based on the RISC concept. This is just a general point of computation optimization - you sacrifice space for the sake of speed, and vice versa, depending on your requirements.


If you think that rollback for an array of 1024 elements is too large a sacrifice for your case, you can consider 'partial rollback', for example dividing the array into 2 parts of 512 items each, or 4x256, and so on.

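In C, a hedged sketch of such a partial rollback is just a manually unrolled loop (the factor 4 here is arbitrary; 256 and 1024 are both divisible by it):

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* Linear search unrolled by 4 to cut loop-control branches by 4x.
 * count is assumed to be a multiple of 4. */
bool contains_unrolled(const uint32_t *p, size_t count, uint32_t key)
{
    for (size_t i = 0; i < count; i += 4) {
        if (p[i]     == key) return true;
        if (p[i + 1] == key) return true;
        if (p[i + 2] == key) return true;
        if (p[i + 3] == key) return true;
    }
    return false;
}
```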

3) Modern CPUs often support SIMD operations, for example the ARM NEON instruction set - it allows executing the same operation on multiple operands in parallel. Frankly speaking, I do not remember whether it is suitable for comparison operations, but I feel it may be; you should check that. Googling shows that there may be some tricks as well to get maximum speed, see https://*.com/a/5734019/1028256


I hope it can give you some new ideas.


#13


0  

This is more like an addendum than an answer.


I've had a similar case in the past, but my array was constant over a considerable number of searches.


In half of them, the searched value was NOT present in array. Then I realized I could apply a "filter" before doing any search.


This "filter" is just a simple integer number, calculated ONCE and used in each search.


It's in Java, but it's pretty simple:


binaryfilter = 0;
for (int i = 0; i < array.length; i++)
{
    // just apply "Binary OR Operator" over values.
    binaryfilter = binaryfilter | array[i]; 
}

So, before do a binary search, I check binaryfilter:


// check binaryfilter vs value with a "Binary AND Operator"
if ( (binaryfilter & valuetosearch) != valuetosearch) 
{
    // valuetosearch is not in the array!
    return false;
}
else
{
    // valuetosearch MAYBE in the array, so let's check it out
    // ... do binary search stuff ...

}

You can use a 'better' hash algorithm, but this can be very fast, especially for large numbers. Maybe this could save you even more cycles.


#14


0  

I'm a great fan of hashing. The problem of course is to find an efficient algorithm that is both fast and uses a minimum amount of memory (especially on an embedded processor).


If you know beforehand the values that may occur you can create a program that runs through a multitude of algorithms to find the best one - or, rather, the best parameters for your data.


I created such a program that you can read about in this post and achieved some very fast results. 16000 entries translates roughly to 2^14 or an average of 14 comparisons to find the value using a binary search. I explicitly aimed for very fast lookups - on average finding the value in <=1.5 lookups - which resulted in greater RAM requirements. I believe that with a more conservative average value (say <=3) a lot of memory could be saved. By comparison the average case for a binary search on your 256 or 1024 entries would result in an average number of comparisons of 8 and 10, respectively.


My average lookup required around 60 cycles (on a laptop with an Intel i5) with a generic algorithm (utilizing one division by a variable) and 40-45 cycles with a specialized one (probably utilizing a multiplication). This should translate into sub-microsecond lookup times on your MCU, depending of course on the clock frequency it executes at.


It can be real-life-tweaked further if the entry array keeps track of how many times an entry was accessed. If the entry array is sorted from most to least accessed before the indices are computed, then it'll find the most commonly occurring values with a single comparison.



Theory of operation: ARM's CPU design executes most instructions in one clock cycle, but the instructions are executed in a pipeline. C compilers will try to eliminate the pipeline delays by interleaving other instructions in between. When presented with a tight loop like the original C code, the compiler will have a hard time hiding the delays because the value read from memory must be immediately compared. My code below alternates between 2 sets of 4 registers to significantly reduce the delays of the memory itself and the pipeline fetching the data. In general, when working with large data sets and your code doesn't make use of most or all of the available registers, then you're not getting maximum performance.

; r0 = count, r1 = source ptr, r2 = comparison value

   stmfd sp!,{r4-r11}   ; save non-volatile registers
   mov r3,r0,LSR #3     ; loop count = total count / 8
   pld [r1,#128]
   ldmia r1!,{r4-r7}    ; pre load first set
loop_top:
   pld [r1,#128]
   ldmia r1!,{r8-r11}   ; pre load second set
   cmp r4,r2            ; search for match
   cmpne r5,r2          ; use conditional execution to avoid extra branch instructions
   cmpne r6,r2
   cmpne r7,r2
   beq found_it
   ldmia r1!,{r4-r7}    ; use 2 sets of registers to hide load delays
   cmp r8,r2
   cmpne r9,r2
   cmpne r10,r2
   cmpne r11,r2
   beq found_it
   subs r3,r3,#1        ; decrement loop count
   bne loop_top
   mov r0,#0            ; return value = false (not found)
   ldmia sp!,{r4-r11}   ; restore non-volatile registers
   bx lr                ; return
found_it:
   mov r0,#1            ; return true
   ldmia sp!,{r4-r11}
   bx lr

Update: There are a lot of skeptics in the comments who think that my experience is anecdotal/worthless and require proof. I used GCC 4.8 (from the Android NDK 9C) to generate the following output with optimization -O2 (all optimizations turned on including loop unrolling). I compiled the original C code presented in the question above. Here's what GCC produced:

.L9: cmp r3, r0
     beq .L8
.L3: ldr r2, [r3, #4]!
     cmp r2, r1
     bne .L9
     mov r0, #1
.L2: add sp, sp, #1024
     bx  lr
.L8: mov r0, #0
     b .L2

GCC's output not only doesn't unroll the loop, but also wastes a clock on a stall after the LDR. It requires at least 8 clocks per array element. It does a good job of using the address to know when to exit the loop, but all of the magical things compilers are capable of doing are nowhere to be found in this code. I haven't run the code on the target platform (I don't own one), but anyone experienced in ARM code performance can see that my code is faster.

Update 2: I gave Microsoft's Visual Studio 2013 SP2 a chance to do better with the code. It was able to use NEON instructions to vectorize my array initialization, but the linear value search as written by the OP came out similar to what GCC generated (I renamed the labels to make it more readable):

loop_top:
   ldr  r3,[r1],#4  
   cmp  r3,r2  
   beq  true_exit
   subs r0,r0,#1 
   bne  loop_top
false_exit: xxx
   bx   lr
true_exit: xxx
   bx   lr

As I said, I don't own the OP's exact hardware, but I will be testing the performance on an nVidia Tegra 3 and Tegra 4 of the 3 different versions and post the results here soon.

Update 3: I ran my code and Microsoft's compiled ARM code on a Tegra 3 and Tegra 4 (Surface RT, Surface RT 2). I ran 1000000 iterations of a loop which fails to find a match so that everything is in cache and it's easy to measure.

             My Code       MS Code
Surface RT    297ns         562ns
Surface RT 2  172ns         296ns  

In both cases my code runs almost twice as fast. Most modern ARM CPUs will probably give similar results.

#2


86  

There's a trick for optimizing it (I was asked this on a job-interview once):

  • If the last entry in the array holds the value that you're looking for, then return true
  • Write the value that you're looking for into the last entry in the array
  • Iterate the array until you encounter the value that you're looking for
  • If you've encountered it before the last entry in the array, then return true
  • Return false

bool check(uint32_t theArray[], uint32_t compareVal)
{
    uint32_t i;
    uint32_t x = theArray[SIZE-1];
    if (x == compareVal)
        return true;
    theArray[SIZE-1] = compareVal;
    for (i = 0; theArray[i] != compareVal; i++);
    theArray[SIZE-1] = x;
    return i != SIZE-1;
}

This yields one branch per iteration instead of two branches per iteration.


UPDATE:

If you're allowed to allocate the array to SIZE+1, then you can get rid of the "last entry swapping" part:

bool check(uint32_t theArray[], uint32_t compareVal)
{
    uint32_t i;
    theArray[SIZE] = compareVal;
    for (i = 0; theArray[i] != compareVal; i++);
    return i != SIZE;
}

You can also get rid of the additional arithmetic embedded in theArray[i], using the following instead:

bool check(uint32_t theArray[], uint32_t compareVal)
{
    uint32_t *arrayPtr;
    theArray[SIZE] = compareVal;
    for (arrayPtr = theArray; *arrayPtr != compareVal; arrayPtr++);
    return arrayPtr != theArray+SIZE;
}

If the compiler doesn't already apply this transformation (array indexing to pointer increments), this version forces it. On the other hand, it might make it harder for the optimizer to unroll the loop, so you will have to verify that in the generated assembly code...

#3


61  

You're asking for help with optimising your algorithm, which may push you to assembler. But your algorithm (a linear search) is not so clever, so you should consider changing your algorithm. E.g.:

Perfect hash function

If your 256 "valid" values are static and known at compile time, then you can use a perfect hash function. You need to find a hash function that maps your input value to a value in the range 0..n, where there are no collisions for all the valid values you care about. That is, no two "valid" values hash to the same output value. When searching for a good hash function, you aim to:

  • Keep the hash function reasonably fast.
  • Minimise n. The smallest you can get is 256 (a minimal perfect hash function), but that's probably hard to achieve, depending on the data.

Note for efficient hash functions, n is often a power of 2, which is equivalent to a bitwise mask of low bits (AND operation). Example hash functions:

  • CRC of input bytes, modulo n.
  • ((x << i) ^ (x >> j) ^ (x << k) ^ ...) % n (picking as many i, j, k, ... as needed, with left or right shifts)

Then you make a fixed table of n entries, where the hash maps the input values to an index i into the table. For valid values, table entry i contains the valid value. For all other table entries, ensure that each entry of index i contains some other invalid value which doesn't hash to i.

Then in your interrupt routine, with input x:

  1. Hash x to index i (which is in the range 0..n)
  2. Look up entry i in the table and see if it contains the value x.

This will be much faster than a linear search of 256 or 1024 values.

I've written some Python code to find reasonable hash functions.

Binary search

If you sort your array of 256 "valid" values, then you can do a binary search, rather than a linear search. That means you should be able to search 256-entry table in only 8 steps (log2(256)), or a 1024-entry table in 10 steps. Again, this will be much faster than a linear search of 256 or 1024 values.

#4


59  

Keep the table in sorted order, and use Bentley's unrolled binary search:

i = 0;
if (key >= a[i+512]) i += 512;
if (key >= a[i+256]) i += 256;
if (key >= a[i+128]) i += 128;
if (key >= a[i+ 64]) i +=  64;
if (key >= a[i+ 32]) i +=  32;
if (key >= a[i+ 16]) i +=  16;
if (key >= a[i+  8]) i +=   8;
if (key >= a[i+  4]) i +=   4;
if (key >= a[i+  2]) i +=   2;
if (key >= a[i+  1]) i +=   1;
return (key == a[i]);

The point is,

  • if you know how big the table is, then you know how many iterations there will be, so you can fully unroll it.
  • Then, there's no point testing for the == case on each iteration because, except on the last iteration, the probability of that case is too low to justify spending time testing for it.**
  • Finally, by expanding the table to a power of 2, you add at most one comparison, and at most a factor of two storage.

** If you're not used to thinking in terms of probabilities, every decision point has an entropy, which is the average information you learn by executing it. For the >= tests, the probability of each branch is about 0.5, and -log2(0.5) is 1, so that means if you take one branch you learn 1 bit, and if you take the other branch you learn one bit, and the average is just the sum of what you learn on each branch times the probability of that branch. So 1*0.5 + 1*0.5 = 1, so the entropy of the >= test is 1. Since you have 10 bits to learn, it takes 10 branches. That's why it's fast!

On the other hand, what if your first test is if (key == a[i+512])? The probability of being true is 1/1024, while the probability of false is 1023/1024. So if it's true you learn all 10 bits! But if it's false you learn -log2(1023/1024) = .00141 bits, practically nothing! So the average amount you learn from that test is 10/1024 + .00141*1023/1024 = .0098 + .00141 = .0112 bits. About one hundredth of a bit. That test is not carrying its weight!

#5


16  

If the set of constants in your table is known in advance, you can use perfect hashing to ensure that only one access is made to the table. Perfect hashing determines a hash function that maps every interesting key to a unique slot (that table isn't always dense, but you can decide how un-dense a table you can afford, with less dense tables typically leading to simpler hashing functions).

Usually, the perfect hash function for the specific set of keys is relatively easy to compute; you don't want that to be long and complicated because that competes for time perhaps better spent doing multiple probes.

Perfect hashing is a "1-probe max" scheme. One can generalize the idea, with the thought that one should trade simplicity of computing the hash code with the time it takes to make k probes. After all, the goal is "least total time to look up", not fewest probes or simplest hash function. However, I've never seen anybody build a k-probes-max hashing algorithm. I suspect one can do it, but that's likely research.

One other thought: if your processor is extremely fast, the one probe to memory from a perfect hash probably dominates the execution time. If the processor is not very fast, then k>1 probes might be practical.

#6


13  

Use a hash set. It will give O(1) lookup time.

The following code assumes that you can reserve value 0 as an 'empty' value, i.e. not occurring in actual data. The solution can be expanded for a situation where this is not the case.

#define HASH(x) (((x >> 16) ^ x) & 1023)
#define HASH_LEN 1024
uint32_t my_hash[HASH_LEN];

int lookup(uint32_t value)
{
    int i = HASH(value);
    while (my_hash[i] != 0 && my_hash[i] != value) i = (i + 1) % HASH_LEN;
    return i;
}

void store(uint32_t value)
{
    int i = lookup(value);
    if (my_hash[i] == 0)
       my_hash[i] = value;
}

bool contains(uint32_t value)
{
    return (my_hash[lookup(value)] == value);
}

In this example implementation, the lookup time will typically be very low, but at the worst case can be up to the number of entries stored. For a realtime application, you can consider also an implementation using binary trees, which will have a more predictable lookup time.

#7


9  

In this case, it might be worthwhile investigating bloom filters. They're capable of quickly establishing that a value is not present, which is a good thing since most of the 2^32 possible values are not in that 1024 element array. However, there are some false positives that will need an extra check.

Since your table is apparently static, you can determine which false positives exist for your Bloom filter and put those in a perfect hash.

#8


8  

Assuming your processor runs at 204 MHz which seems to be the maximum for the LPC4357, and also assuming your timing result reflects the average case (half of the array traversed), we get:

  • CPU frequency: 204 MHz
  • Cycle period: 4.9 ns
  • Duration in cycles: 12.5 µs / 4.9 ns = 2551 cycles
  • Cycles per iteration: 2551 / 128 = 19.9

So, your search loop spends around 20 cycles per iteration. That doesn't sound awful, but I guess that in order to make it faster you need to look at the assembly.

I would recommend dropping the index and using a pointer comparison instead, and making all the pointers const.

bool arrayContains(const uint32_t *array, size_t length)
{
  const uint32_t * const end = array + length;
  while(array != end)
  {
    if(*array++ == 0x1234ABCD)
      return true;
  }
  return false;
}

That's at least worth testing.

#9


6  

Other people have suggested reorganizing your table, adding a sentinel value at the end, or sorting it in order to provide a binary search.

You state "I also use pointer arithmetic and a for loop, which does down-counting instead of up (checking if i != 0 is faster than checking if i < 256)."

My first advice is: get rid of the pointer arithmetic and the downcounting. Stuff like

for (i=0; i<256; i++)
{
    if (compareVal == the_array[i])
    {
       [...]
    }
}

tends to be idiomatic to the compiler. The loop is idiomatic, and the indexing of an array over a loop variable is idiomatic. Juggling with pointer arithmetic and pointers will tend to obfuscate the idioms to the compiler and make it generate code related to what you wrote rather than what the compiler writer decided to be the best course for the general task.

For example, the above code might be compiled into a loop running from -256 or -255 to zero, indexing off &the_array[256]. Possibly stuff that is not even expressible in valid C but matches the architecture of the machine you are generating for.

So don't microoptimize. You are just throwing spanners into the works of your optimizer. If you want to be clever, work on the data structures and algorithms but don't microoptimize their expression. It will just come back to bite you, if not on the current compiler/architecture, then on the next.

In particular using pointer arithmetic instead of arrays and indexes is poison for the compiler being fully aware of alignments, storage locations, aliasing considerations and other stuff, and for doing optimizations like strength reduction in the way best suited to the machine architecture.

#10


3  

Vectorization can be used here, as it often is in implementations of memchr. You use the following algorithm:

  1. Create a mask of your query repeating, equal in length to your OS's word size (64-bit, 32-bit, etc.). On a 64-bit system you would repeat the 32-bit query twice.

  2. Process the list as a list of multiple pieces of data at once, simply by casting the list to a list of a larger data type and pulling values out. For each chunk, XOR it with the mask, then XOR with 0b0111...1, then add 1, then AND with a mask of 0b1000...0 repeating. If the result is 0, there is definitely not a match. Otherwise, there may (usually with very high probability) be a match, so search the chunk normally.

Example implementation: https://sourceware.org/cgi-bin/cvsweb.cgi/src/newlib/libc/string/memchr.c?rev=1.3&content-type=text/x-cvsweb-markup&cvsroot=src

#11


3  

If you can accommodate the domain of your values with the amount of memory that's available to your application, then, the fastest solution would be to represent your array as an array of bits:

bool theArray[MAX_VALUE]; // of which 1024 values are true, the rest false
uint32_t compareVal = 0x1234ABCD;
bool validFlag = theArray[compareVal];

EDIT

I'm astounded by the number of critics. The title of this thread is "How do I quickly find whether a value is present in a C array?", and I will stand by my answer because it answers precisely that. I could argue that this has the most speed-efficient hash function (since address === value). I've read the comments and I'm aware of the obvious caveats. Undoubtedly those caveats limit the range of problems this can be used to solve, but, for those problems that it does solve, it solves very efficiently.

Rather than reject this answer outright, consider it as the optimal starting point for which you can evolve by using hash functions to achieve a better balance between speed and performance.

#12


0  

I'm sorry if my answer was already given; I'm a lazy reader. Feel free to downvote then ))

1) You could remove the counter 'i' altogether and just compare pointers, i.e.

for (ptr = &the_array[0]; ptr < the_array+1024; ptr++)
{
    if (compareVal == *ptr)
    {
       break;
    }
}
... compare ptr and the_array+1024 here - you do not need validFlag at all.

All that won't give any significant improvement though; such an optimization could probably be achieved by the compiler itself.

2) As already mentioned in other answers, almost all modern CPUs are RISC-based, for example ARM. Even modern Intel x86 CPUs use RISC cores inside, as far as I know (translating from x86 on the fly). A major optimization for RISC is pipeline optimization (and for Intel and other CPUs as well), minimizing code jumps. One type of such optimization (probably a major one) is "cycle rollback", i.e. loop unrolling. It's incredibly stupid and efficient; even the Intel compiler can do that, AFAIK. It looks like:

if (compareVal == the_array[0]) { validFlag = true; goto end_of_compare; }
if (compareVal == the_array[1]) { validFlag = true; goto end_of_compare; }
...and so on...
end_of_compare:

The point of this optimization is that the pipeline is never broken in the worst case (when compareVal is absent from the array), so it is as fast as possible (of course not counting algorithmic optimizations such as hash tables, sorted arrays and so on, mentioned in other answers, which may give better results depending on array size; by the way, the cycle-rollback approach can be applied to those as well. I'm writing here about what I think I didn't see in the other answers).

The second part of this optimization is that the array item is accessed by direct address (calculated at the compiling stage; make sure you use a static array) and does not need an additional ADD op to calculate the pointer from the array's base address. This optimization may not have a significant effect, since AFAIK the ARM architecture has special features to speed up array addressing. But anyway it's always better to know that you did all the best directly in C code, right?

Cycle rollback may look awkward due to the waste of ROM (yep, you did right placing it in the fast part of RAM, if your board supports this feature), but actually it's a fair price to pay for speed, being based on the RISC concept. This is just a general point of calculation optimization: you sacrifice space for the sake of speed, and vice versa, depending on your requirements.

If you think that rollback for an array of 1024 elements is too large a sacrifice for your case, you can consider a 'partial rollback', for example dividing the array into 2 parts of 512 items each, or 4x256, and so on.

3) Modern CPUs often support SIMD ops, for example the ARM NEON instruction set, which allows executing the same op on multiple data in parallel. Frankly speaking, I do not remember if it is suitable for comparison ops, but I feel it may be; you should check that. Googling shows that there may be some tricks as well to get max speed, see https://*.com/a/5734019/1028256

I hope it can give you some new ideas.

#13


0  

This is more like an addendum than an answer.

I've had a similar case in the past, but my array was constant over a considerable number of searches.

In half of them, the searched value was NOT present in array. Then I realized I could apply a "filter" before doing any search.

This "filter" is just a simple integer number, calculated ONCE and used in each search.

It's in Java, but it's pretty simple:

binaryfilter = 0;
for (int i = 0; i < array.length; i++)
{
    // just apply "Binary OR Operator" over values.
    binaryfilter = binaryfilter | array[i]; 
}

So, before do a binary search, I check binaryfilter:

// check binaryfilter vs value with a "Binary AND Operator"
if ( (binaryfilter & valuetosearch) != valuetosearch) 
{
    // valuetosearch is not in the array!
    return false;
}
else
{
    // valuetosearch MAYBE in the array, so let's check it out
    // ... do binary search stuff ...

}

You can use a 'better' hash algorithm, but this one can be very fast, especially for large numbers. This may save you even more cycles.

#14


0  

I'm a great fan of hashing. The problem of course is to find an efficient algorithm that is both fast and uses a minimum amount of memory (especially on an embedded processor).

If you know beforehand the values that may occur you can create a program that runs through a multitude of algorithms to find the best one - or, rather, the best parameters for your data.

I created such a program that you can read about in this post and achieved some very fast results. 16000 entries translates roughly to 2^14 or an average of 14 comparisons to find the value using a binary search. I explicitly aimed for very fast lookups - on average finding the value in <=1.5 lookups - which resulted in greater RAM requirements. I believe that with a more conservative average value (say <=3) a lot of memory could be saved. By comparison the average case for a binary search on your 256 or 1024 entries would result in an average number of comparisons of 8 and 10, respectively.

My average lookup required around 60 cycles (on a laptop with an Intel i5) with a generic algorithm (utilizing one division by a variable) and 40-45 cycles with a specialized one (probably utilizing a multiplication). This should translate into sub-microsecond lookup times on your MCU, depending of course on the clock frequency it executes at.

It can be real-life-tweaked further if the entry array keeps track of how many times an entry was accessed. If the entry array is sorted from most to least accessed before the indices are computed, then it'll find the most commonly occurring values with a single comparison.