Why does Lucene use arrays instead of hash tables for its inverted index?

Date: 2022-10-06 04:15:48

I was watching Adrien Grand's talk on Lucene's index architecture and a point he makes is that Lucene uses sorted arrays to represent the dictionary part of its inverted indices. What's the reasoning behind using sorted arrays instead of hash tables (the "classic" inverted index data structure)?


Hash tables provide O(1) insertion and access, which to me seems like it would help a lot with quickly processing queries and merging index segments. On the other hand, sorted arrays can only offer up O(logN) access and (gasp) O(N) insertion, although merging 2 sorted arrays is the same complexity as merging 2 hash tables.

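To make the merge point concrete, here is a minimal sketch (hypothetical code, not Lucene's actual implementation) of combining two sorted term dictionaries in a single O(N + M) pass, which is essentially what merging two sorted index segments looks like:

```java
import java.util.Arrays;

// Sketch: merge two sorted term arrays in one sequential O(N + M) pass.
// The class and method names are made up for illustration.
public class TermMerge {
    static String[] merge(String[] a, String[] b) {
        String[] out = new String[a.length + b.length];
        int i = 0, j = 0, k = 0;
        while (i < a.length && j < b.length) {
            // always emit the lexicographically smaller term next
            out[k++] = a[i].compareTo(b[j]) <= 0 ? a[i++] : b[j++];
        }
        // drain whichever input still has terms left
        while (i < a.length) out[k++] = a[i++];
        while (j < b.length) out[k++] = b[j++];
        return out;
    }
}
```

A real segment merge would additionally union the postings lists of equal terms, but the access pattern is the same: purely sequential reads over both inputs, which is ideal for disk and cache.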

The only downsides to hash tables that I can think of are a larger memory footprint (this could indeed be a problem) and less cache friendliness (although operations like querying a sorted array require binary search which is just as cache unfriendly).

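For the lookup side, access on a sorted dictionary is just a binary search over the term array. A sketch using the standard library (the class name is hypothetical):

```java
import java.util.Arrays;

// Sketch: O(log N) term lookup in a sorted dictionary.
// Arrays.binarySearch returns the index if the term is present,
// or (-(insertionPoint) - 1) if it is not.
public class TermLookup {
    static int find(String[] sortedTerms, String term) {
        return Arrays.binarySearch(sortedTerms, term);
    }
}
```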

So what's up? The Lucene devs must have had a very good reason for using arrays. Is it something to do with scalability? Disk read speeds? Something else entirely?


1 solution

#1



Well, I will speculate here (should probably be a comment - but it's going to be too long).


  1. HashMap is in general a fast look-up structure with O(1) search time on average. But that is only the average case: since Java 8, a HashMap converts a bucket into a balanced tree (TreeNodes) once it grows past a threshold, so lookups inside such a bucket are O(log n). And even if we treat both structures as O(1), that does not mean they take the same wall-clock time; it only means the cost is constant for each structure, with possibly very different constants.


  2. Memory, indeed - I will give an example here. In short, storing 15_000_000 entries in a HashMap would require a little over 1 GB of RAM; sorted arrays are probably much more compact, especially since they can hold primitives instead of objects.

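As a rough back-of-the-envelope check on that memory claim (the per-object sizes below are assumptions for a typical 64-bit JVM with compressed oops, not measured values):

```java
// Sketch: rough per-entry cost of HashMap<Key, Integer> vs. a primitive array.
// Assumed sizes (64-bit JVM, compressed oops):
//   HashMap.Node  ~32 bytes (object header + hash + key/value/next references)
//   boxed Integer ~16 bytes (header + int field + padding)
//   table slot     ~8 bytes (one reference in the bucket array)
// A primitive long[] costs just 8 bytes per element, with no per-entry objects.
public class MemoryEstimate {
    static long hashMapOverheadBytes(long entries) {
        return entries * (32 + 16 + 8); // ~56 bytes/entry, before the key itself
    }
    static long primitiveArrayBytes(long entries) {
        return entries * 8; // one long per entry
    }
}
```

With these assumptions, 15_000_000 entries cost about 840 MB of pure overhead before counting the keys, which is consistent with the "a little over 1 GB" figure, while a primitive array holding the same count is around 120 MB.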

  3. Putting entries into a HashMap (usually) requires all the keys to be re-hashed whenever the table is resized, which can be a significant performance hit, since potentially every entry has to move to a different bucket.

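That resize cost can be sketched with a rough model of HashMap's default growth policy (initial capacity 16, load factor 0.75; the move count is an approximation for illustration, not what the JDK literally reports):

```java
// Sketch: approximate total entry moves caused by table resizes while
// growing a default HashMap to hold n entries. Each time the size passes
// capacity * 0.75, the table doubles and every present entry is re-hashed.
public class RehashCost {
    static long totalMoves(long n) {
        long moves = 0;
        long threshold = 12;          // 16 * 0.75, the default resize trigger
        while (threshold < n) {
            moves += threshold;       // roughly everything present moves
            threshold *= 2;
        }
        return moves;
    }
}
```

The resizes are amortized O(1) per insert, but each one is a burst of work and garbage, whereas appending to a pre-sized sorted array (or writing one out during a merge) never re-touches earlier entries.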

  4. Probably one extra point here - searches over ranges, which would require something like a TreeMap with a hash-based design, whereas sorted arrays are much better suited to them. I'm also thinking about partitioning an index (maybe they do that internally).

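The range-search advantage is easy to show: on a sorted array, a range query is two binary searches plus one contiguous slice, something a hash table cannot do without scanning every key. A hypothetical helper, using java.util.Arrays' insertion-point convention:

```java
import java.util.Arrays;

// Sketch: all terms in [fromIncl, toExcl) from a sorted dictionary,
// found with two O(log N) binary searches and one contiguous copy.
public class RangeQuery {
    static String[] range(String[] sorted, String fromIncl, String toExcl) {
        int lo = Arrays.binarySearch(sorted, fromIncl);
        if (lo < 0) lo = -lo - 1;     // not present: use the insertion point
        int hi = Arrays.binarySearch(sorted, toExcl);
        if (hi < 0) hi = -hi - 1;
        return Arrays.copyOfRange(sorted, lo, hi);
    }
}
```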

  5. I have the same idea as you - arrays are contiguous memory, so they are probably much easier for the CPU to prefetch.


  6. And the last point: put me in their shoes, and I would have started with a HashMap first... I am sure there are compelling reasons for their decision; I wonder whether they have actual benchmarks that prove this choice.

