为什么DB索引使用平衡树,而不是散列表?

时间:2022-09-20 13:45:10

Hashtables seem to be preferable in terms of disk access. What is the real reason that indexes usually implemented with a tree? Sorry if it's infantile, but i did not find the straight answer on SO.

在磁盘访问方面,散列表似乎更可取。索引通常使用树实现的真正原因是什么?对不起,如果是幼稚的,但我没有找到正确的答案。

7 个解决方案

#1


16  

Size, btrees start small and perfectly formed and grow nicely to enormous sizes. Hashes have a fixed size which can be too big (10,000 buckets for 1000 entries) or too small (10,000 buckets for 1,000,000,000 entries) for the amount of data you have.

大小,btree开始小和完美的成形,并成长为巨大的规模。散列有一个固定的大小,它可以是太大的(10000个桶,1000个条目),或者是太小(10000个桶),因为你拥有的数据量。

#2


15  

One of the common actions with data is to sort it or to search for data in a range - a tree will contain data in order while a hash table is only useful for looking up a row and has no idea of what the next row is.

使用数据的常见操作之一是对数据进行排序或在范围内搜索数据——树将按顺序包含数据,而哈希表仅用于查找一行,并且不知道下一行是什么。

So hash tables are no good for this common case, thanks to this answer

所以哈希表对于这种常见的情况并不好,多亏了这个答案

SELECT * FROM MyTable WHERE Val BETWEEN 10000 AND 12000

or

SELECT * FROM MyTable ORDER BY x

Obviously there are cases where hash tables are better but best to deal with the main cases first.

显然,有些情况下哈希表更好,但最好先处理主要的情况。

#3


9  

Hash tables provide no benefit for this case:

散列表对这种情况没有好处:

SELECT * FROM MyTable WHERE Val BETWEEN 10000 AND 12000

#4


3  

One has to only look at MySQL's hash index implementation associated with MEMORY storage engine to see its disadvantages:

只需查看与内存存储引擎相关的MySQL哈希索引实现,就会发现它的缺点:

  1. They can be used with equality operators such as = but not with comparison operators such as <
  2. 它们可以用于等等操作符,如=,但不能用于比较操作符,如<
  3. The optimizer cannot use a hash index to speed up ORDER BY operations.
  4. 优化器不能使用哈希索引来通过操作来加快顺序。
  5. Only whole keys can be used to search for a row. (With a B-tree index, any leftmost prefix of the key can be used to find rows.)
  6. 只有整个键可以用来搜索一行。(使用B-tree索引,可以使用键的任何最左前缀来查找行。)
  7. Optimizer cannot determine approximately how many rows there are between two values (this is used by the range optimizer to decide which index to use).
  8. 优化器不能大约确定两个值之间有多少行(range优化器使用这个来决定使用哪个索引)。

And note that the above applies to hash indexes implemented in memory, without the added consideration of disk access matters associated with indexes implemented on disk. Disk access factors as noted by @silentbicycle would skew it in favour of the balanced-tree index even more.

注意,上面的方法适用于在内存中实现的散列索引,而不需要考虑与在磁盘上实现的索引相关的磁盘访问问题。如@ silentbike所指出的磁盘访问因子将使其更倾向于平衡树索引。

#5


2  

Databases typically use B+ trees (a specific kind of tree), since they have better disk access properties - each node can be made the size of a filesystem block. Doing as few disk reads as possible has a greater impact on speed, since comparatively little time is spent on either chasing pointers in a tree or hashing.

数据库通常使用B+树(一种特定类型的树),因为它们有更好的磁盘访问属性——每个节点都可以设置文件系统块的大小。尽可能少地读取磁盘会对速度产生更大的影响,因为在跟踪树中的指针或散列上花费的时间相对较少。

#6


-1  

Hasing is good when the data is not increasing, more techically when N/n is constant .. where N = No of elements and n = hash slots ..

当数据不增加时,高频是好的,当N/ N为常数时,高频是好的。其中N =无元素,N =哈希槽。

If this is not the case hashing doesnt give a good performance gain.

如果不是这样,散列不能获得良好的性能收益。

In database most probably the data would be increasing a significant pace so using hash there is not a good idea.

在数据库中,数据很可能会显著增加,所以使用散列不是一个好主意。

and yes sorting is there too ...

是的,排序也是存在的……

#7


-1  

"In database most probably the data would be increasing a significant pace so using hash there is not a good idea."

“在数据库中,数据很可能会显著增加,所以使用散列不是一个好主意。”

That is an over-exaggeration of the problem. Yes hash spaces must be fixed in size (modulo solutions ala extensible hashing) and yes, their size must be managed, and yes, someone must do that job.

这是对问题的过分夸大。是的,哈希空间的大小必须是固定的(模块化解决方案即可扩展哈希),是的,它们的大小必须被管理,是的,必须有人来做这个工作。

That said, the performance gains if you exploit hash-based physical location to its fullest potential, are enormous.

也就是说,如果充分利用基于散列的物理位置,性能将得到巨大的提高。

#1


16  

Size, btrees start small and perfectly formed and grow nicely to enormous sizes. Hashes have a fixed size which can be too big (10,000 buckets for 1000 entries) or too small (10,000 buckets for 1,000,000,000 entries) for the amount of data you have.

大小,btree开始小和完美的成形,并成长为巨大的规模。散列有一个固定的大小,它可以是太大的(10000个桶,1000个条目),或者是太小(10000个桶),因为你拥有的数据量。

#2


15  

One of the common actions with data is to sort it or to search for data in a range - a tree will contain data in order while a hash table is only useful for looking up a row and has no idea of what the next row is.

使用数据的常见操作之一是对数据进行排序或在范围内搜索数据——树将按顺序包含数据,而哈希表仅用于查找一行,并且不知道下一行是什么。

So hash tables are no good for this common case, thanks to this answer

所以哈希表对于这种常见的情况并不好,多亏了这个答案

SELECT * FROM MyTable WHERE Val BETWEEN 10000 AND 12000

or

SELECT * FROM MyTable ORDER BY x

Obviously there are cases where hash tables are better but best to deal with the main cases first.

显然,有些情况下哈希表更好,但最好先处理主要的情况。

#3


9  

Hash tables provide no benefit for this case:

散列表对这种情况没有好处:

SELECT * FROM MyTable WHERE Val BETWEEN 10000 AND 12000

#4


3  

One has to only look at MySQL's hash index implementation associated with MEMORY storage engine to see its disadvantages:

只需查看与内存存储引擎相关的MySQL哈希索引实现,就会发现它的缺点:

  1. They can be used with equality operators such as = but not with comparison operators such as <
  2. 它们可以用于等等操作符,如=,但不能用于比较操作符,如<
  3. The optimizer cannot use a hash index to speed up ORDER BY operations.
  4. 优化器不能使用哈希索引来通过操作来加快顺序。
  5. Only whole keys can be used to search for a row. (With a B-tree index, any leftmost prefix of the key can be used to find rows.)
  6. 只有整个键可以用来搜索一行。(使用B-tree索引,可以使用键的任何最左前缀来查找行。)
  7. Optimizer cannot determine approximately how many rows there are between two values (this is used by the range optimizer to decide which index to use).
  8. 优化器不能大约确定两个值之间有多少行(range优化器使用这个来决定使用哪个索引)。

And note that the above applies to hash indexes implemented in memory, without the added consideration of disk access matters associated with indexes implemented on disk. Disk access factors as noted by @silentbicycle would skew it in favour of the balanced-tree index even more.

注意,上面的方法适用于在内存中实现的散列索引,而不需要考虑与在磁盘上实现的索引相关的磁盘访问问题。如@ silentbike所指出的磁盘访问因子将使其更倾向于平衡树索引。

#5


2  

Databases typically use B+ trees (a specific kind of tree), since they have better disk access properties - each node can be made the size of a filesystem block. Doing as few disk reads as possible has a greater impact on speed, since comparatively little time is spent on either chasing pointers in a tree or hashing.

数据库通常使用B+树(一种特定类型的树),因为它们有更好的磁盘访问属性——每个节点都可以设置文件系统块的大小。尽可能少地读取磁盘会对速度产生更大的影响,因为在跟踪树中的指针或散列上花费的时间相对较少。

#6


-1  

Hasing is good when the data is not increasing, more techically when N/n is constant .. where N = No of elements and n = hash slots ..

当数据不增加时,高频是好的,当N/ N为常数时,高频是好的。其中N =无元素,N =哈希槽。

If this is not the case hashing doesnt give a good performance gain.

如果不是这样,散列不能获得良好的性能收益。

In database most probably the data would be increasing a significant pace so using hash there is not a good idea.

在数据库中,数据很可能会显著增加,所以使用散列不是一个好主意。

and yes sorting is there too ...

是的,排序也是存在的……

#7


-1  

"In database most probably the data would be increasing a significant pace so using hash there is not a good idea."

“在数据库中,数据很可能会显著增加,所以使用散列不是一个好主意。”

That is an over-exaggeration of the problem. Yes hash spaces must be fixed in size (modulo solutions ala extensible hashing) and yes, their size must be managed, and yes, someone must do that job.

这是对问题的过分夸大。是的,哈希空间的大小必须是固定的(模块化解决方案即可扩展哈希),是的,它们的大小必须被管理,是的,必须有人来做这个工作。

That said, the performance gains if you exploit hash-based physical location to its fullest potential, are enormous.

也就是说,如果充分利用基于散列的物理位置,性能将得到巨大的提高。