Why and when is it necessary to rebuild indexes in MongoDB?

Date: 2022-09-11 19:47:44

I've been working with MongoDB for a while, and today a question came up while I was talking with a colleague.

The thing is that when you create an index in MongoDB, the collection is processed and the index is built.

The index is updated whenever documents are inserted or deleted, so I don't really see the need to run a rebuild index operation (which drops the index and then rebuilds it).

According to MongoDB documentation:

Normally, MongoDB compacts indexes during routine updates. For most users, the reIndex command is unnecessary. However, it may be worth running if the collection size has changed significantly or if the indexes are consuming a disproportionate amount of disk space.

Has anyone actually needed to run a rebuild index operation and found it worthwhile?

2 solutions

#1


11  

As per the MongoDB documentation, there is generally no need to routinely rebuild indexes.

NOTE: Any advice on storage becomes more interesting with MongoDB 3.0+, which introduced a pluggable storage engine API. My comments below are specifically in reference to the default MMAP storage engine in MongoDB 3.0 and earlier. WiredTiger and other storage engines have different storage implementations for data & indexes.

There may be some benefit in rebuilding an index with the MMAP storage engine if:

  • An index is consuming a larger than expected amount of space compared to the data. Note: you need to monitor historical data & index size to have a baseline for comparison.

  • You want to migrate from an older index format to a newer one. If a reindex is advisable, this will be mentioned in the upgrade notes. For example, MongoDB 2.0 introduced significant index performance improvements, so the release notes include a suggested reindex to the v2.0 format after upgrading. Similarly, MongoDB 2.6 introduced 2dsphere (version 2) indexes, which have a different default behaviour (sparse by default). Existing indexes are not rebuilt after index version upgrades; the choice of if/when to upgrade is left to the database administrator.

  • You have changed the _id format for a collection from a monotonically increasing key (e.g. ObjectId) to a random value, or vice versa. This is a bit esoteric, but there's an index optimisation that splits b-tree buckets 90/10 (instead of 50/50) if you are inserting _ids that are always increasing (ref: SERVER-983). If the nature of your _ids changes significantly, it may be possible to build a more efficient b-tree with a re-index.

For more information on general B-tree behaviour, see: Wikipedia: B-tree
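
To get a feel for why the 90/10 split mentioned above matters for always-increasing keys, here is a rough Python sketch. It models only a flat row of leaf buckets, not MongoDB's actual b-tree code; the function name, the bucket capacity of 100 and the `left_share` parameter are all made up for illustration.

```python
import bisect


def average_leaf_fill(keys, capacity=100, left_share=0.5):
    """Insert keys into sorted leaf buckets and return the average fill ratio.

    left_share is the fraction of entries kept in the left bucket when a
    full bucket is split (0.5 for a 50/50 split, 0.9 for a 90/10 split).
    """
    buckets = [[]]
    for key in keys:
        # Target the first bucket whose largest key is >= key, else the last one.
        idx = next(
            (i for i, b in enumerate(buckets) if b and b[-1] >= key),
            len(buckets) - 1,
        )
        bucket = buckets[idx]
        bisect.insort(bucket, key)
        if len(bucket) > capacity:
            # Overflow: split the bucket in place according to left_share.
            cut = int(len(bucket) * left_share)
            buckets[idx:idx + 1] = [bucket[:cut], bucket[cut:]]
    return sum(len(b) for b in buckets) / (capacity * len(buckets))


# Monotonically increasing keys, similar in spirit to ObjectId-style _ids.
fill_half = average_leaf_fill(range(10_000), left_share=0.5)
fill_ninety = average_leaf_fill(range(10_000), left_share=0.9)
print(f"50/50 split: {fill_half:.2f}, 90/10 split: {fill_ninety:.2f}")
```

With increasing keys every insert lands in the last bucket, so a 50/50 split leaves each earlier bucket about half empty forever, while a 90/10 split leaves them about 90% full. The two fill ratios should come out near 0.5 and 0.9 respectively.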

Visualising index usage

If you're really curious to dig into the index internals a bit more, there are some experimental commands/tools you can try. I expect these are limited to MongoDB 2.4 & 2.6 only:

#2


2  

While I don't know the exact technical reasons in MongoDB, I can make some assumptions about this, based on what I know about indexing in other systems and on the documentation that you quoted.

The General Idea Of An Index

When moving from one document to the next in the full document collection, a lot of time and effort is wasted skipping past data that doesn't need to be dealt with. If you're looking for the document with id "1234", having to move through 100K+ of each document makes it slow.

Rather than having to search through all of the content of each document in the collection (physically moving the disk read heads, etc.), an index makes this fast. It's basically a set of key/value pairs that give you the id and the location of each document. MongoDB can quickly scan through the ids in the index, find the locations of the documents that it needs, and go load them directly.
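
As a rough illustration of that idea (not how MongoDB is implemented internally), here is a small Python sketch that compares a full scan with a lookup through a sorted list of (id, position) pairs. The in-memory `documents` list and the helper names are stand-ins I made up.

```python
import bisect

# Stand-in for a collection: documents stored in no useful order.
documents = [{"_id": i, "body": "x" * 100} for i in range(10_000, 0, -1)]


def find_by_scan(docs, wanted_id):
    """Walk every document until the id matches; also report how many were examined."""
    for examined, doc in enumerate(docs, start=1):
        if doc["_id"] == wanted_id:
            return doc, examined
    return None, len(docs)


# The "index": sorted (id, position) pairs, kept separate from the documents.
index = sorted((doc["_id"], pos) for pos, doc in enumerate(documents))
index_keys = [key for key, _ in index]


def find_by_index(docs, wanted_id):
    """Binary-search the sorted keys, then jump straight to the stored position."""
    i = bisect.bisect_left(index_keys, wanted_id)
    if i < len(index_keys) and index_keys[i] == wanted_id:
        return docs[index[i][1]]
    return None


scanned_doc, examined = find_by_scan(documents, 1234)
indexed_doc = find_by_index(documents, 1234)
print(examined, indexed_doc is scanned_doc)
```

The scan has to touch thousands of documents before it reaches id 1234, while the index lookup only needs a handful of comparisons on small keys before loading a single document.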

Allocating File Size For An Index

Indexes take up disk space because they are basically key/value pairs, stored in a much smaller location than the documents themselves. If you have a very large collection (a large number of items in the collection), then your index grows in size.

Most operating systems allocate chunks of disk space in certain block sizes. Most databases also allocate disk space in large chunks, as needed.

Instead of growing 100K of file size when 100K of documents are added, MongoDB will probably grow 1MB or maybe 10MB or something - I don't know what the actual growth size is. In SQL Server, you can tell it how fast to grow, and MongoDB probably has something like that.

Growing in chunks gives the ability to 'grow' the documents into the space faster, because the database doesn't need to constantly expand. If the database now has 10MB of space already allocated, it can just use that space up. It doesn't have to keep expanding the file for each document. It just has to write the data to the file.
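
Here is a tiny Python sketch of the chunked-growth idea, counting how often the file has to be extended. The chunk sizes are guesses for illustration only (I don't know MongoDB's real growth size, as I said), and the function name is made up.

```python
def count_file_extensions(write_sizes, chunk_size):
    """Return (number of file extensions, bytes allocated) after all writes."""
    allocated = 0
    used = 0
    extensions = 0
    for size in write_sizes:
        used += size
        # Extend the file only when the data no longer fits in what is allocated.
        while used > allocated:
            allocated += chunk_size
            extensions += 1
    return extensions, allocated


KB = 1024
MB = 1024 * KB
writes = [100 * KB] * 100  # one hundred writes of 100KB each

small_chunks = count_file_extensions(writes, chunk_size=100 * KB)
large_chunks = count_file_extensions(writes, chunk_size=10 * MB)
print(small_chunks, large_chunks)
```

Growing by exactly the size of each write means one extension per write (100 in total), while a 10MB chunk covers all of these writes with a single extension, at the cost of a little unused space at the end of the file.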

This is probably true of collections and indexes for collections - anything that is stored on disk.

File Size And Index Re-Building

When a large collection has a lot of documents added and removed, the index becomes fragmented. Index keys may not be in order because there was room in the middle of the index file, and not at the end, when the index needed to be built. Index keys may also have a lot of space in between them.

If there are 10,000 items in the index, and # 10,001 needs to be inserted, it may be inserted in the middle of the index file. Now the index needs to re-build itself to put everything back in order. This involves moving a lot of data around, to make room at the end of the file and put item # 10,001 at the end.

If the index is constantly being thrashed - lots of stuff removed and added - it's probably faster to just grow the index file size and always put stuff at the end. This is fast for creating the index, but leaves empty holes in the file where old things were deleted.

If the index file has empty space where deleted things used to be, this is wasted effort when reading the index. The index file has more movement than needed, to get to the next item in the index. So, the index repairs itself... which can be time consuming for very large collections or very large changes to a collection.

Rebuild For A Large Index File

It can take a lot of disk access and I/O operations to correctly compact the index file back down to a reasonable size, with everything in order. Move out-of-place items to a temp location, free up space in the right spot, move them back. Oh, by the way, to free up space, you had to move other items to a temp location. It's recursive and heavy-handed.

Therefore, if you have a very large number of items in a collection and that collection has items added and removed on a regular basis, the index may need to be rebuilt from scratch. Doing this would wipe the current index file and rebuild from the ground up - which is probably going to be faster than trying to do thousands of moves inside of the existing file. Rather than moving things around, it just writes them sequentially, from scratch.

Large Change In Collection Size

Given everything I'm assuming above, a large change in the collection size would cause this kind of thrashing. If you have 10,000 documents in the collection and you delete 8,000 of them... well, now you have empty space in your index file where the 8,000 items used to be. MongoDB needs to move the remaining 2,000 items around in the physical file, to rebuild it in a compact form.

Instead of waiting around for 8,000 empty spaces to be cleaned up, it might be faster to rebuild from the ground up with the remaining 2,000 items.
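
To make that 10,000 / 8,000 example concrete, here is a toy Python model of my mental picture: a flat list of index slots where deletes leave holes, and a rebuild that writes the survivors out sequentially. It is only a sketch of the assumption above, not of MongoDB's real on-disk layout, and the helper names are made up.

```python
def delete_keys(slots, doomed):
    """Mark deleted entries as holes (None) without moving anything else."""
    return [None if key in doomed else key for key in slots]


def rebuild(slots):
    """Write the surviving keys sequentially into a fresh, compact list."""
    return sorted(key for key in slots if key is not None)


slots = list(range(10_000))        # 10,000 index entries, one per document
doomed = set(range(8_000))         # hypothetical: the first 8,000 get deleted

after_delete = delete_keys(slots, doomed)
holes = after_delete.count(None)
compact = rebuild(after_delete)

print(len(after_delete), holes, len(compact))
```

After the deletes the structure still spans 10,000 slots, 8,000 of which are holes. The rebuild never shuffles entries around inside the old structure; it just writes the 2,000 survivors into a new one.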

Conclusion? Maybe?

So, the documentation that you quoted is probably going to deal with "big data" needs or high thrashing collections and indexes.

Also keep in mind that I'm making an educated guess based on what I know about indexing, disk allocation, file fragmentation, etc.

My guess is that "most users" in the documentation means that 99.9% or more of MongoDB collections don't need to worry about this.

MongoDB specific case

According to MongoDB documentation:

The remove() method does not remove the indexes

So if you delete documents from a collection, the indexes stay in place, and the disk space they occupy may remain wasted unless you rebuild the index for that collection.
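
One way to judge whether an index has become disproportionately large after deletes is to compare the index-to-data ratio against a baseline you recorded earlier. Below is a minimal Python sketch that works on plain dicts shaped like `collStats` output (using its `size` and `totalIndexSize` fields). The sample numbers, the function names and the 1.5x tolerance are arbitrary assumptions for illustration, not recommended thresholds.

```python
def index_to_data_ratio(stats):
    """Return totalIndexSize / size for a collStats-style dict (0.0 if no data)."""
    if stats["size"] == 0:
        return 0.0
    return stats["totalIndexSize"] / stats["size"]


def index_looks_bloated(current, baseline, tolerance=1.5):
    """True if the current ratio exceeds the baseline ratio by more than tolerance."""
    return index_to_data_ratio(current) > tolerance * index_to_data_ratio(baseline)


# Made-up numbers: stats recorded before and after a large delete.
baseline = {"size": 500_000_000, "totalIndexSize": 50_000_000}
after_deletes = {"size": 100_000_000, "totalIndexSize": 45_000_000}

bloated = index_looks_bloated(after_deletes, baseline)
print(index_to_data_ratio(baseline), index_to_data_ratio(after_deletes), bloated)
```

In this made-up case the data shrank to a fifth of its size while the indexes barely shrank, so the check flags the collection as a candidate for a rebuild. Without a recorded baseline there is nothing to compare against.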
