Quoting the Spark DataFrames, Datasets and SQL manual:
A handful of Hive optimizations are not yet included in Spark. Some of these (such as indexes) are less important due to Spark SQL’s in-memory computational model. Others are slotted for future releases of Spark SQL.
Being new to Spark, I'm a bit baffled by this for two reasons:
- Spark SQL is designed to process Big Data, and at least in my use case the data size far exceeds the size of available memory. Assuming this is not uncommon, what is meant by "Spark SQL’s in-memory computational model"? Is Spark SQL recommended only for cases where the data fits in memory?
- Even assuming the data fits in memory, a full scan over a very large dataset can take a long time. I read this argument against indexing in an in-memory database, but I was not convinced. The example there discusses a scan of a 10,000,000-record table, but that's not really big data. Scanning a table with billions of records can cause simple queries of the "SELECT x WHERE y=z" type to take forever instead of returning immediately.
I understand that indexes have disadvantages like slower INSERT/UPDATE, space requirements, etc. But in my use case, I first process and load a large batch of data into Spark SQL, and then explore this data as a whole, without further modifications. Spark SQL is useful for the initial distributed processing and loading of the data, but the lack of indexing makes interactive exploration slower and more cumbersome than I expected it to be.
I'm wondering, then, why the Spark SQL team considers indexes unimportant to such a degree that they're off its road map. Is there a different usage pattern that can provide the benefits of indexing without resorting to implementing something equivalent independently?
2 Answers
#1
Indexing input data
- The fundamental reason why indexing over external data sources is not in the Spark scope is that Spark is not a data management system but a batch data processing engine. Since it doesn't own the data it is using, it cannot reliably monitor changes and, as a consequence, cannot maintain indices.
- If a data source supports indexing, Spark can take advantage of it indirectly through mechanisms like predicate pushdown (see the sketch below).
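For example, a minimal sketch of that pattern in Scala, assuming a hypothetical JDBC source (the connection URL, table name, and column y are all invented for illustration):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().appName("pushdown-sketch").getOrCreate()

    // Hypothetical JDBC source: Spark does not index this table itself, but it can
    // push the filter down to the database, which may use its own index on "y".
    val events = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/mydb")   // hypothetical connection
      .option("dbtable", "events")                           // hypothetical table name
      .load()

    // The filter shows up as a pushed predicate in the physical plan, so the
    // selective work happens in the source system rather than as a full Spark scan.
    events.filter(col("y") === "z").explain()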
Indexing Distributed Data Structures:
- Standard indexing techniques require a persistent and well-defined data distribution, but data in Spark is typically ephemeral and its exact distribution is nondeterministic.
- A high-level data layout achieved by proper partitioning, combined with columnar storage and compression, can provide very efficient distributed access without the overhead of creating, storing, and maintaining indices. This is a common pattern used by different in-memory columnar systems (see the sketch after this list).
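As a rough sketch of that layout pattern (the paths and the partition column day are invented for illustration):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("layout-sketch").getOrCreate()

    // Write columnar Parquet partitioned by a column that queries commonly filter on;
    // the directory layout (.../day=2017-01-01/) plays the role an index would.
    val events = spark.read.parquet("/data/events_raw")             // hypothetical input path
    events.write.partitionBy("day").parquet("/data/events_by_day")  // hypothetical column

    // A later filter on the partition column reads only the matching directories
    // (partition pruning), and Parquet column statistics let Spark skip row groups too.
    spark.read.parquet("/data/events_by_day").where("day = '2017-01-01'").show()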
That being said, some forms of indexed structures do exist in the Spark ecosystem. Most notably, Databricks provides a Data Skipping Index on its platform.
Other projects, like Succinct (mostly inactive today), take a different approach and use advanced compression techniques with random-access support.
Of course, this raises a question: if you require efficient random access, why not use a system designed as a database from the beginning? There are many choices out there, including at least a few maintained by the Apache Foundation. At the same time, Spark is evolving as a project, and the quote you used might not fully reflect future Spark directions.
#2
In general, the utility of indexes is questionable at best. Instead, data partitioning is more important. They are very different things, and just because your database of choice supports indexes doesn't mean they make sense given what Spark is trying to do. And it has nothing to do with "in memory".
So what is an index, anyway?
Back in the days when permanent storage was crazy expensive (instead of essentially free), relational database systems were all about minimizing usage of permanent storage. The relational model, by necessity, split a record into multiple parts -- normalized the data -- and stored them in different locations. To read a customer record, maybe you read a customer table, a customerType table, take a couple of entries out of an address table, etc. If you had a solution that required you to read the entire table to find what you want, that would be very costly, because you have to scan so many tables.
But this is not the only way to do things. If you don't need fixed-width columns, you can store the entire set of data in one place. Instead of doing a full-table scan on a bunch of tables, you only need to do it on a single table. And that's not as bad as you think it is, especially if you can partition your data.
40 years later, the laws of physics have changed. Hard drive random read/write speeds and linear read/write speeds have drastically diverged. You can basically do 350 head movements a second per disk. (A little more or less, but that's a good average number.) On the other hand, a single disk drive can read about 100 MB per second. What does that mean?
Do the math and think about it -- it means if you are reading less than 300KB per disk head move, you are throttling the throughput of your drive.
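Roughly: 100 MB/s ÷ 350 head moves/s ≈ 285 KB per head move, so any read much smaller than about 300KB leaves the drive seek-bound rather than throughput-bound.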
Seriously. Think about that for a second.
The goal of an index is to allow you to move your disk head to the precise location on disk you want and just read that record -- say just the address record joined as part of your customer record. And I say, that's useless.
If I were designing an index based on modern physics, it would only need to get me within 100KB or so of the target piece of data (assuming my data had been laid out in large chunks -- but we're talking theory here anyway). Based on the numbers above, any more precision than that is just a waste.
Now go back to your normalized table design. Say a customer record is really split across 6 rows held in 5 tables. That's 6 disk head movements in total (I'll assume the index is cached in memory, so no disk movement for it). That means I can read 1.8 MB of linear / de-normalized customer records -- 6 seeks x ~300KB -- and be just as efficient.
And what about customer history? Suppose I wanted to not just see what the customer looks like today -- imagine I want the complete history, or a subset of the history? Multiply everything above by 10 or 20 and you get the picture.
What would be better than an index would be data partitioning -- making sure all of the customer records end up in one partition. That way with a single disk head move, I can read the entire customer history. One disk head move.
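In Spark terms, one hedged sketch of that co-location idea (the paths and the customer_id / event_date columns are invented for illustration):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    val spark = SparkSession.builder().appName("colocate-sketch").getOrCreate()

    val history = spark.read.parquet("/data/customer_history_raw")   // hypothetical path

    // Put each customer's rows into the same partition and keep them adjacent,
    // so reading one customer's history is one mostly-linear read instead of many seeks.
    history
      .repartition(col("customer_id"))                    // hypothetical key column
      .sortWithinPartitions("customer_id", "event_date")  // hypothetical sort columns
      .write
      .parquet("/data/customer_history_by_customer")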
Tell me again why you want indexes.
Indexes vs ___ ?
Don't get me wrong -- there is value in "pre-cooking" your searches. But the laws of physics suggest a better way to do it than traditional indexes. Instead of storing the customer record in exactly one location, and creating a pointer to it -- an index -- why not store the record in multiple locations?
Remember, disk space is essentially free. Instead of trying to minimize the amount of storage we use -- an outdated artifact of the relational model -- just use your disk as your search cache.
If you think someone wants to see customers listed both by geography and by sales rep, then make multiple copies of your customer records, stored in a way that optimizes those searches. Like I said, use the disk like your in-memory cache. Instead of building your in-memory cache by drawing together disparate pieces of persistent data, build your persistent data to mirror your in-memory cache so all you have to do is read it. In fact, don't even bother trying to store it in memory -- just read it straight from disk every time you need it.
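A hedged sketch of that "disk as search cache" idea in Spark (the paths and the region / sales_rep columns are invented for illustration):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("search-cache-sketch").getOrCreate()

    // Same customer data, materialized twice, each copy laid out for one access pattern.
    val customers = spark.read.parquet("/data/customers")   // hypothetical path

    customers.write.partitionBy("region").parquet("/data/customers_by_region")
    customers.write.partitionBy("sales_rep").parquet("/data/customers_by_sales_rep")

    // Each "search" then reads straight from the layout built for it, no index needed.
    spark.read.parquet("/data/customers_by_region").where("region = 'EMEA'").show()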
If you think that sounds crazy, consider this -- if you cache it in memory you're probably going to cache it twice. It's likely your OS / drive controller uses main memory as cache. Don't bother caching the data because someone else is already!
But I digress...
Long story short, Spark absolutely does support the right kind of indexing -- the ability to create complicated derived data from raw data to make future uses more efficient. It just doesn't do it the way you want it to.
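For instance, one way to "pre-cook" a lookup in Spark itself -- a sketch, with an invented path, column y, and table name -- is to persist a bucketed, sorted copy of the data:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("precooked-sketch").getOrCreate()

    val events = spark.read.parquet("/data/events")   // hypothetical path

    // Persist a derived copy pre-organized by the lookup key; later joins or filters on
    // "y" can avoid a shuffle and touch far fewer files than the raw layout would.
    events.write
      .bucketBy(256, "y")                             // hypothetical key column
      .sortBy("y")
      .saveAsTable("events_bucketed_by_y")

Note that bucketed output has to be written with saveAsTable rather than a plain path, since the bucketing metadata lives in the catalog.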