Can I optimize a SELECT DISTINCT x FROM hugeTable query by creating an index on column x?

I have a huge table in which some column x has far fewer distinct values than there are rows (fewer by orders of magnitude).

I need to do a query like SELECT DISTINCT x FROM hugeTable, and I want to do this relatively fast.

I did something like CREATE INDEX hugeTable_by_x ON hugeTable(x), but for some reason, even though the output is small, the query is not as fast as I expected. The query plan shows that 97% of the time is spent on an Index Scan of hugeTable_by_x, with an estimated number of rows equal to the size of the entire table. This is followed by, among other things, a Hash Match operation.

Since I created an index on column x, can I not expect this query to run very quickly?

Note that I'm using Microsoft SQL Server 2005.

7 solutions

#1

This is likely not a problem of indexing, but one of data design. Normalization, to be precise. The fact that you need to query the distinct values of a field, and are even willing to add an index for it, is a strong indicator that the field should be normalized into a separate table with a (small) join key. The distinct values are then available immediately by scanning the much smaller lookup table.

Update
As a workaround, you can create an indexed view that aggregates by the 'distinct' field. COUNT_BIG is an aggregate that is allowed in indexed views:

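-- The view must be schema-bound and reference the table by its two-part name;
-- replace "schema" with the table's actual schema (for example dbo).
create view vwDistinct
with schemabinding
as select x, count_big(*) as cnt  -- cnt is an arbitrary alias; every view column needs a name
from schema.hugetable
group by x;

-- The first index on a view must be unique and clustered.
create unique clustered index cdxDistinct on vwDistinct(x);

-- noexpand makes the query read the view's index instead of expanding the view.
select x from vwDistinct with (noexpand);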
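
To see why this helps reads and what it costs on writes, here is a rough Python model of such a maintained aggregate. It is only a sketch of the idea, not of how SQL Server stores an indexed view; the class name `DistinctCounts` and its methods are made up for illustration.

```python
from collections import Counter


class DistinctCounts:
    """Toy model of an aggregate that is kept up to date on every write."""

    def __init__(self):
        self.counts = Counter()  # value of x -> number of rows holding it

    def insert(self, x):
        # Extra work on each insert: bump the per-value row count.
        self.counts[x] += 1

    def delete(self, x):
        # Extra work on each delete: drop the entry once no rows remain.
        # Assumes x was inserted before; there is no guard against misuse.
        self.counts[x] -= 1
        if self.counts[x] == 0:
            del self.counts[x]

    def distinct(self):
        # Reads touch one entry per distinct value, not one per row.
        return sorted(self.counts)


view = DistinctCounts()
for value in ["a", "b", "a", "c", "a"]:
    view.insert(value)
view.delete("c")
print(view.distinct())
```

The row count is what lets the model know when a value has disappeared entirely, which is the same reason the view above carries count_big(*).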

#2

SQL Server does not implement any facility to seek directly to the next distinct value in an index, skipping duplicates along the way.

If you have many duplicates, you may be able to use a recursive CTE to simulate this. The technique comes from the article "Super-fast DISTINCT using a recursive CTE". For example:

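-- Anchor member: the smallest x, which an index on x can find with one seek.
with recursivecte as (
  select min(t.x) as x
  from hugetable t
  union all
  -- Recursive member: the smallest x greater than the previous one.
  -- row_number() stands in for TOP (1) or MIN, which are not allowed here.
  select ranked.x
  from (
    select t.x,
           row_number() over (order by t.x) as rnk
    from hugetable t
    join recursivecte r
      on r.x < t.x
  ) ranked
  where ranked.rnk = 1
)
select *
from recursivecte
option (maxrecursion 0);  -- lift the default limit of 100 recursion levels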
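
The idea behind the query above can be sketched outside SQL. The following Python snippet models the index as a sorted list and uses a binary search as a stand-in for an index seek; the function name `distinct_by_seeking` is made up for illustration, and this is only an analogy, not what the SQL Server engine executes.

```python
from bisect import bisect_right


def distinct_by_seeking(sorted_values):
    """Return the distinct values of an ascending list, plus the seek count."""
    result = []
    seeks = 0
    pos = 0
    while pos < len(sorted_values):
        current = sorted_values[pos]
        result.append(current)
        # Jump past every duplicate of `current` with one binary search,
        # the analogue of an index seek for the first key greater than it.
        pos = bisect_right(sorted_values, current, pos)
        seeks += 1
    return result, seeks


index = [1] * 1000 + [2] * 5000 + [7] * 3000  # 9000 entries, 3 distinct keys
values, seeks = distinct_by_seeking(index)
print(values, seeks)
```

The number of seeks equals the number of distinct values rather than the number of rows, which is why the approach pays off only when duplicates are plentiful.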

#3

If you know the values in advance and there is an index on column x (or if each value is likely to appear quickly on a seq scan of the whole table), it is much faster to query each one individually:

-- [values] stands for a table (or derived table) holding the known candidate values.
select vals.x
from [values] as vals (x)
where exists (select 1 from bigtable where bigtable.x = vals.x);

Proceeding with exists() performs as many index lookups as there are candidate values.

The way you've written it (which is correct if the values are not known in advance), the query engine will need to read the whole table and hash aggregate the mess to extract the values. (Which makes the index useless.)
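
The per-value probing described in this answer can be tried end to end with a small script. This is a sketch that assumes SQLite (through Python's sqlite3 module) rather than SQL Server, so it demonstrates the shape of the query, not SQL Server's plan; the table `known_values` and the sample data are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table bigtable (x text);
    create index bigtable_by_x on bigtable (x);
    create table known_values (x text);  -- stands in for [values]
""")
conn.executemany("insert into bigtable (x) values (?)",
                 [("a",)] * 500 + [("b",)] * 300 + [("d",)] * 200)
# Candidate values known in advance; "c" never occurs in bigtable.
conn.executemany("insert into known_values (x) values (?)",
                 [("a",), ("b",), ("c",), ("d",)])

# One existence probe per candidate instead of one pass over every row.
rows = conn.execute("""
    select vals.x
    from known_values as vals
    where exists (select 1 from bigtable where bigtable.x = vals.x)
    order by vals.x
""").fetchall()
found = [row[0] for row in rows]
print(found)
```

Candidates that never occur in the big table are simply filtered out, so the list of known values may be a superset of what is actually stored.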

#4

No. But there are some workarounds (excluding normalization):

Once the index is in place, it's possible to implement in SQL what the optimizer could be doing automatically:

https://*.com/a/29286754/538763 (multiple workarounds cited)

Other answers say you can normalize, which would solve your issue, but even once it's normalized SQL Server still likes to perform a scan to find the max() within group(s). Workarounds:

https://dba.stackexchange.com/questions/48848/efficiently-query-max-over-multiple-ranges?rq=1

#5

Possibly. Though it is not guaranteed - it entirely depends on the query.

I suggest reading this article by Gail Shaw (part 1 and part 2).

#6

When doing a SELECT DISTINCT on an indexed field, an index scan makes sense, as execution still has to read every value in the index for the entire table (assuming no WHERE clause, as seems to be the case in your example).

Indexes usually have more of an impact on WHERE conditions, JOINs, and ORDER BY clauses.

#7

Going by your description of the execution plan, I believe it's the best possible execution.

The Index Scan reads the entire index as stored (not in index order), and the Hash Match does the distinct.
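
As a rough picture of that plan, the Python sketch below deduplicates with a hash set while counting how many rows it has to read. It only illustrates hash-based distinct in general; the function name `hash_distinct` is made up and nothing here reflects SQL Server internals.

```python
def hash_distinct(rows):
    """Deduplicate with a hash set, reading every input row once."""
    seen = set()
    rows_read = 0
    for x in rows:
        rows_read += 1
        seen.add(x)  # duplicates hash to an existing entry and change nothing
    return seen, rows_read


rows = ["b", "a", "b", "c", "a", "b"] * 1000  # unordered input, 3 distinct values
distinct, rows_read = hash_distinct(rows)
print(sorted(distinct), rows_read)
```

The output is tiny, but the work is proportional to the number of rows read, which matches the plan spending most of its time in the scan.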

There might be other ways around your problem. In SQL Server, indexed views come to mind. However, that might give you a big hit for writes on that table.
