递归相关表达式如何加速不同的查询？

I found this post about speeding up distinct queries:

我发现这篇关于加快不同查询的帖子：

Super-fast DISTINCT using a recursive CTE:

使用递归CTE的超快速DISTINCT：

USE     tempdb;
GO
DROP    TABLE dbo.Test;
GO
CREATE  TABLE 
        dbo.Test 
        (
        data            INTEGER NOT NULL,
        );
GO
CREATE  CLUSTERED INDEX c ON dbo.Test (data);
GO
-- Lots of duplicated values
INSERT  dbo.Test WITH (TABLOCK)
        (data)
SELECT  TOP (5000000)
        ROW_NUMBER() OVER (ORDER BY (SELECT 0)) / 117329
FROM    master.sys.columns C1,
        master.sys.columns C2,
        master.sys.columns C3;
GO



SET     STATISTICS TIME ON;

-- 1591ms CPU
SELECT  DISTINCT 
        data
FROM    dbo.Test;

-- 15ms CPU

- 15ms CPU

WITH    RecursiveCTE
AS      (
        SELECT  data = MIN(T.data)
        FROM    dbo.Test T
        UNION   ALL
        SELECT  R.data
        FROM    (
                -- A cunning way to use TOP in the recursive part of a CTE Smile
                SELECT  T.data,
                        rn = ROW_NUMBER() OVER (ORDER BY T.data)
                FROM    dbo.Test T
                JOIN    RecursiveCTE R
                        ON  R.data < T.data
                ) R
        WHERE   R.rn = 1
        )
SELECT  *
FROM    RecursiveCTE
OPTION  (MAXRECURSION 0);

SET     STATISTICS TIME OFF;
GO
DROP    TABLE dbo.Test;

The recursive CTE is 100 times more efficient :-) This kind of speedup would be extremely valuable for my current project, but I am not sure in which cases this approach is beneficial.

递归CTE效率提高了100倍:-)这种加速对我目前的项目非常有价值，但我不确定这种方法在哪种情况下是有益的。

To be honest: I don't get why this speeds up the query that much and why the database cannot do this optimization itself. Can you explain how this works and why it is so effective?

说实话：我不明白为什么这会加速查询的速度以及为什么数据库本身无法进行优化。你能解释一下这是如何工作的以及它为何如此有效吗？

Edit: I see a similar effect on sybase, so this approach does not seem to be valid for sql-server only.

编辑：我在sybase上看到了类似的效果，所以这种方法似乎对sql-server无效。

Sub-question: is the recursive CTE useful for other data base systems as well?

子问题：递归CTE对其他数据库系统也有用吗？

2 个解决方案

#1

Paul White explained this "trick" in great detail in his post Performance Tuning the Whole Query Plan under Finding the Distinct Values section.

保罗·怀特在他的“调整不同价值观”部分的“调整整体查询计划”中详细解释了这个“技巧”。

Why the database cannot do this optimization itself?

为什么数据库本身无法进行优化？

Is the recursive CTE useful for other data base systems as well?

递归CTE对其他数据库系统也有用吗？

Optimizer is not perfect and it doesn't implement all possible techniques. People asked Microsoft to implement it. See this Connect item Implement Index Skip Scan. It was closed as Won't Fix, but that doesn't mean that it won't be addressed in the future. Other DBMSes may have implemented it (Connect item says that Oracle implements this optimization). If this kind of optimization is implemented in the DBMS engine, then this "trick" is not needed and optimizer would choose optimal method of calculating the result based on available statistics.

优化器并不完美，并没有实现所有可能的技术。人们要求微软实施它。请参阅此连接项目实施索引跳过扫描。它被关闭，因为Will Not Fix，但这并不意味着将来不会解决它。其他DBMS可能已经实现了它（Connect item表示Oracle实现了这种优化）。如果在DBMS引擎中实现这种优化，则不需要此“技巧”，优化器将根据可用统计信息选择计算结果的最佳方法。

I don't get why this speeds up the query that much.

我不明白为什么这会加快查询速度。

I am not sure in which cases this approach is beneficial

我不确定这种方法在哪些情况下是有益的

The simple DISTINCT query scans the whole index. "Scans" means that it reads each and every page of the index from the disk and aggregates the values in memory (or tempdb) to get a list of distinct values.

简单的DISTINCT查询扫描整个索引。 “扫描”表示它从磁盘读取索引的每个页面，并聚合内存（或tempdb）中的值以获取不同值的列表。

If you know that the table has a lot of rows, but only few different distinct values, then reading all these repeating values is a waste of time. Recursive CTE forces the server to seek an index for the first distinct value, then seek an index for the second value and so on. "Seek" means that server uses binary search in the index to find the value. Usually one seek requires reading of only few pages from disk. "Index" is a balanced tree.

如果您知道该表有很多行，但只有很少的不同值，那么读取所有这些重复值就是浪费时间。递归CTE强制服务器为第一个不同的值寻找索引，然后为第二个值寻找索引，依此类推。 “Seek”表示服务器在索引中使用二进制搜索来查找值。通常一次搜索只需要从磁盘读取几页。 “索引”是一棵平衡的树。

If the table has only few distinct values it is faster to seek few times, than read all pages of the index. On the other hand, if there are a lot of distinct values then it would be faster to read all pages sequentially, than seeking for each consecutive value. This should give you an idea in what cases this approach is beneficial.

如果表只有很少的不同值，那么寻找几次比读取索引的所有页面要快。另一方面，如果存在许多不同的值，那么顺序读取所有页面比寻找每个连续值更快。这应该可以让您了解这种方法在哪些情况下是有益的。

Obviously, if the table is small, it is faster to scan it. Only when the table becomes "large enough" you start seeing the difference in performance.

显然，如果表很小，扫描它会更快。只有当表格变得“足够大”时，才会开始看到性能上的差异。

There is a related question on dba.se: Is it possible to get seek based parallel plan for distinct/group by?

关于dba.se有一个相关的问题：是否有可能获得针对distinct / group by的基于搜索的并行计划？

#2

Noteable point when above script is run in my machine.

上述脚本在我的机器上运行时的注意事项。

Disctinct query= clustered scan is 94%

Disctinct query =聚集扫描率为94％

Recursive query=clustered scan is 14%

递归查询=聚集扫描为14％

This is the main reason

这是主要原因

Disctinct query

摘要查询

CPU time = 920 ms, elapsed time = 211 ms.

CPU时间= 920 ms，经过时间= 211 ms。

Recursive query

递归查询

CPU time = 0 ms, elapsed time = 64 ms.

CPU时间= 0 ms，经过时间= 64 ms。

In this example may be recursive option seem good.

在这个例子中可能是递归选项似乎很好。

but in real world example getting disctinct result using recursive may not be good idea.

但在现实世界的例子中，使用递归得到不明显的结果可能不是一个好主意。

Therefore you cannot say, "Super-fast DISTINCT using a recursive CTE:"

因此你不能说，“使用递归CTE的超快速DISTINCT：”

#1