如何防止SQL Server对扫描的行数进行平方?

I'm running a query over a table variable that holds 22 227 rows. The query used to take 2-3 seconds to complete (which I still think is too slow) but since I added another field to the ORDER BY clause in DENSE_RANK() it now completes in 4.5 minutes!

我在一个包含22 227行的表变量上运行查询。查询过去需要2-3秒完成(我仍然认为这太慢)，但是由于我在DENSE_RANK()中向ORDER BY子句添加了另一个字段，现在只需4.5分钟就完成了!

If I include [t2].[aisdt] with or without [t2].[aiID], the execution plan shows that it's scanning 494 039 529 rows, which is 22 227 squared. The following query generates the correct results, just much too slowly to be useful.

如果我包括(t2)。[aisdt]带[t2]或不带[aisdt]。执行计划显示它正在扫描494 039 529行，即22 227的平方。下面的查询生成正确的结果，只是太慢了，不能发挥作用。

SELECT MAX([t].[SetNum]) OVER (PARTITION BY NULL) AS [MaxSet]
      ,*
FROM (
    SELECT DENSE_RANK() OVER (ORDER BY [t2].[aisdt], [t2].[aiID]) AS [SetNum]
          ,[t2].*
    FROM (
        SELECT [aiID]
              ,COUNT(DISTINCT [acID]) AS [noac]
        FROM @Temp
        GROUP BY [aiID]
    ) [t1]
    JOIN @Temp [t2]
      ON [t2].[aiID] = [t1].[aiID]
    WHERE [t1].[noac] < [t2].[asm]
) [t]

Just to be clear, the culprit is the bold section in "DENSE_RANK() OVER (ORDER BY [t2].[aisdt], [t2].[aiID])". Removing this field (which needs to remain) drops the execution time back down to 2-3 seconds. I think it might have something to do with JOINing the table to itself on [aiID] but not [aisdt].

明确地说，罪魁祸首是“DENSE_RANK() / (ORDER BY [t2])”中的粗体部分。[aisdt],[t2]。[携带])”。删除这个字段(需要保留)将执行时间降低到2-3秒。我认为这可能与在[aiID]上加入表本身有关，但与[aisdt]无关。

How can I speed this query up to complete in the same time as before, or less?

如何在相同的时间或更短的时间内完成查询?

EDIT

Table definition:

表定义:

DECLARE @Temp TABLE (
    [aiID] INT NOT NULL INDEX [IX_Temp_aiID] -- not unique
    ,[aisdt] DATETIME NOT NULL INDEX [IX_Temp_aisdt] -- not unique
    ,[asm] INT NOT NULL
    ,[cpcID] INT NULL
    ,[cpce] VARCHAR(10) NULL
    ,[acID] INT NULL
    ,[ctvID] INT NULL
    ,[ct] VARCHAR(100) NULL
    ,[_36_other_non_matched_fields_] VARCHAR(MAX)

    ,UNIQUE ([aiID], [cpcID], [cpce], [acID], [ctvID], [ct])
)

[aisdt] is unique per [aiID], but there can be multiple [aiID]s with the same [aisdt].

[aisdt]对于[aiID]是唯一的，但是可以有多个[aiID]s具有相同的[aisdt]。

INSERT INTO @TEMP
VALUES (64, '2017-03-23 10:00:00', 1, 17, '', NULL, NULL, NULL, 'blah')
      ,(64, '2017-03-23 10:00:00', 1, 34, '', NULL, NULL, NULL, 'blah')
      ,(99, '2017-04-08 09:00:00', 1, 25, 'Y', NULL, NULL, NULL, 'blah')
      ,(99, '2017-04-08 09:00:00', 1, 16, 'Y', NULL, NULL, NULL, 'blah')
      ,(99, '2017-04-08 09:00:00', 1, 76, 'Y', NULL, NULL, NULL, 'blah')
      ,(99, '2017-04-08 09:00:00', 1, 82, 'Y', NULL, NULL, NULL, 'blah')
      ,(42, '2017-04-14 16:00:00', 2, 32, '', 32, NULL, NULL, 'blah')
      ,(42, '2017-04-14 16:00:00', 2, 32, '', 47, NULL, NULL, 'blah')
      ,(42, '2017-04-14 16:00:00', 2, 47, '', 32, NULL, NULL, 'blah')
      ,(42, '2017-04-14 16:00:00', 2, 47, '', 47, NULL, NULL, 'blah')
      ,(54, '2017-03-23 10:00:00', 1, 17, '', NULL, NULL, NULL, 'blah')
      ,(54, '2017-03-23 10:00:00', 1, 34, '', NULL, NULL, NULL, 'blah')
      ,(89, '2017-04-08 09:00:00', 1, 25, 'Y', NULL, NULL, NULL, 'blah')
      ,(89, '2017-04-08 09:00:00', 1, 16, 'Y', NULL, NULL, NULL, 'blah')
      ,(89, '2017-04-08 09:00:00', 1, 76, 'Y', NULL, NULL, NULL, 'blah')
      ,(89, '2017-04-08 09:00:00', 1, 82, 'Y', NULL, NULL, NULL, 'blah')
      ,(32, '2017-04-14 16:00:00', 3, 32, '', 32, NULL, NULL, 'blah')
      ,(32, '2017-04-14 16:00:00', 3, 32, '', 47, NULL, NULL, 'blah')
      ,(32, '2017-04-14 16:00:00', 3, 47, '', 32, NULL, NULL, 'blah')
      ,(32, '2017-04-14 16:00:00', 3, 47, '', 47, NULL, NULL, 'blah')

It must be sorted by [aisdt] (datetime) first, then [aiID], then numbered into sets based on [aiID].

它必须先按[aisdt] (datetime)排序，然后是[aiID]，然后根据[aiID]将其编号为集合。

I want to see:

我想看看:

5, 1, 54, '2017-03-23 10:00:00', 1, 17, '', NULL, NULL, NULL, 'blah'
5, 1, 54, '2017-03-23 10:00:00', 1, 34, '', NULL, NULL, NULL, 'blah'
5, 2, 64, '2017-03-23 10:00:00', 1, 17, '', NULL, NULL, NULL, 'blah'
5, 2, 64, '2017-03-23 10:00:00', 1, 34, '', NULL, NULL, NULL, 'blah'
5, 3, 89, '2017-04-08 09:00:00', 1, 25, 'Y', NULL, NULL, NULL, 'blah'
5, 3, 89, '2017-04-08 09:00:00', 1, 16, 'Y', NULL, NULL, NULL, 'blah'
5, 3, 89, '2017-04-08 09:00:00', 1, 76, 'Y', NULL, NULL, NULL, 'blah'
5, 3, 89, '2017-04-08 09:00:00', 1, 82, 'Y', NULL, NULL, NULL, 'blah'
5, 4, 99, '2017-04-08 09:00:00', 1, 25, 'Y', NULL, NULL, NULL, 'blah'
5, 4, 99, '2017-04-08 09:00:00', 1, 16, 'Y', NULL, NULL, NULL, 'blah'
5, 4, 99, '2017-04-08 09:00:00', 1, 76, 'Y', NULL, NULL, NULL, 'blah'
5, 4, 99, '2017-04-08 09:00:00', 1, 82, 'Y', NULL, NULL, NULL, 'blah'
5, 5, 32, '2017-04-14 16:00:00', 3, 32, '', 32, NULL, NULL, 'blah'
5, 5, 32, '2017-04-14 16:00:00', 3, 32, '', 47, NULL, NULL, 'blah'
5, 5, 32, '2017-04-14 16:00:00', 3, 47, '', 32, NULL, NULL, 'blah'
5, 5, 32, '2017-04-14 16:00:00', 3, 47, '', 47, NULL, NULL, 'blah'

2 个解决方案

#1

The main idea is taken from Partition Function COUNT() OVER possible using DISTINCT that @Jayvee pointed out with a small addition that would make it work when acID has NULL values.

主要思想是从Partition Function COUNT()中获取的，可以使用@Jayvee指出的与一个小的加法，当acID具有空值时，它可以工作。

Most likely you can remove all indexes from your @Temp table, the server will have to sort it in several different ways for different window functions anyway, but there is no self-join, so it should be faster.

很有可能您可以从@Temp表中删除所有索引，服务器将不得不为不同的窗口函数以不同的方式对其进行排序，但是不存在自连接，因此它应该更快。

The plan will have many sorts and they also can be slow, especially when engine underestimates the number of rows in a table. And table variable is exactly this case. Optimiser thinks that table variable has only 1 row. So, I'd recommend to use a classic #Temp table here, even without indexes.

该计划将有许多种类，而且它们也可能很慢，特别是当engine低估了表中的行数时。表变量就是这种情况。Optimiser认为表变量只有一行。因此，我建议在这里使用一个经典的#Temp表，即使没有索引。

An index on (aiID, acID) should help, but there will be other sorts any way.

关于(aiID, acID)的索引应该会有所帮助，但是还有其他的方法。

WITH
CTE_Counts
AS
(
    SELECT
        *
        -- use DENSE_RANK() to calculate COUNT(DISTINCT)
        , DENSE_RANK() OVER (PARTITION BY [aiID] ORDER BY [acID])
        + DENSE_RANK() OVER (PARTITION BY [aiID] ORDER BY [acID] DESC)
        -- subtract extra 1 if acID has NULL values within the partition
        - MAX(CASE WHEN [acID] IS NULL THEN 1 ELSE 0 END) OVER (PARTITION BY [aiID])
        - 1 AS [noac]
    FROM @Temp
)
,CTE_SetNum
AS
(
    SELECT
        *
        , DENSE_RANK() OVER (ORDER BY [aisdt], [aiID]) AS [SetNum]
    FROM CTE_Counts
    WHERE [noac] < [asm]
)
SELECT
    *
    , MAX([SetNum]) OVER () AS [MaxSet]
FROM CTE_SetNum
ORDER BY
    [aisdt]
    ,[aiID]
    ,[SetNum]
;

#2

Index as suggested in the comments would definitely play a major part but also I think you can re-write the query without self join in this way:

评论中提到的索引肯定会起到很大的作用，但我认为你也可以不用self join来重新编写查询:

   SELECT MAX([t].[SetNum]) OVER (PARTITION BY NULL) AS [MaxSet]
      ,*
FROM (
    SELECT DENSE_RANK() OVER (ORDER BY [t1].[aisdt], [t1].[aiID]) AS [SetNum]
          ,[t1].*
    FROM (
        SELECT * ,dense_rank() over(partition by aiID order by [acID]) - 
        dense_rank() over(partition by aiID order by [acID]) - 1 AS [noac]
        FROM @Temp
    ) [t1]
    WHERE [t1].[noac] < [t1].[asm]
) [t]

#1