Best-performing query for the latest record for each N

Time: 2020-11-28 12:27:57

Here is the scenario I find myself in.

I have a reasonably big table from which I need to query the latest records. Here is the CREATE statement, showing only the columns essential to the query:

CREATE TABLE [dbo].[ChannelValue](
   [ID] [bigint] IDENTITY(1,1) NOT NULL,
   [UpdateRecord] [bit] NOT NULL,
   [VehicleID] [int] NOT NULL,
   [UnitID] [int] NOT NULL,
   [RecordInsert] [datetime] NOT NULL,
   [TimeStamp] [datetime] NOT NULL
   ) ON [PRIMARY]
GO

The ID column is the primary key, and there is a nonclustered index on TimeStamp and VehicleID (in that key order):

CREATE NONCLUSTERED INDEX [IX_ChannelValue_TimeStamp_VehicleID] ON [dbo].[ChannelValue] 
(
    [TimeStamp] ASC,
    [VehicleID] ASC
)ON [PRIMARY]
GO

The table I'm using to optimise my query holds a little over 23 million rows, and that is only a tenth of the size the query will need to run against.

I need to return the latest row for each VehicleID.

I've been looking through the responses to this question here on this site and I've done a fair bit of Googling. There seem to be three or four common ways of doing this on SQL Server 2005 and upwards.

So far the fastest method I've found is the following query:

SELECT cv.*
FROM ChannelValue cv
WHERE cv.TimeStamp = (
SELECT
    MAX(TimeStamp)
FROM ChannelValue
WHERE ChannelValue.VehicleID = cv.VehicleID
)

With the current amount of data in the table it takes about 6 s to execute, which is within reasonable limits. With the amount of data the table will hold in the live environment, however, the query becomes too slow.

Looking at the execution plan, my concern is with what SQL Server is doing to return the rows.

I cannot post the execution plan image because my reputation isn't high enough, but the index scan is reading every single row in the table, which is what slows the query down so much.

I've tried rewriting the query in several different ways, including the ROW_NUMBER() with PARTITION BY approach available from SQL Server 2005, like this:

WITH cte
AS (
    SELECT *,
    ROW_NUMBER() OVER(PARTITION BY VehicleID ORDER BY TimeStamp DESC) AS seq
     FROM ChannelValue
)

SELECT
   VehicleID,
   TimeStamp,
   Col1
FROM cte
WHERE seq = 1

But that query performs considerably worse.

I've also tried restructuring the query like this, but the speed and the execution plan are nearly identical:

SELECT cv.*
FROM (
   SELECT VehicleID
    ,MAX(TimeStamp) AS [TimeStamp]
   FROM ChannelValue
   GROUP BY VehicleID
) AS [q]
INNER JOIN ChannelValue cv
   ON cv.VehicleID = q.VehicleID
   AND cv.TimeStamp = q.TimeStamp

I have some (limited) flexibility around the table structure, so I can add indexes, indexed views and so forth, or even additional tables to the database.

I would greatly appreciate any help at all here.

Edit: Added the link to the execution plan image.

3 solutions

#1


7  

Depends on your data (how many rows are there per group?) and your indexes.

See Optimizing TOP N Per Group Queries for some performance comparisons of 3 approaches.

In your case, with millions of rows for only a small number of vehicles, I would add an index on (VehicleID, TimeStamp) and run:

SELECT CA.*
FROM   Vehicles V
       CROSS APPLY (SELECT TOP 1 *
                    FROM   ChannelValue CV
                    WHERE  CV.VehicleID = V.VehicleID
                    ORDER  BY TimeStamp DESC) CA  
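
The index this answer recommends is not spelled out. A minimal sketch of what it might look like, assuming the ChannelValue table from the question; the index name is made up for illustration, and the query above additionally assumes a Vehicles table with one row per VehicleID, which the question does not show:

```sql
-- Hypothetical index name. With VehicleID as the leading key, each
-- TOP 1 ... ORDER BY TimeStamp DESC inside the CROSS APPLY can be
-- answered by one seek per vehicle instead of a full scan.
CREATE NONCLUSTERED INDEX [IX_ChannelValue_VehicleID_TimeStamp]
ON [dbo].[ChannelValue]
(
    [VehicleID] ASC,
    [TimeStamp] DESC
) ON [PRIMARY];
GO
```

Because the query selects every column, SQL Server would still need one lookup per vehicle to fetch the columns that are not in the index, which should stay cheap while the number of vehicles is small.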

#2


0  

If your records are inserted sequentially, replacing TimeStamp in your query with ID may make a difference.
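
As a rough sketch of that idea, assuming ID order really does match TimeStamp order within each vehicle (the question does not confirm this, and rows arriving out of order would make the result wrong):

```sql
-- Assumption: for every vehicle, the row with the highest ID is also
-- the row with the latest TimeStamp.
SELECT cv.*
FROM dbo.ChannelValue AS cv
INNER JOIN (
    SELECT VehicleID, MAX(ID) AS MaxID
    FROM dbo.ChannelValue
    GROUP BY VehicleID
) AS latest
    ON cv.ID = latest.MaxID;   -- join back on the primary key
```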

As a side note, how many records is this returning? Your delay could be network overhead if you are getting hundreds of thousands of rows back.

#3


0  

Try this:

SELECT SequencedChannelValue.* -- Specify only the columns you need; exclude SeqValue
FROM
    (
        SELECT 
            ChannelValue.*,   -- Specify only the columns you need
            SeqValue = ROW_NUMBER() OVER(PARTITION BY VehicleID ORDER BY TimeStamp DESC)
        FROM ChannelValue
    ) AS SequencedChannelValue
WHERE SequencedChannelValue.SeqValue = 1

A table or index scan is expected, because you're not filtering data in any way. You're asking for the latest TimeStamp for all VehicleIDs - the query engine HAS to look at every row to find the latest TimeStamp.

You can help it out by narrowing the number of columns being returned (don't use SELECT *), and by providing an index that consists of VehicleID + TimeStamp.
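
To illustrate both suggestions together, here is a sketch that assumes only UnitID and RecordInsert are needed besides the key columns; that column choice and the index name are assumptions for the example, not something stated in the question:

```sql
-- Hypothetical covering index: the keys match PARTITION BY / ORDER BY,
-- and INCLUDE carries the other returned columns so the scan never
-- has to touch the base table.
CREATE NONCLUSTERED INDEX [IX_ChannelValue_VehicleID_TimeStamp_Covering]
ON [dbo].[ChannelValue] ([VehicleID] ASC, [TimeStamp] DESC)
INCLUDE ([UnitID], [RecordInsert]);
GO

-- Same ROW_NUMBER() pattern, with an explicit column list
SELECT s.VehicleID, s.[TimeStamp], s.UnitID, s.RecordInsert
FROM (
    SELECT VehicleID, [TimeStamp], UnitID, RecordInsert,
           ROW_NUMBER() OVER (PARTITION BY VehicleID
                              ORDER BY [TimeStamp] DESC) AS SeqValue
    FROM dbo.ChannelValue
) AS s
WHERE s.SeqValue = 1;
```

The scan still reads every index row, but the index is much narrower than the table and is already ordered the way ROW_NUMBER() needs, so a separate sort step should not be required.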
