SQL query - need to improve performance

I have a data load scenario where I build a dynamic SQL query to pull data and cache it in our service. One table contains all product data: ProductHistory (47 columns, 200,000+ records, and it will keep growing).

What I need: Get the latest products by using the maximum id, maximum version and maximum changeid.

First Attempt:

SELECT distinct Product.* FROM ProductHistory product 
WHERE  product.version = 
(SELECT max(version) from ProductHistory p2 where product.Id = p2.Id 
  and product.changeId = 
(SELECT max(changeid) from ProductHistory p3 where p2.changeId = p3.changeId))

This took more than 2.51 minutes.

Other Failed Attempt:

select distinct product.* from ProductHistory product 
where CAST(CAST(id as nvarchar)+'0'+CAST(Version as nvarchar)+'0'+CAST(changeid as nvarchar) as decimal) = 
(select MAX(CAST(CAST(id as nvarchar)+'0'+CAST(Version as nvarchar)+'0'+CAST(changeid as nvarchar) as decimal)) from ProductHistory p2 
where product.Id = p2.Id)

It basically uses the same principle as when you order dates, concatenating the numbers ordered by relevance.

For example 11 Jun 2007 = 20070611
And in our case: Id = 4 , version = 127, changeid = 32   => 40127032
The zeros are there not to mix up the 3 different ids

But this one takes 3.10 minutes !!! :(

So I basically need a way to make my first-attempt query faster. I was also wondering: with this amount of data, is this the best retrieval speed I should expect?

I ran sp_helpindex ProductHistory and found the following index:

PK_ProductHistoryNew - clustered, unique, primary key located on PRIMARY - Id, Version

I wrapped the first query in a stored procedure, but there was still no change.

So I am wondering by what other means we can improve the performance of this operation.

Thanks,
Mani

P.S. I am just running these queries in SQL Server Management Studio to see the time.

8 answers

#1

Run the query from SQL Server Management Studio and look at the query plan to see where the bottleneck is. Any place you see a "table scan" or "index scan", it has to go through all the data to find what it is looking for. If you create appropriate indexes that can be used for these operations, performance should improve.
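
As a rough sketch of one way to capture the plan from a query window, T-SQL's SET STATISTICS XML option returns the actual execution plan alongside the results. The query below is only a stand-in (an assumption for illustration); substitute the query being tuned.

```sql
-- Return the actual execution plan as an extra XML result set.
SET STATISTICS XML ON;
GO

-- Stand-in query (assumption): replace with the query you are tuning.
SELECT p.Id, MAX(p.Version) AS MaxVersion
FROM ProductHistory AS p
GROUP BY p.Id;
GO

SET STATISTICS XML OFF;
GO
```

Scan operators over ProductHistory in the returned plan are the places where an index may help.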

#2

Some things I see:

Is the DISTINCT necessary? If you do a DISTINCT * it's unlikely to have any benefit but it will have overhead to check for duplicates across all fields.

Instead of having two subselects in your WHERE clause, JOIN to a derived table. This should process only once. I suspect your WHERE clause is processing multiple times.

SELECT product.*
FROM ProductHistory product
INNER JOIN ( SELECT p.Id,
                    MAX(p.Version) AS [MaxVer],
                    MAX(p.ChangeId) AS [MaxChange]
             FROM ProductHistory p   -- aggregate over the same history table
             GROUP BY p.Id) SubQ
    ON SubQ.Id = product.Id
    AND SubQ.MaxChange = product.ChangeId
    AND SubQ.MaxVer = product.Version

You should also have an index on Id, Version, ChangeID for this.
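
A minimal sketch of such an index, assuming the column names from the question; the index name is made up for illustration. Since the clustered primary key already covers (Id, Version), the likely benefit is that the aggregate subquery can read this narrow index instead of the full 47-column rows.

```sql
-- Hypothetical index name; key order follows the suggestion above.
CREATE NONCLUSTERED INDEX IX_ProductHistory_Id_Version_ChangeId
    ON ProductHistory (Id, Version, ChangeId);
```

Check the query plan before and after creating it to confirm the optimizer actually uses it.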

#3

Well, storing everything in one table is not the way to go. It is better to store the latest version in one table and use another one (with the same structure) for the history (as I guess you are more interested in current products than old ones). Design issues like this lead to many workarounds...

Also, do not use DISTINCT because it often hides issues in the query (usually, if duplicates are retrieved, it means you can optimize better).

Now, the best part: how to resolve your problem? I guess you should use grouping, giving something like this:

SELECT p.id, max(p.version) AS max_version, max(p.changeid) AS max_changeid
  FROM ProductHistory p
  -- WHERE <filter if necessary for old products or anything else>
  GROUP BY p.id

But looking at your PK, I'm surprised: the changeid should not be relevant, as you should only need to deal with the id and version...

I am not sure my query is fully correct because I cannot test it, but I guess you can do some testing.

#4

I think you need an index on (Id, changeId, version) for this query. Please provide the table definition, the indexes on the table now and the query plan for your query.

#5

This is getting a bit funky, but I wonder if partitioning would work:

  SELECT s.Id
  FROM (
      SELECT Id,
             version,   -- needed by the outer filter
             MAX(version) OVER (PARTITION BY changeId) AS max_version
      FROM ProductHistory
  ) s
  WHERE s.version = s.max_version

#6

I have a feeling this query will take longer as the number of rows increases, but it's worth a shot:

SELECT * FROM
(
SELECT Col1, Col2, Col3,
ROW_NUMBER() OVER (PARTITION BY ProductHistory.Id ORDER BY Version DESC, ChangeID DESC) AS RowNumber
FROM ProductHistory
) AS Ranked  -- SQL Server requires an alias on a derived table
WHERE RowNumber = 1

#7

Try this CTE, it should be the fastest option possible and you probably won't even need indexes to get great speed:

with mysuperfastcte as (
 select product.*,
 row_number() over (partition by id order by version desc) as versionorder,
 row_number() over (partition by id order by changeid desc) as changeorder
 from ProductHistory as product
)
-- the alias "product" is not visible outside the CTE, so select from the CTE itself
select distinct *
from mysuperfastcte
where versionorder = 1
and changeorder = 1;

NB. I think you may have a bug at this point in your code, so please double-check the results of my code against what you expect:

  and product.changeId =  (SELECT max(changeid) from ProductHistory p3 where p2.changeId = p3.changeId))

You are trying to get max(changeid) using a correlated subquery, but you are also correlating on changeid itself, which is the same as just getting every row. Presumably you didn't intend that?
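
For comparison, here is a sketch of what the first attempt may have been meant to do. It assumes the goal is the highest version per Id and then the highest changeid within that version; that reading is a guess, not something the question confirms.

```sql
SELECT product.*
FROM ProductHistory AS product
WHERE product.Version =
      (SELECT MAX(p2.Version)
       FROM ProductHistory AS p2
       WHERE p2.Id = product.Id)              -- latest version of this product
  AND product.ChangeId =
      (SELECT MAX(p3.ChangeId)
       FROM ProductHistory AS p3
       WHERE p3.Id = product.Id
         AND p3.Version = product.Version);   -- latest change within that version
```

If (Id, Version) really is unique, as the primary key suggests, the second condition cannot remove any rows and could be dropped.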

Also - obviously reduce the number of columns you are returning to just those you need and then run the following before running your query and check the messages output:

SET STATISTICS IO ON

Look for tables with high logical reads and figure out where an index will help you.

Hint: If my code works for you then depending on the columns you need you could do:

create index ix1 on ProductHistory (id, version desc) include (changeid, .... );

I hope this helps!

#8

Generally speaking, without a suitable index, SELECT MAX() has to go through the whole table, and you are doing it twice.

SELECT TOP 1 is way faster, but you need to make sure your index is right and you have a correct ORDER BY. See if you can play with that.
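
A minimal sketch of the TOP 1 idea, assuming Id is an integer and that "latest" means highest version, then highest changeid. The @Id variable and its value are placeholders for illustration.

```sql
-- Hypothetical parameter: the product to look up.
DECLARE @Id int = 4;

-- Latest row for a single product.
SELECT TOP (1) ph.*
FROM ProductHistory AS ph
WHERE ph.Id = @Id
ORDER BY ph.Version DESC, ph.ChangeId DESC;

-- Same idea applied to every product via CROSS APPLY.
SELECT latest.*
FROM (SELECT DISTINCT Id FROM ProductHistory) AS ids
CROSS APPLY (
    SELECT TOP (1) ph.*
    FROM ProductHistory AS ph
    WHERE ph.Id = ids.Id
    ORDER BY ph.Version DESC, ph.ChangeId DESC
) AS latest;
```

Whether this beats the ROW_NUMBER approach depends on the indexes and data distribution, so compare the plans of both.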
