SQL string comparison speed: 'LIKE' vs 'PATINDEX'

I had a query as follows (simplified)...

SELECT     *
FROM       table1 AS a
INNER JOIN table2 AS b ON (a.name LIKE '%' + b.name + '%')

For my dataset this was taking around 90 seconds to execute, so I have been looking for ways of speeding it up. For no good reason, I thought I'd try PATINDEX instead of LIKE...

SELECT     *
FROM       table1 AS a
INNER JOIN table2 AS b ON (PATINDEX('%' + b.name + '%', a.name) > 0)

On the same dataset this executes in the blink of an eye and returns the same results.

Can anyone explain why LIKE is so much slower than PATINDEX? Given that LIKE just returns a Boolean whereas PATINDEX returns the actual location, I would have expected the latter to be slower if anything. Or is it simply a matter of how efficiently the two functions have been written?

Ok, here is each query in full, followed by its execution plan. "#StakeholderNames" is just a temp table of likely names which I am matching against.

I have pulled back the live data and run each query several times. The first is taking about 17 seconds (so somewhat less than the original 90 seconds on the live database) and the second under 1 second...

SELECT              sh.StakeholderID,
                    sh.HoldingID,
                    i.AgencyCommissionImportID,
                    1

    FROM            AgencyCommissionImport AS i
    INNER JOIN      #StakeholderNames AS sn ON REPLACE(REPLACE(i.ClientName,' ',''), ',','') LIKE '%' + sn.Name + '%'
    INNER JOIN      Holding AS h ON (h.ProviderName = i.Provider) AND (h.HoldingReference = i.PlanNumber)
    INNER JOIN      StakeholderHolding AS sh ON (sn.StakeholderID = sh.StakeholderID) AND (h.HoldingID = sh.HoldingID)
    WHERE           i.AgencyCommissionFileID = @AgencyCommissionFileID
                AND (i.MatchTypeID = 0)
                AND ((i.MatchedHoldingID IS NULL)
                    OR (i.MatchedStakeholderID IS NULL))

   |--Table Insert(OBJECT:([tempdb].[dbo].[#Results]), SET:([#Results].[StakeholderID] = [AttivoGroup_copy].[dbo].[StakeholderHolding].[StakeholderID] as [sh].[StakeholderID],[#Results].[HoldingID] = [AttivoGroup_copy].[dbo].[StakeholderHolding].[HoldingID] as [sh].[HoldingID],[#Results].[AgencyCommissionImportID] = [AttivoGroup_copy].[dbo].[AgencyCommissionImport].[AgencyCommissionImportID] as [i].[AgencyCommissionImportID],[#Results].[MatchTypeID] = [Expr1014],[#Results].[indx] = [Expr1013]))
        |--Compute Scalar(DEFINE:([Expr1014]=(1)))
             |--Compute Scalar(DEFINE:([Expr1013]=getidentity((1835869607),(2),N'#Results')))
                  |--Top(ROWCOUNT est 0)
                       |--Hash Match(Inner Join, HASH:([h].[ProviderName], [h].[HoldingReference])=([i].[Provider], [i].[PlanNumber]), RESIDUAL:([AttivoGroup_copy].[dbo].[Holding].[ProviderName] as [h].[ProviderName]=[AttivoGroup_copy].[dbo].[AgencyCommissionImport].[Provider] as [i].[Provider] AND [AttivoGroup_copy].[dbo].[Holding].[HoldingReference] as [h].[HoldingReference]=[AttivoGroup_copy].[dbo].[AgencyCommissionImport].[PlanNumber] as [i].[PlanNumber] AND [Expr1015] like [Expr1016]))
                            |--Nested Loops(Inner Join, OUTER REFERENCES:([sh].[HoldingID]))
                            |    |--Nested Loops(Inner Join, OUTER REFERENCES:([sn].[StakeholderID]))
                            |    |    |--Compute Scalar(DEFINE:([Expr1016]=('%'+#StakeholderNames.[Name] as [sn].[Name])+'%', [Expr1017]=LikeRangeStart(('%'+#StakeholderNames.[Name] as [sn].[Name])+'%'), [Expr1018]=LikeRangeEnd(('%'+#StakeholderNames.[Name] as [sn].[Name])+'%'), [Expr1019]=LikeRangeInfo(('%'+#StakeholderNames.[Name] as [sn].[Name])+'%')))
                            |    |    |    |--Table Scan(OBJECT:([tempdb].[dbo].[#StakeholderNames] AS [sn]))
                            |    |    |--Clustered Index Seek(OBJECT:([AttivoGroup_copy].[dbo].[StakeholderHolding].[PK_StakeholderHolding] AS [sh]), SEEK:([sh].[StakeholderID]=#StakeholderNames.[StakeholderID] as [sn].[StakeholderID]) ORDERED FORWARD)
                            |    |--Clustered Index Seek(OBJECT:([AttivoGroup_copy].[dbo].[Holding].[PK_Holding] AS [h]), SEEK:([h].[HoldingID]=[AttivoGroup_copy].[dbo].[StakeholderHolding].[HoldingID] as [sh].[HoldingID]) ORDERED FORWARD)
                            |--Compute Scalar(DEFINE:([Expr1015]=replace(replace([AttivoGroup_copy].[dbo].[AgencyCommissionImport].[ClientName] as [i].[ClientName],' ',''),',','')))
                                 |--Clustered Index Scan(OBJECT:([AttivoGroup_copy].[dbo].[AgencyCommissionImport].[PK_AgencyCommissionImport] AS [i]), WHERE:([AttivoGroup_copy].[dbo].[AgencyCommissionImport].[AgencyCommissionFileID] as [i].[AgencyCommissionFileID]=[@AgencyCommissionFileID] AND [AttivoGroup_copy].[dbo].[AgencyCommissionImport].[MatchTypeID] as [i].[MatchTypeID]=(0) AND ([AttivoGroup_copy].[dbo].[AgencyCommissionImport].[MatchedHoldingID] as [i].[MatchedHoldingID] IS NULL OR [AttivoGroup_copy].[dbo].[AgencyCommissionImport].[MatchedStakeholderID] as [i].[MatchedStakeholderID] IS NULL)))


SELECT              sh.StakeholderID,
                    sh.HoldingID,
                    i.AgencyCommissionImportID,
                    1

    FROM            AgencyCommissionImport AS i
    INNER JOIN      #StakeholderNames AS sn ON (PATINDEX('%' + sn.Name + '%', REPLACE(REPLACE(i.ClientName,' ',''), ',','')) > 0)
    INNER JOIN      Holding AS h ON (h.ProviderName = i.Provider) AND (h.HoldingReference = i.PlanNumber)
    INNER JOIN      StakeholderHolding AS sh ON (sn.StakeholderID = sh.StakeholderID) AND (h.HoldingID = sh.HoldingID)
    WHERE           i.AgencyCommissionFileID = @AgencyCommissionFileID
                AND (i.MatchTypeID = 0)
                AND ((i.MatchedHoldingID IS NULL)
                    OR (i.MatchedStakeholderID IS NULL))

   |--Table Insert(OBJECT:([tempdb].[dbo].[#Results]), SET:([#Results].[StakeholderID] = [AttivoGroup_copy].[dbo].[StakeholderHolding].[StakeholderID] as [sh].[StakeholderID],[#Results].[HoldingID] = [AttivoGroup_copy].[dbo].[StakeholderHolding].[HoldingID] as [sh].[HoldingID],[#Results].[AgencyCommissionImportID] = [AttivoGroup_copy].[dbo].[AgencyCommissionImport].[AgencyCommissionImportID] as [i].[AgencyCommissionImportID],[#Results].[MatchTypeID] = [Expr1014],[#Results].[indx] = [Expr1013]))
        |--Compute Scalar(DEFINE:([Expr1014]=(1)))
             |--Compute Scalar(DEFINE:([Expr1013]=getidentity((1867869721),(2),N'#Results')))
                  |--Top(ROWCOUNT est 0)
                       |--Hash Match(Inner Join, HASH:([h].[ProviderName], [h].[HoldingReference])=([i].[Provider], [i].[PlanNumber]), RESIDUAL:([AttivoGroup_copy].[dbo].[Holding].[ProviderName] as [h].[ProviderName]=[AttivoGroup_copy].[dbo].[AgencyCommissionImport].[Provider] as [i].[Provider] AND [AttivoGroup_copy].[dbo].[Holding].[HoldingReference] as [h].[HoldingReference]=[AttivoGroup_copy].[dbo].[AgencyCommissionImport].[PlanNumber] as [i].[PlanNumber] AND patindex([Expr1015],[Expr1016])>(0)))
                            |--Nested Loops(Inner Join, OUTER REFERENCES:([sh].[HoldingID]))
                            |    |--Nested Loops(Inner Join, OUTER REFERENCES:([sn].[StakeholderID]))
                            |    |    |--Compute Scalar(DEFINE:([Expr1015]=('%'+#StakeholderNames.[Name] as [sn].[Name])+'%'))
                            |    |    |    |--Table Scan(OBJECT:([tempdb].[dbo].[#StakeholderNames] AS [sn]))
                            |    |    |--Clustered Index Seek(OBJECT:([AttivoGroup_copy].[dbo].[StakeholderHolding].[PK_StakeholderHolding] AS [sh]), SEEK:([sh].[StakeholderID]=#StakeholderNames.[StakeholderID] as [sn].[StakeholderID]) ORDERED FORWARD)
                            |    |--Clustered Index Seek(OBJECT:([AttivoGroup_copy].[dbo].[Holding].[PK_Holding] AS [h]), SEEK:([h].[HoldingID]=[AttivoGroup_copy].[dbo].[StakeholderHolding].[HoldingID] as [sh].[HoldingID]) ORDERED FORWARD)
                            |--Compute Scalar(DEFINE:([Expr1016]=replace(replace([AttivoGroup_copy].[dbo].[AgencyCommissionImport].[ClientName] as [i].[ClientName],' ',''),',','')))
                                 |--Clustered Index Scan(OBJECT:([AttivoGroup_copy].[dbo].[AgencyCommissionImport].[PK_AgencyCommissionImport] AS [i]), WHERE:([AttivoGroup_copy].[dbo].[AgencyCommissionImport].[AgencyCommissionFileID] as [i].[AgencyCommissionFileID]=[@AgencyCommissionFileID] AND [AttivoGroup_copy].[dbo].[AgencyCommissionImport].[MatchTypeID] as [i].[MatchTypeID]=(0) AND ([AttivoGroup_copy].[dbo].[AgencyCommissionImport].[MatchedHoldingID] as [i].[MatchedHoldingID] IS NULL OR [AttivoGroup_copy].[dbo].[AgencyCommissionImport].[MatchedStakeholderID] as [i].[MatchedStakeholderID] IS NULL)))

3 Answers

#1

That kind of repeatable difference in performance is most likely due to a difference in the execution plans for the two queries.

Have SQL Server return the actual execution plan when each query is run, and compare the execution plans.

Also, when you compare the performance of the two queries, run each query twice and throw out the timing for the first run. The first run may include a lot of heavy lifting (statement parsing and database I/O); the second run gives an elapsed time that can be compared more validly with the other query.
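
A minimal sketch of that measurement, assuming the simplified table1/table2 query from the question (COUNT(*) is my substitution for SELECT * so that returning rows to the client does not dominate the timing):

```sql
-- Report CPU/elapsed time and page reads for each statement.
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

-- Run 1: warm-up. Pays for compilation and physical reads; ignore its timing.
SELECT     COUNT(*)
FROM       table1 AS a
INNER JOIN table2 AS b ON (a.name LIKE '%' + b.name + '%');

-- Run 2: same statement, plan and data now cached. Use this timing.
SELECT     COUNT(*)
FROM       table1 AS a
INNER JOIN table2 AS b ON (a.name LIKE '%' + b.name + '%');

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
```

Repeat the same pair of runs with the PATINDEX predicate and compare only the second-run figures from the Messages output.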

Can anyone explain why LIKE is so much slower than PATINDEX?

The execution plan for each query will likely explain the difference.

Is it simply a matter of how efficiently the two functions have been written?

It's not really a matter of how efficiently the functions are written. What really matters is the generated execution plan: whether the predicates are sargable, and whether the optimizer chooses to use the available indexes.
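
To illustrate what "sargable" means here, a small sketch; the #Names table, its index, and the 'Smith' values are hypothetical and exist only for this example:

```sql
-- Hypothetical table and index, for illustration only.
CREATE TABLE #Names (name varchar(100) NOT NULL);
CREATE INDEX IX_Names_name ON #Names (name);

-- Sargable: a fixed prefix lets the optimizer turn the pattern into
-- a range on the indexed column, so an index seek is possible.
SELECT name FROM #Names WHERE name LIKE 'Smith%';

-- Not sargable: with a leading wildcard there is no usable range,
-- so every row has to be examined.
SELECT name FROM #Names WHERE name LIKE '%Smith%';

-- Not sargable: the column is wrapped in a function call.
SELECT name FROM #Names WHERE PATINDEX('%Smith%', name) > 0;

DROP TABLE #Names;
```

Note that both join predicates in the question fall into the non-sargable group (leading wildcard, and REPLACE applied to the column), so index usage by itself would not be expected to separate the two queries.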

[EDIT]

In the quick test I ran, I see a difference in the execution plans. With the LIKE operator in the join predicate, the plan includes a "Table Spool (Lazy Spool)" operation on table2 after the "Compute Scalar" operation. With the PATINDEX function, I don't see a "Table Spool" operation in the plan. But the plans I'm getting may differ significantly from the plans you get, given differences in the queries, tables, indexes and statistics.

[EDIT]

The only difference I see in the execution plan output for the two queries (aside from expression placeholder names) is the calls to the three internal functions (LikeRangeStart, LikeRangeEnd, and LikeRangeInfo) in place of one call to the PATINDEX function. These functions appear to be called for each row in a result set, and the resulting expressions are used for the scan of the inner table in a nested loop.

So it does look as if the three function calls for the LIKE operator could be more expensive, in elapsed time, than the single call to the PATINDEX function. (The plan shows those functions being called for each row in the outer result set of a nested loop join; for a large number of rows, even a slight difference in elapsed time could be multiplied enough times to produce a significant performance difference.)

After running some test cases on my system, I'm still baffled at the results you are seeing.

Maybe it is an issue with the performance of the calls to the PATINDEX function vs. the calls to the three internal functions (LikeRangeStart, LikeRangeEnd, LikeRangeInfo).

It's possible that with those performed on a "large" enough result set, a small difference in elapsed time could be multiplied into a significant difference.

But I actually find it to be somewhat surprising that a query using the LIKE operator would take significantly longer to execute than an equivalent query using the PATINDEX function.

#2

Perhaps this is a question of DB Caching...

Try resetting the caches before running each query, using these DBCC commands:

DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE

#3

I'm not at all convinced by the thesis that it is the extra overhead of the LikeRangeStart, LikeRangeEnd, LikeRangeInfo functions that is responsible for the time discrepancy.

It is simply not reproducible (at least in my test, with default collation etc.). When I try the following:

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;

DECLARE @T TABLE (name sysname )
INSERT INTO @T
SELECT TOP 2500 name + '...' + 
   CAST(ROW_NUMBER() OVER (ORDER BY (SELECT 0)) AS VARCHAR)
FROM sys.all_columns

SET STATISTICS IO ON;
SET STATISTICS TIME ON;
PRINT '***'
SELECT     COUNT(*)
FROM       @T AS a
INNER JOIN @T AS b ON (a.name LIKE '%' + b.name + '%')

PRINT '***'
SELECT     COUNT(*)
FROM       @T AS a
INNER JOIN @T AS b ON (PATINDEX('%' + b.name + '%', a.name) > 0)

This gives essentially the same plan for both, one that also contains these various internal functions, and I get the following:

LIKE

Table '#5DB5E0CB'. Scan count 2, logical reads 40016
CPU time = 26953 ms,  elapsed time = 28083 ms.

PATINDEX

Table '#5DB5E0CB'. Scan count 2, logical reads 40016
CPU time = 28329 ms,  elapsed time = 29458 ms.

I do notice, however, that if I substitute a #temp table for the table variable, the estimated number of rows going into the stream aggregate is significantly different.

The LIKE version has an estimated 330,596 rows and the PATINDEX version an estimated 1,875,000.

I notice you also have a hash join in your plan. Possibly, because the PATINDEX version seems to estimate a greater number of rows than the LIKE version, that query gets a larger memory grant and so doesn't have to spill the hash operation to disk. Try tracing the Hash Warning events in Profiler to see if this is the case.
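
As a complement to the Profiler trace, the memory grants themselves can be compared. This is only a sketch of one way to do it, not something from the original test: it assumes you run it from a second connection while the slow query is still executing, and the session id 53 is a placeholder for the session actually running that query.

```sql
-- Placeholder: replace with the session_id of the connection running the query.
DECLARE @spid int;
SET @spid = 53;

-- One row per query currently holding or waiting for a memory grant.
SELECT session_id,
       requested_memory_kb,   -- what the query asked for, based on its estimates
       granted_memory_kb,     -- what it was actually given
       used_memory_kb,        -- what it is using right now
       max_used_memory_kb     -- peak usage so far
FROM   sys.dm_exec_query_memory_grants
WHERE  session_id = @spid;
```

If the LIKE version shows a much smaller requested_memory_kb than the PATINDEX version and also raises hash warnings, that would be consistent with the spill explanation above.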
