How can I improve SQL query performance in my case?

Date: 2022-03-16 06:23:17

I have a table with a very simple schema: an ID column as the unique primary key (uniqueidentifier type) and some other nvarchar columns. My current goal is: for 5000 inputs, I need to determine which ones are already contained in the table and which are not. The inputs are strings, and I have a C# function that converts a string into a uniqueidentifier (GUID). My logic is: if a matching ID exists, I treat the string as already contained in the table.

My question is: if I need to find out which of the 5000 input strings are already contained in the DB, and which are not, what is the most efficient way?

BTW: my current implementation converts each string to a GUID in C# code, then invokes a stored procedure that queries whether that ID exists in the database and returns the result to the C# code.

My working environment: VSTS 2008 + SQL Server 2008 + C# 3.5.

7 Answers

#1


My first instinct would be to pump your 5000 inputs into a single-column temporary table X, possibly index it, and then use:

SELECT X.thecol
FROM X
JOIN ExistingTable ON ExistingTable.thecol = X.thecol

to get the ones that are present, and (if both sets are needed)

SELECT X.thecol
FROM X
LEFT JOIN ExistingTable ON ExistingTable.thecol = X.thecol
WHERE ExistingTable.thecol IS NULL

to get the ones that are absent. Worth benchmarking, at least.
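
As a hedged sketch of the setup this answer describes (the names #X and thecol are this answer's own placeholders), the temp table might be created and populated like this in SQL Server:

DECLARE @id UNIQUEIDENTIFIER  -- one of the 5000 GUIDs computed in C#

-- Single-column temp table; the PRIMARY KEY gives it an index for the join.
CREATE TABLE #X (thecol UNIQUEIDENTIFIER NOT NULL PRIMARY KEY)

-- Populate from the client in batches (or via BULK INSERT / a table-valued
-- parameter), then run the two join queries shown in this answer against #X.
INSERT INTO #X (thecol) VALUES (@id)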

Edit: as requested, here are some good docs & tutorials on temp tables in SQL Server. Bill Graziano has a simple intro covering temp tables, table variables, and global temp tables. Randy Dyess and SQL Master discuss performance issues for and against them (but remember that if you're getting performance problems you do want to benchmark alternatives, not just go on theoretical considerations!-).

MSDN has articles on tempdb (where temp tables are kept) and optimizing its performance.

#2


Step 1. Make sure you have a problem to solve. In a lot of contexts, five thousand inserts, even one at a time, isn't a lot.

Are you certain that the simplest way possible isn't sufficient? What performance issues have you measured so far?

#3


What do you need to do with the entries that do or don't exist in your table?

Depending on what you need, maybe the new MERGE statement in SQL Server 2008 could fit the bill - update what's already there, insert new stuff, all wrapped neatly into a single SQL statement. Check it out!

Your statement would look something like this:

MERGE INTO 
    (your target table) AS t
USING 
    (your source table, e.g. a temporary table) AS s
ON t.ID = s.ID
WHEN NOT MATCHED THEN  -- row does not exist in the target table
  ....(do whatever you need to do)
WHEN MATCHED THEN      -- row exists in the target table
  ... (do whatever else you need to do)
;

To make this really fast, I would load the "new" records from e.g. a TXT or CSV file into a temporary table in SQL Server using BULK INSERT:

BULK INSERT YourTemporaryTable
FROM 'c:\temp\yourimportfile.csv'
WITH 
(
    FIELDTERMINATOR =',',
    ROWTERMINATOR =' |\n'
)

BULK INSERT combined with MERGE should give you the best performance you can get on this planet :-)
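
An end-to-end sketch of that combination might look like the following (the file path, #Incoming temp table, and dbo.MyTable target with its ID column are illustrative assumptions, not names from the question):

-- All names here are illustrative; adjust to your schema.
CREATE TABLE #Incoming (ID UNIQUEIDENTIFIER NOT NULL PRIMARY KEY);

BULK INSERT #Incoming
FROM 'c:\temp\yourimportfile.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');

-- Insert only the rows that are not already in the target table.
MERGE INTO dbo.MyTable AS t
USING #Incoming AS s
    ON t.ID = s.ID
WHEN NOT MATCHED THEN
    INSERT (ID) VALUES (s.ID);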

Marc

PS: here's a note from TechNet on MERGE performance and why it's faster than individual statements:

In SQL Server 2008, you can perform multiple data manipulation language (DML) operations in a single statement by using the MERGE statement. For example, you may need to synchronize two tables by inserting, updating, or deleting rows in one table based on differences found in the other table. Typically, this is done by executing a stored procedure or batch that contains individual INSERT, UPDATE, and DELETE statements. However, this means that the data in both the source and target tables are evaluated and processed multiple times; at least once for each statement. By using the MERGE statement, you can replace the individual DML statements with a single statement. This can improve query performance because the operations are performed within a single statement, therefore, minimizing the number of times the data in the source and target tables are processed. However, performance gains depend on having correct indexes, joins, and other considerations in place. This topic provides best practice recommendations to help you achieve optimal performance when using the MERGE statement.

#4


Try to ensure you end up running only one query - i.e. if your solution consists of running 5000 queries against the database, that'll probably be the biggest consumer of resources for the operation.

If you can insert the 5000 IDs into a temporary table, you could then write a single query to find the ones that don't exist in the database.

#5


If you want simplicity, since 5000 records is not very many, then from C# just use a loop to generate an INSERT statement for each of the strings you want to add to the table. Wrap each insert in a TRY...CATCH block, and send them all up to the server in one shot like this:

BEGIN TRY
INSERT INTO table (theCol, field2, field3)
SELECT theGuid, value2, value3
END TRY BEGIN CATCH END CATCH

BEGIN TRY
INSERT INTO table (theCol, field2, field3)
SELECT theGuid, value2, value3
END TRY BEGIN CATCH END CATCH

BEGIN TRY
INSERT INTO table (theCol, field2, field3)
SELECT theGuid, value2, value3
END TRY BEGIN CATCH END CATCH

If you have a unique index or primary key defined on your GUID column, then the duplicate inserts will fail. Checking ahead of time whether a record exists just duplicates work that SQL Server is going to do anyway.

If performance is really important, then consider downloading the 5000 GUIDs to your local machine and doing all the analysis locally. Reading 5000 GUIDs should take much less than 1 second. This is simpler than bulk-importing into a temp table (which is the only way you will get performance from a temp table) and doing an update using a join to the temp table.

#6


Since you are using SQL Server 2008, you could use table-valued parameters. It's a way to provide a table as a parameter to a stored procedure.

Using ADO.NET you could easily pre-populate a DataTable and pass it as a SqlParameter. Steps you need to perform:

Create a custom SQL type

CREATE TYPE MyType AS TABLE
(
    UniqueId UNIQUEIDENTIFIER NOT NULL,
    [Column] NVARCHAR(255) NOT NULL
)

Create a stored procedure which accepts the Type

CREATE PROCEDURE spInsertMyType
@Data MyType READONLY
AS 
xxxx
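
The procedure body is elided above; as a sketch of what it might contain for the asker's use case - reporting which of the passed IDs already exist - assuming a target table dbo.MyTable with a uniqueidentifier ID column and a type whose UniqueId column holds the GUIDs (both assumptions, not names from the original):

-- Hypothetical body: dbo.MyTable, its ID column, and the GUID-typed
-- UniqueId column of MyType are assumptions for illustration.
CREATE PROCEDURE spCheckMyType
    @Data MyType READONLY
AS
BEGIN
    SET NOCOUNT ON;

    -- Return each input ID with a flag showing whether it already exists.
    SELECT  d.UniqueId,
            CASE WHEN t.ID IS NULL THEN 0 ELSE 1 END AS FoundInDB
    FROM    @Data AS d
    LEFT JOIN dbo.MyTable AS t ON t.ID = d.UniqueId;
END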

Call using C#

SqlCommand insertCommand = new SqlCommand("spInsertMyType", connection);
insertCommand.CommandType = CommandType.StoredProcedure;
SqlParameter tvpParam = insertCommand.Parameters.AddWithValue("@Data", dataTable);
tvpParam.SqlDbType = SqlDbType.Structured;

Links: Table-valued Parameters in SQL Server 2008

#7


Definitely do not do it one-by-one.

My preferred solution is to create a stored procedure with one parameter that takes XML in the following format:

<ROOT>
  <MyObject ID="60EAD98F-8A6C-4C22-AF75-000000000000"/>
  <MyObject ID="60EAD98F-8A6C-4C22-AF75-000000000001"/>
  ....
</ROOT>

Then, in the procedure, the NVARCHAR(MAX) argument is converted to XML, after which it is used as a table with a single column (let's call it @FilterTable). The stored procedure looks like this:

CREATE PROCEDURE dbo.sp_MultipleParams(@FilterXML NVARCHAR(MAX))
AS BEGIN
    SET NOCOUNT ON

    DECLARE @x XML
    SELECT @x = CONVERT(XML, @FilterXML)

    -- temporary table (must have it, because cannot join on XML statement)
    DECLARE @FilterTable TABLE (
         "ID" UNIQUEIDENTIFIER
    )

    -- insert into temporary table
    -- important: XML is case-sensitive
    INSERT      @FilterTable
    SELECT      x.value('@ID', 'UNIQUEIDENTIFIER')
    FROM        @x.nodes('/ROOT/MyObject') AS R(x)

    SELECT      o.ID,
                SIGN(SUM(CASE WHEN t.ID IS NULL THEN 0 ELSE 1 END)) AS FoundInDB
    FROM        @FilterTable o
    LEFT JOIN   dbo.MyTable t
            ON  o.ID = t.ID
    GROUP BY    o.ID

END
GO

You run it as:

EXEC sp_MultipleParams '<ROOT><MyObject ID="60EAD98F-8A6C-4C22-AF75-000000000000"/><MyObject ID="60EAD98F-8A6C-4C22-AF75-000000000002"/></ROOT>'

And your results look like:

ID                                   FoundInDB
------------------------------------ -----------
60EAD98F-8A6C-4C22-AF75-000000000000 1
60EAD98F-8A6C-4C22-AF75-000000000002 0
