The most efficient way to store queries and counts for large SQL data

Time: 2023-02-10 23:47:38

I have a SQL Server database with a large amount of data (65 million rows, mostly text, 8 GB total). The data changes only once per week. I have an ASP.NET web application that runs several SQL queries against this data, each counting the number of rows that satisfy various conditions. Since the data changes only once per week, what is the most efficient way to store both the SQL queries and their counts for the week? Should I store them in the database or in the application?

3 answers

#1


3  

If the data is modified only once a week, then as the last step of that (ETL?) process, run your "basic" counts and store the results in a table in the database. Thereafter, instead of running lengthy queries against the big tables, you can just query those small summary tables.
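
As a rough illustration of the summary-table idea, here is a minimal T-SQL sketch. Everything named in it is an assumption made for the example: the source table dbo.Documents, its Status and Title columns, the summary table dbo.WeeklyCounts, and the procedure dbo.RefreshWeeklyCounts are all hypothetical, and the two WHERE clauses stand in for whatever conditions the application really counts. GO is the batch separator used by SSMS and sqlcmd.

```sql
-- Hypothetical summary table: one row per named count.
CREATE TABLE dbo.WeeklyCounts (
    CountName   varchar(100) NOT NULL PRIMARY KEY,
    RowTotal    bigint       NOT NULL,
    RefreshedAt datetime     NOT NULL
);
GO

-- Hypothetical procedure, intended to run as the last step of the weekly load.
CREATE PROCEDURE dbo.RefreshWeeklyCounts
AS
BEGIN
    SET NOCOUNT ON;
    SET XACT_ABORT ON;  -- roll the whole refresh back if any statement fails

    BEGIN TRANSACTION;

    -- Replace last week's results in one transaction so that readers
    -- never see a half-refreshed table.
    DELETE FROM dbo.WeeklyCounts;

    INSERT INTO dbo.WeeklyCounts (CountName, RowTotal, RefreshedAt)
    SELECT 'ActiveDocuments', COUNT_BIG(*), GETDATE()
    FROM dbo.Documents                -- assumed source table
    WHERE Status = 'Active'           -- placeholder condition
    UNION ALL
    SELECT 'DocumentsWithTitle', COUNT_BIG(*), GETDATE()
    FROM dbo.Documents
    WHERE Title IS NOT NULL;          -- placeholder condition

    COMMIT TRANSACTION;
END;
GO

-- The web application then reads the stored result instead of
-- scanning the 65-million-row table.
SELECT RowTotal, RefreshedAt
FROM dbo.WeeklyCounts
WHERE CountName = 'ActiveDocuments';
```

Keeping a RefreshedAt column is a design choice of this sketch: it lets the application show how old a count is, or detect that the weekly refresh did not run.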

#2


2  

If you do not need 100% up-to-the-minute accurate row counts, you could query SQL Server's internal info:

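-- Approximate row counts per user table, read from the legacy
-- compatibility views sysobjects and sysindexes.
SELECT so.name AS TableName, si.rowcnt AS [RowCount]
FROM sysobjects so
INNER JOIN sysindexes si ON so.id = si.id
WHERE so.type = 'U'      -- user tables only
  AND si.indid < 2       -- indid 0 = heap, 1 = clustered index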

This is very quick to execute and requires no extra tables. The counts can be inaccurate while many updates are occurring, but they might be accurate enough for your intended usage. [Thank you to commenters!]

Update: did a bit of digging and this does produce accurate counts (slower due to the sum, but still quick):

-- Row counts per table, summed over all partitions of the heap or
-- clustered index (nonclustered indexes are excluded to avoid double counting).
SELECT OBJECT_SCHEMA_NAME(ps.object_id) AS SchemaName,
       OBJECT_NAME(ps.object_id) AS ObjectName,
       SUM(ps.row_count) AS row_count
FROM sys.dm_db_partition_stats ps
JOIN sys.indexes i ON i.object_id = ps.object_id
                      AND i.index_id = ps.index_id
WHERE i.type_desc IN ('CLUSTERED','HEAP')
AND OBJECT_SCHEMA_NAME(ps.object_id) <> 'sys'
GROUP BY ps.object_id
ORDER BY OBJECT_NAME(ps.object_id), OBJECT_SCHEMA_NAME(ps.object_id)

Ref.

Remember that the stored count information was not always 100% accurate in SQL Server 2000. For a new table created on SQL Server 2005, the counts will be accurate. But for a table that existed in SQL Server 2000 and now resides on 2005 through a restore or upgrade, you need to run (only once, after the move to 2005) either sp_spaceused with @updateusage = N'true' or DBCC UPDATEUSAGE with the COUNT_ROWS option.
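
For reference, the two one-time corrections mentioned above could be run roughly as follows. This is only a sketch: dbo.Documents is a hypothetical table name used for illustration, and on a large database DBCC UPDATEUSAGE can take a while, so it is probably best run in a quiet period.

```sql
-- Option 1: correct the stored row and page counts for every table
-- in the current database (0 means "the current database").
DBCC UPDATEUSAGE (0) WITH COUNT_ROWS;

-- Option 2: correct the counts for a single table while reporting
-- its space usage. dbo.Documents is an assumed table name.
EXEC sp_spaceused @objname = N'dbo.Documents', @updateusage = N'TRUE';
```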

#3


0  

The queries should be stored as stored procedures or views, depending on complexity.

For your situation I would look into indexed views.

They let you store both a query and its result set, which is useful for things like aggregations that otherwise cannot be indexed.

As a bonus, the query optimizer "knows" it has this data as well, so if another query asks for a count or something else stored in the view's index (even a query that does not reference the view directly), it can still use that stored data.
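
To make the indexed-view suggestion concrete, here is a minimal sketch. The table dbo.Documents, its Status column, and the view and index names are all hypothetical. The sketch relies on the documented requirements that an indexed view be created WITH SCHEMABINDING, use two-part object names, include COUNT_BIG(*) when it has a GROUP BY, and get a unique clustered index before any other index.

```sql
-- The session SET options required for indexed views (ANSI_NULLS,
-- QUOTED_IDENTIFIER and so on) must be ON when these statements run.

-- Hypothetical view that groups the big table by an assumed Status column.
CREATE VIEW dbo.vDocumentCountsByStatus
WITH SCHEMABINDING
AS
SELECT Status,
       COUNT_BIG(*) AS RowTotal   -- COUNT_BIG(*) is mandatory with GROUP BY
FROM dbo.Documents                -- two-part name is mandatory
GROUP BY Status;
GO

-- Creating the unique clustered index is what materializes the result set.
CREATE UNIQUE CLUSTERED INDEX IX_vDocumentCountsByStatus
    ON dbo.vDocumentCountsByStatus (Status);
GO

-- Read the stored counts. NOEXPAND forces use of the view's index, which
-- matters on editions where the optimizer does not match indexed views
-- automatically.
SELECT Status, RowTotal
FROM dbo.vDocumentCountsByStatus WITH (NOEXPAND);
```

One trade-off to keep in mind: SQL Server maintains the view's index on every write to the base table, so the weekly load pays a small cost in exchange for counts that are always current.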