What is the best way to store historical data in SQL Server 2005/2008?

Time: 2022-09-16 13:08:57

My simplified and contrived example is the following:-

Let's say that I want to measure and store the temperature (and other values) of all the world's towns on a daily basis. I am looking for an optimal way of storing the data so that it is just as easy to get the current temperature in all the towns as it is to get the full temperature history of one town.

It is an easy enough problem to solve, but I am looking for the best solution.

The 2 main options I can think of are as follows:-

Option 1 - Same table stores current and historical records

Store all the current and archive records in the same table.

i.e.

CREATE TABLE [dbo].[WeatherMeasurement](
  MeasurementID [int] Identity(1,1) NOT Null,
  TownID [int] Not Null,
  Temp [int] NOT Null,
  Date [datetime] NOT Null,
)

This would keep everything simple, but what would be the most efficient query to get a list of towns and their current temperatures? Would this scale once the table has millions of rows in it? Is there anything to be gained by having some sort of IsCurrent flag in the table?

Option 2 - Store all archive records in a separate table

There would be a table to store the current live measurements in:

CREATE TABLE [dbo].[WeatherMeasurement](
  MeasurementID [int] Identity(1,1) NOT Null,
  TownID [int] Not Null,
  Temp [int] NOT Null,
  Date [datetime] NOT Null,
)

And a table to store historical archived data (inserted by a trigger perhaps):

CREATE TABLE [dbo].[WeatherMeasurementHistory](
  MeasurementID [int] Identity(1,1) NOT Null,
  TownID [int] Not Null,
  Temp [int] NOT Null,
  Date [datetime] NOT Null,
)

This has the advantages of keeping the main current data lean, and very efficient to query, at the expense of making the schema more complex and inserting data more expensive.
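
As a rough sketch of the trigger idea, assuming the live table holds one row per town that is updated in place each day (an assumption; the schema above does not enforce it), an AFTER UPDATE trigger could copy the outgoing values into the history table. The trigger name is made up for illustration.

```sql
-- Hypothetical trigger: archives the previous values whenever a live row is updated.
-- Assumes dbo.WeatherMeasurement and dbo.WeatherMeasurementHistory exist as defined above.
CREATE TRIGGER dbo.trg_WeatherMeasurement_Archive
ON dbo.WeatherMeasurement
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;

    -- "deleted" holds the pre-update image of every affected row,
    -- so a multi-row update is archived with one set-based insert.
    INSERT INTO dbo.WeatherMeasurementHistory (TownID, Temp, [Date])
    SELECT d.TownID, d.Temp, d.[Date]
    FROM deleted AS d;
END;
```

This is where the extra insert cost comes from: every update of the live table also writes to the history table inside the same transaction.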

Which is the best option? Are there better options I haven't mentioned?

NOTE: I have simplified the schema to help focus my question better, but assume there will be a lot of data inserted each day (100,000s of records), and data is current for one day. The current data is just as likely to be queried as the historical.

7 Answers

#1


12  

It depends on the application's usage patterns. If usage patterns indicate that the historical data will be queried more often than the current values, then put them all in one table. But if historical queries are the exception (or less than 10% of the queries), and the performance of the more common current-value query will suffer from putting all data in one table, then it makes sense to separate that data into its own table.

#2


5  

I would keep the data in one table unless you have a very serious bias for current data (in usage) or history data (in volume). A compound index with DATE + TOWNID (in that order) would remove the performance concern in most cases (although clearly we don't have the data to be sure of this at this time).
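
To make the index suggestion concrete, here is a minimal sketch against the single-table schema from the question. The index name is illustrative, and "current" is assumed to mean "measured today", which follows the question's note that data is current for one day.

```sql
-- Compound index as suggested: date first, then town.
CREATE INDEX IX_WeatherMeasurement_Date_TownID
    ON dbo.WeatherMeasurement ([Date], TownID);

-- Truncate GETDATE() to midnight (this idiom works on SQL Server 2005,
-- which has no separate date type).
DECLARE @today datetime;
SET @today = DATEADD(day, DATEDIFF(day, 0, GETDATE()), 0);

-- Current temperature for every town: a single range seek on today's date.
SELECT TownID, Temp, [Date]
FROM dbo.WeatherMeasurement
WHERE [Date] >= @today
  AND [Date] <  DATEADD(day, 1, @today);
```

A query for the full history of one town filters on TownID, the second key column, so whether this index serves it well is something to check in the execution plan rather than assume.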

The one thing I would wonder about is whether anyone will want both the current and the history data for a town. If so, splitting the tables means you have just created at least one new view to worry about, and a possible performance problem in that direction.
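
For illustration, the extra view for the two-table layout might look like the sketch below. The view name and the IsCurrent column are assumptions of mine, not part of the question's schema.

```sql
-- Hypothetical view combining live and archived rows (Option 2 schema).
CREATE VIEW dbo.WeatherMeasurementAll
AS
SELECT TownID, Temp, [Date], CAST(1 AS bit) AS IsCurrent
FROM dbo.WeatherMeasurement
UNION ALL
SELECT TownID, Temp, [Date], CAST(0 AS bit) AS IsCurrent
FROM dbo.WeatherMeasurementHistory;
GO

-- Full timeline for one town (42 is just an example id).
SELECT [Date], Temp, IsCurrent
FROM dbo.WeatherMeasurementAll
WHERE TownID = 42
ORDER BY [Date];
```

Every query through the view touches both tables, which is the performance concern mentioned above.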

This is unfortunately one of those things where you may need to profile your solutions against real world data. I personally have used compound indexes such as specified above in many cases, and yet there are a few edge cases where I have opted to break the history into another table. Well, actually another data file, because the problem was that the history was so dense that I created a new data file for it alone to avoid bloating the entire primary data file set. Performance issues are rarely solved by theory.

I would recommend reading up on query hints for index use, and "covering indexes" for more information about performance issues.
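
As a starting point for that reading, here is a small sketch of both ideas on the question's table. The index name and the town id are made up, and leading with TownID is my choice for the "history of one town" query, not something prescribed above.

```sql
-- Covering index for "history of one town": the key columns locate the rows
-- and INCLUDE carries Temp, so the query never has to visit the base table.
CREATE INDEX IX_WeatherMeasurement_TownID_Date
    ON dbo.WeatherMeasurement (TownID, [Date])
    INCLUDE (Temp);

-- Index hint: overrides the optimizer's choice of index.
-- Usually only worth it for experiments while profiling.
SELECT [Date], Temp
FROM dbo.WeatherMeasurement WITH (INDEX (IX_WeatherMeasurement_TownID_Date))
WHERE TownID = 42   -- example town id
ORDER BY [Date];
```

Comparing the plan with and without the hint against realistic data volumes is a quick way to see whether the optimizer already picks the covering index on its own.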

#3


3  

Your table is very narrow and would probably perform well as a single, properly indexed table; in a traditional normalized OLTP model it would never outstrip the capacity of SQL Server, even with millions and millions of rows. Any advantages of the dual-table model can largely be had by using table partitioning in SQL Server, so it doesn't have much to recommend it over the single-table model. This would be an Inmon-style or "Enterprise Data Warehouse" scenario.
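
A minimal sketch of what partitioning by date could look like follows. The function, scheme, and table names and the yearly boundaries are all illustrative, and as far as I know table partitioning needs Enterprise edition on SQL Server 2005/2008.

```sql
-- One partition per year; the boundary values are examples only.
CREATE PARTITION FUNCTION pfMeasurementYear (datetime)
AS RANGE RIGHT FOR VALUES ('20070101', '20080101', '20090101');

-- Map every partition to the PRIMARY filegroup to keep the sketch simple.
CREATE PARTITION SCHEME psMeasurementYear
AS PARTITION pfMeasurementYear ALL TO ([PRIMARY]);

-- Same columns as the question's table, created on the partition scheme.
CREATE TABLE dbo.WeatherMeasurementPartitioned (
    MeasurementID int IDENTITY(1,1) NOT NULL,
    TownID        int      NOT NULL,
    Temp          int      NOT NULL,
    [Date]        datetime NOT NULL
) ON psMeasurementYear ([Date]);
```

Queries that filter on [Date] can then be limited to the relevant partitions while the application still sees a single table.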

In much bigger scenarios, I would transfer the data to a data warehouse (modeled with a Kimball-style dimensional model) on a regular basis and simply purge the live data - in some simple scenarios like yours, there might effectively be NO live data - it all goes straight into the warehouse. The dimensional model has a lot of advantages when slicing data different ways and storing huge numbers of facts with a variety of dimensions. Even in the data warehouse scenario, often fact tables are partitioned by date.

It might not seem like your data has this (Town and Date are your only explicit dimensions). However, in most data warehouses dimensions can either be snowflaked or carry some redundancy, so other attributes of the fact would be stored at load time instead of being snowflaked, for efficiency: State, Zip Code, WasItRaining, IsStationUrban (contrived).

This might seem silly, but when you start to mine the data for results in data warehouses, this makes asking questions like - on a day with rain in urban environments, what was the average temperature in Maine? - just that little bit easier to get at without joining a whole bunch of tables (i.e. it doesn't require a lot of expertise on your normalized model and performs very quickly). Kind of like useless stats in baseball - but some apparently turn out to be useful.
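
To show what I mean, here is a contrived sketch. The fact table name, the MeasuredOn column, and the column types are assumptions; only State, Zip Code, WasItRaining and IsStationUrban come from the list above.

```sql
-- Hypothetical denormalized fact table: the extra attributes are stored
-- on each fact row at load time instead of being reached through joins.
CREATE TABLE dbo.FactWeatherMeasurement (
    TownID         int         NOT NULL,
    MeasuredOn     datetime    NOT NULL,
    Temp           int         NOT NULL,
    [State]        varchar(50) NOT NULL,
    ZipCode        varchar(10) NOT NULL,
    WasItRaining   bit         NOT NULL,
    IsStationUrban bit         NOT NULL
);

-- "On a day with rain in urban environments, what was the average temperature in Maine?"
-- The CAST avoids integer averaging, since Temp is an int.
SELECT AVG(CAST(Temp AS decimal(9, 2))) AS AvgTemp
FROM dbo.FactWeatherMeasurement
WHERE [State] = 'Maine'
  AND WasItRaining = 1
  AND IsStationUrban = 1;
```

The question is answered from one table with plain filters, which is the "little bit easier" part.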

#4


1  

Another alternative could be to go for one table for all data and have a view for the current temperature. This will not help performance but could well aid readability/maintainability. You could even go for an indexed view to improve performance if you have the appropriate version of SQL Server.
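
A sketch of such a view, assuming "current" means the most recent measurement per town (the view name is made up):

```sql
-- Hypothetical view: the latest measurement for each town.
CREATE VIEW dbo.CurrentTemperature
AS
SELECT TownID, Temp, [Date]
FROM (
    SELECT TownID, Temp, [Date],
           -- Number each town's rows from newest to oldest.
           ROW_NUMBER() OVER (PARTITION BY TownID ORDER BY [Date] DESC) AS rn
    FROM dbo.WeatherMeasurement
) AS ranked
WHERE rn = 1;
```

As far as I know, this particular definition cannot be indexed, because indexed views do not allow ranking functions or derived tables, so an indexed variant would need a different formulation.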

#5


0  

I suggest keeping everything in the same table, since historical data is queried just as often, unless you will be adding many more columns to the table.

When size becomes an issue, you can partition it out by decade and have a stored procedure union the requested rows.
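
Roughly, and assuming two hypothetical per-decade tables with the same columns as dbo.WeatherMeasurement, the stored procedure could look like this:

```sql
-- Assumed to exist (names are illustrative):
--   dbo.WeatherMeasurement_2000s, dbo.WeatherMeasurement_2010s
CREATE PROCEDURE dbo.GetTownHistory
    @TownID int,
    @From   datetime,
    @To     datetime
AS
BEGIN
    SET NOCOUNT ON;

    SELECT TownID, Temp, [Date]
    FROM dbo.WeatherMeasurement_2000s
    WHERE TownID = @TownID AND [Date] >= @From AND [Date] < @To
    UNION ALL
    SELECT TownID, Temp, [Date]
    FROM dbo.WeatherMeasurement_2010s
    WHERE TownID = @TownID AND [Date] >= @From AND [Date] < @To
    ORDER BY [Date];   -- applies to the combined result
END;
```

Callers only see the procedure, so adding a table for a new decade means changing one place.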

#6


0  

I would use a single table with indexed views to provide me with the latest information. SQL Server 2005 and 2008 are designed for data warehousing, so they should perform well under this condition.

If you have a data pattern that requires writing to the db often, then the best choice would be to have an active table and archive table that you batch update at some interval.
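
One way to do the batch move, sketched against the two tables from Option 2 and assuming anything older than today counts as archive material:

```sql
-- Cutoff: midnight today (idiom that also works on SQL Server 2005).
DECLARE @cutoff datetime;
SET @cutoff = DATEADD(day, DATEDIFF(day, 0, GETDATE()), 0);

-- DELETE ... OUTPUT INTO removes the old rows from the active table and
-- writes them to the archive in a single atomic statement.
DELETE FROM dbo.WeatherMeasurement
OUTPUT deleted.TownID, deleted.Temp, deleted.[Date]
  INTO dbo.WeatherMeasurementHistory (TownID, Temp, [Date])
WHERE [Date] < @cutoff;
```

This could run from a scheduled job at whatever interval suits the write load. Note that OUTPUT INTO has restrictions on the target table (for example, it cannot have enabled triggers or take part in foreign key constraints), so check those against your archive table.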

#7


0  

If you store everything in one table, how are you going to make a relational database?

Example:

id--------------GUID----PK

record_id-------GUID

Every time a new record is inserted, the [id] will change but the [record_id] will remain the same. Now if you have to link it with an address table, how are you going to do that?
