Working with large datasets using SQL Server

Time: 2021-08-11 21:42:04

I'm looking to manage a large dataset of log files. I'm trying to keep an average of 1.5 million new events per month. I've used Access in the past, though it's clearly not meant for this, and managing the dataset is a nightmare because I have to split it up by month.

For the most part, I just need to filter by event type and count the results. But before I do a bunch of work on the data import side of things, I wanted to see if anyone can confirm that SQL Server is a good choice for this. Is there a row count I should stay under by archiving entries? Is there a way of archiving entries?

The other part is that I'm entering logs from multiple sources. With this number of entries, is it wise to put them all into the same table, or should each source have its own table to make queries faster?


edit...
There would be no joins, and about 10 columns. Data would be filtered through a view, and I'd like to know whether a select query that filters on one or more columns would have a reasonable response time. Does creating a set of views speed things up for frequent queries?

2 solutions

#1


5  

In my experience, SQL Server is a fine choice for this, and you can definitely expect better performance from SQL Server than MS-Access, with generally more optimization methods at your disposal.

I would probably go ahead and put this stuff into SQL Server Express as you've said, hopefully installed on the best machine you can use (though you did mention only 2GB of RAM). Use one table so long as it only represents one thing (I would think a pilot's flight log and a software error log wouldn't be in the same "log" table, as an absurdly contrived example). Check your performance. If it's an issue, move forward with any number of optimization techniques available to your edition of SQL Server.

Here's how I would probably do it initially:

Create your table with a non-clustered primary key, if you use a PK on your log table -- I normally use an identity column to give me a guaranteed order of events (unlike datetimes, which can repeat) and to show possible log insert failures (missing identities). Set a clustered index on the main datetime column (you mentioned that you're already splitting into separate tables by month, so I assume you'll query this way, too). If you have a few queries that you run on this table routinely, by all means make views of them, but don't expect a speedup simply from doing so. You'll more than likely want to look at indexing your table based upon the WHERE clauses in those queries. This is where you'll be giving SQL Server the information it needs to run those queries efficiently.
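
As a rough sketch of the indexing advice above: the snippet below uses Python's built-in sqlite3 module purely as a runnable stand-in, so the syntax is not T-SQL, and SQLite has no direct equivalent of a SQL Server clustered index on a non-key column. The table name event_log, its columns, and the index names are all hypothetical. The point it illustrates is matching an index to the WHERE clause of a routine query and then checking the query plan to see whether the index is actually used.

```python
import sqlite3


def build_log_db():
    """Create an in-memory log table with indexes matching the expected filters."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE event_log (
            id         INTEGER PRIMARY KEY AUTOINCREMENT,  -- stand-in for an identity column
            event_time TEXT NOT NULL,                      -- ISO-8601 timestamp (assumed format)
            source     TEXT NOT NULL,
            event_type TEXT NOT NULL
        );
        -- Index for date-range queries (the role the clustered index plays in the answer).
        CREATE INDEX ix_event_log_time ON event_log (event_time);
        -- Index shaped like the routine query: filter by type, then by date range.
        CREATE INDEX ix_event_log_type_time ON event_log (event_type, event_time);
    """)
    return conn


def plan_for(conn, sql, params=()):
    """Return the query plan text so you can check whether an index is used."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column holds the plan detail


conn = build_log_db()
query = ("SELECT COUNT(*) FROM event_log "
         "WHERE event_type = ? AND event_time >= ? AND event_time < ?")
plan = plan_for(conn, query, ("error", "2021-08-01", "2021-09-01"))
print(plan)
```

In SQL Server itself, the equivalent check would be looking at the execution plan for the query rather than calling EXPLAIN QUERY PLAN.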

If you're unable to get your desired performance through optimizing your queries, indexes, using the smallest possible datatypes (especially on your indexed columns) and running on decent hardware, it may be time to try partitioned views (which require some form of ongoing maintenance) or partitioning your table. Unfortunately, SQL Server Express may limit you on what you can do with partitioning, and you'll have to decide if you need to move to a more feature-filled edition of SQL Server. You could always test partitioning with the Enterprise evaluation or Developer editions.

Update:

For the most part, I just need to filter event types and count the number.

Since past logs don't change (sort of like past sales data), storing the past aggregate numbers is an often-used strategy in this scenario. You can create a table which simply stores your counts for each month and insert new counts once a month (or week, day, etc.) with a scheduled job of some sort. Using the clustered index on your datetime column, SQL Server could much more easily aggregate the current month's numbers from the live table and add them to the stored aggregates for displaying the current values of total counts and such.
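
The stored-aggregate idea might look something like the following. Again this is only a sketch using Python's sqlite3 as a stand-in for SQL Server; the tables event_log and monthly_counts, the sample rows, and the helper names roll_up_month and total_count are invented for illustration. Past months are counted once and stored, and only the current month is counted from the live table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE event_log (event_time TEXT NOT NULL, event_type TEXT NOT NULL);
    CREATE INDEX ix_event_log_time ON event_log (event_time);
    CREATE TABLE monthly_counts (
        month      TEXT NOT NULL,     -- 'YYYY-MM'
        event_type TEXT NOT NULL,
        n          INTEGER NOT NULL,
        PRIMARY KEY (month, event_type)
    );
""")
conn.executemany("INSERT INTO event_log VALUES (?, ?)", [
    ("2021-07-03 10:00:00", "error"),
    ("2021-07-19 11:30:00", "error"),
    ("2021-07-20 09:15:00", "login"),
    ("2021-08-02 08:00:00", "error"),
])


def roll_up_month(conn, month_start, next_month_start):
    """Scheduled-job step: store the per-type counts for one closed month."""
    conn.execute("""
        INSERT OR REPLACE INTO monthly_counts (month, event_type, n)
        SELECT substr(event_time, 1, 7), event_type, COUNT(*)
        FROM event_log
        WHERE event_time >= ? AND event_time < ?
        GROUP BY substr(event_time, 1, 7), event_type
    """, (month_start, next_month_start))


def total_count(conn, event_type, current_month_start):
    """Stored totals for past months plus a live count of the current month."""
    stored = conn.execute(
        "SELECT COALESCE(SUM(n), 0) FROM monthly_counts "
        "WHERE event_type = ? AND month < ?",
        (event_type, current_month_start[:7])).fetchone()[0]
    live = conn.execute(
        "SELECT COUNT(*) FROM event_log "
        "WHERE event_type = ? AND event_time >= ?",
        (event_type, current_month_start)).fetchone()[0]
    return stored + live


roll_up_month(conn, "2021-07-01", "2021-08-01")
print(total_count(conn, "error", "2021-08-01"))
```

With the sample rows above, the two July errors come from the stored table and the one August error comes from the live table.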

#2


1  

Sounds like one table to me, which would need indexes on exactly the sets of columns you will filter on. Restricting access through views is generally a good idea and helps ensure that queries are written in a way that actually uses your indexes.

Putting each source into its own table will require UNION in your queries later, and SQL Server is not very good at optimizing UNION queries.

"Archiving" entries can of course be done manually, by moving the entries in a date range to another table (which can live on another disk or in another database), or by using "partitioning", which means you can put parts of a table (e.g. defined by date ranges) on different disks. You have to plan for the partitions when you plan your SQL Server installation.
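
The manual approach could be sketched roughly as below, once more with Python's sqlite3 standing in for SQL Server. The table names event_log and event_log_archive, the sample rows, and the helper archive_before are hypothetical. The copy and the delete run in one transaction so a failure cannot leave rows in both tables or in neither.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE event_log (
        id INTEGER PRIMARY KEY, event_time TEXT NOT NULL, event_type TEXT NOT NULL);
    CREATE TABLE event_log_archive (
        id INTEGER PRIMARY KEY, event_time TEXT NOT NULL, event_type TEXT NOT NULL);
""")
conn.executemany(
    "INSERT INTO event_log (event_time, event_type) VALUES (?, ?)", [
        ("2021-06-30 23:59:00", "error"),
        ("2021-07-15 12:00:00", "login"),
        ("2021-08-02 08:00:00", "error"),
    ])
conn.commit()


def archive_before(conn, cutoff):
    """Move rows older than cutoff into the archive table in one transaction."""
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute(
            "INSERT INTO event_log_archive "
            "SELECT id, event_time, event_type FROM event_log WHERE event_time < ?",
            (cutoff,))
        moved = conn.execute(
            "DELETE FROM event_log WHERE event_time < ?", (cutoff,)).rowcount
    return moved


moved = archive_before(conn, "2021-08-01")
print(moved)
```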

Be aware that Express edition is limited to 4GB per database (raised to 10GB from SQL Server 2008 R2 onward), so at 1.5 million rows per month this could be a problem.
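
To get a feel for how soon a size cap like this would bite, a back-of-the-envelope estimate helps. The 200 bytes per row used below is purely an assumed figure covering data plus index overhead; measure your own rows before relying on it.

```python
def months_until_full(limit_bytes, rows_per_month, bytes_per_row):
    """Estimate how many months of inserts fit under a database size limit."""
    return limit_bytes / (rows_per_month * bytes_per_row)


GIB = 1024 ** 3
ROWS_PER_MONTH = 1_500_000   # from the question
BYTES_PER_ROW = 200          # assumption: average row size including index overhead

months = months_until_full(4 * GIB, ROWS_PER_MONTH, BYTES_PER_ROW)
print(f"{months:.1f} months until a 4 GB limit is reached")
```

Under that assumption the 4GB cap is reached after a little over 14 months, which is why archiving or storing aggregates is worth planning for up front.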

I have a table like yours with 20M rows and have few problems querying it, or even joining against it, as long as the indexes are used.
