Database schema design - tips for improving archiving capability?

Time: 2022-06-09 03:38:09

I am designing a table in the database which will store log entries from the application. There are a few things which are making me think about this design more than usual.


  • These log entries will be used at runtime by the system to make decisions, so they need to be relatively fast to access.

  • They also have the problem that there are going to be lots of them (12.5 million added per month is my estimate).

  • I don't need more than the last 30 to 45 days at most for the decision processing.

  • I need to keep all of them for much longer than 45 days for support & legal issues, likely at least 2 years.

  • The table design is fairly simple: all simple types (no blobs or anything), the database engine supplying default values where possible, and at most one foreign key.

  • If it makes any difference, the database will be Microsoft SQL Server 2005.

What I was thinking is having them written to a live table/database and then using an ETL solution to move "old" entries to an archive table/database, which is big and on slower hardware.


My question is: do you know of any tips, tricks, or suggestions for the database/table design to make sure this works as well as possible? Also, if you think it's a bad idea, please let me know, along with what you think a better idea would be.


4 Solutions

#1


Some databases offer "partitions" (Oracle, for example). A partition is like a view which collects several tables with an identical definition into one. You can define criteria which sort new data into the different tables (for example, the month or week-of-year % 6).


From a user point of view, this is just one table. From the database PoV, it's several independent tables, so you can run full table commands (like truncate, drop, delete from table (without a condition), load/dump, etc.) against them in an efficient manner.


If you can't have a partition, you can get a similar effect with views. In this case, you collect several tables in a single view and redefine this view, say, once a month to "free" one table with old data from the rest. Now you can efficiently archive this table, clear it, and attach it to the view again once the heavy lifting is done. This should help greatly to improve performance.


[EDIT] SQL Server 2005 onwards (Enterprise Edition) supports partitions. Thanks to Mitch Wheat.
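The view-based fallback described above corresponds to what SQL Server calls a partitioned view. A minimal sketch, assuming illustrative table and column names (the CHECK constraint on the date column is what lets the optimizer skip member tables that can't match a query):

```sql
-- One table per month; the CHECK constraint defines the "partition".
CREATE TABLE dbo.LogEntries_200806 (
    LogId     BIGINT       NOT NULL,
    LoggedAt  DATETIME     NOT NULL
        CONSTRAINT CK_LogEntries_200806
        CHECK (LoggedAt >= '20080601' AND LoggedAt < '20080701'),
    Message   VARCHAR(512) NOT NULL,
    CONSTRAINT PK_LogEntries_200806 PRIMARY KEY (LoggedAt, LogId)
);

-- The partitioned view unions the member tables; queries see one table.
CREATE VIEW dbo.LogEntries AS
    SELECT LogId, LoggedAt, Message FROM dbo.LogEntries_200806
    UNION ALL
    SELECT LogId, LoggedAt, Message FROM dbo.LogEntries_200807;
```

Archiving a month then means dropping its SELECT from the view, moving the table, and dropping it, without touching the live data.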


#2


Big tables slow down quickly, and it's a big performance overhead to use ETL to pull data by date from a big table and then delete the old rows. The answer to this is to use multiple tables, probably 1 table/month based on your figures. Of course you'll need some logic to generate the table names within your queries.


I agree with using triggers to populate the 'CurrentMonthAudit' table. At the end of the month, you can then rename that table to MonthAuditYYYYMM. Moving old tables off your main server using ETL will then be easy, and each of your tables will be manageable. Trust me, this is much better than trying to manage a single table with approx 250M rows.
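The month-end swap could be sketched like this (table and column names are illustrative assumptions):

```sql
-- At month end, rename the live table out of the way...
EXEC sp_rename 'dbo.CurrentMonthAudit', 'MonthAudit200806';

-- ...and immediately recreate an empty live table so inserts continue.
CREATE TABLE dbo.CurrentMonthAudit (
    AuditId   BIGINT IDENTITY NOT NULL PRIMARY KEY,
    LoggedAt  DATETIME        NOT NULL DEFAULT GETDATE(),
    Message   VARCHAR(512)    NOT NULL
);

-- MonthAudit200806 can now be ETL'd to the archive server and dropped
-- at leisure, without blocking writers on the live table.
```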


#3


Your first good decision is keeping everything as simple as possible.


I've had good luck with your pattern of a simple write-only transaction log file where the records are just laid down in chronological order. Then you have several options for switching out aged data. Even having monthly disparate tables is manageable query-wise as long as you keep simplicity in mind. If you have any kind of replication in operation, your replicated tables can be rolled out and serve as the archive. Then start with a fresh empty table at the first of each month.


Normally I shudder at the relational design consequences of doing something like this, but I've found that write-only chronological log tables are an exception to the usual design patterns, for the reasons you are dealing with here.


But stay away from triggers, as far as possible. The simplest solution is a primary table of the type you're talking about here, with a simple, robust, off-the-shelf, time-proven replication mechanism.


(BTW - Large tables don't slow down quickly if they are well designed - they slow down slowly.)
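A write-only chronological log table of the kind described here might look like this (names are illustrative; the point is the monotonically increasing clustered key, so inserts always land at the end of the table and never cause page splits mid-index):

```sql
CREATE TABLE dbo.AppLog (
    LogId     BIGINT IDENTITY NOT NULL,
    LoggedAt  DATETIME        NOT NULL DEFAULT GETDATE(),
    Severity  TINYINT         NOT NULL,
    Message   VARCHAR(512)    NOT NULL,
    CONSTRAINT PK_AppLog PRIMARY KEY CLUSTERED (LogId)
);

-- Decision processing only touches the last 30-45 days, so a
-- nonclustered index on the timestamp keeps those range scans cheap.
CREATE NONCLUSTERED INDEX IX_AppLog_LoggedAt ON dbo.AppLog (LoggedAt);
```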


#4


If you do not need to search the recent log records, there is another option: Don't use a database at all. Instead, write the log info to a file and rotate the filename every night. When a file has been written, you can then start a background job to import the data directly into the archive database.


Databases are not always the best option, especially for log files :)
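The nightly import of a rotated file could be as simple as a bulk load into the archive (the file path, table name, and field format here are assumptions):

```sql
-- Background job: load yesterday's rotated log file into the archive.
BULK INSERT ArchiveDb.dbo.LogArchive
FROM 'D:\logs\app-20080601.log'
WITH (
    FIELDTERMINATOR = '\t',
    ROWTERMINATOR   = '\n',
    TABLOCK            -- allows minimal logging during the bulk load
);
```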

