Aggregates on large databases: best platform?

Date: 2023-01-21 09:10:42

I have a postgres database with several million rows, which drives a web app. The data is static: users don't write to it.


I would like to be able to offer users query-able aggregates (e.g. the sum of all rows with a certain foreign key value), but the size of the database now means it takes 10-15 minutes to calculate such aggregates.


Should I:

  1. start pre-calculating aggregates in the database (since the data is static), or
  2. move away from postgres and use something else?

The only problem with option 1 is that I don't necessarily know which aggregates users will want, and it will obviously increase the size of the database even further.


If there was a better solution than postgres for such problems, then I'd be very grateful for any suggestions.


7 solutions

#1


3  

You are trying to solve an OLAP (On-Line Analytical Processing) database structure problem with an OLTP (On-Line Transaction Processing) database structure.

You should build another set of tables that store just the aggregates and update these tables in the middle of the night. That way your customers can query the aggregate set of tables and it won't interfere with the online transaction processing system at all.

The only caveat is that the aggregate data will always be one day behind.
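The nightly job can be sketched as follows. This is a minimal illustration using SQLite in place of Postgres, and the schema (an `orders` table rolled up into `orders_by_customer`) is made up here, not taken from the question:

```python
import sqlite3

# Illustrative base table; in the real setup this is the multi-million-row table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO orders VALUES (1, 10.0), (1, 25.0), (2, 7.5);
""")

def refresh_aggregates(conn):
    # Run this from a scheduled job in the middle of the night.
    conn.executescript("""
        DROP TABLE IF EXISTS orders_by_customer;
        CREATE TABLE orders_by_customer AS
            SELECT customer_id, SUM(amount) AS total, COUNT(*) AS n
            FROM orders
            GROUP BY customer_id;
    """)

refresh_aggregates(conn)

# User queries hit the small aggregate table instead of the big base table.
total = conn.execute(
    "SELECT total FROM orders_by_customer WHERE customer_id = 1").fetchone()[0]
print(total)
```

In Postgres specifically, a materialized view refreshed on the same schedule would achieve the same effect without the DROP/CREATE dance.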

#2


1  

  1. Yes
  2. Possibly. Presumably there is a whole heap of things you would need to consider before changing your RDBMS. If you moved to SQL Server, you could use indexed views to accomplish this: Improving Performance with SQL Server 2008 Indexed Views

#3


0  

If you store the aggregates in an intermediate Object (something like MyAggragatedResult), you could consider a caching proxy:


    class ResultsProxy {
        Result calculateResult(Object param1, Object param2) {
            Result r = cache.get(key(param1, param2));  // retrieve from cache
            if (r == null) {                            // if not found, calculate and store in cache
                r = doCalculate(param1, param2);
                cache.put(key(param1, param2), r);
            }
            return r;
        }
    }

There are quite a few caching frameworks for Java, and most likely for other languages/environments such as .NET as well. These solutions can take care of invalidation (how long a result should be stored in memory) and memory management (removing old cache items when a memory limit is reached, etc.).

#4


0  

If you have a set of commonly-queried aggregates, it might be best to create an aggregate table that is maintained by triggers (or an observer pattern tied to your OR/M).


Example: say you're writing an accounting system. You keep all the debits and credits in a General Ledger table (GL). Such a table can quickly accumulate tens of millions of rows in a busy organization. To find the balance of a particular account on the balance sheet as of a given day, you would normally have to calculate the sum of all debits and credits to that account up to that date, a calculation that could take several seconds even with a properly indexed table. Calculating all figures of a balance sheet could take minutes.


Instead, you could define an account_balance table. For each account and dates or date ranges of interest (usually each month's end), you maintain a balance figure by using a trigger on the GL table to update balances by adding each delta individually to all applicable balances. This spreads the cost of aggregating these figures over each individual persistence to the database, which will likely reduce it to a negligible performance hit when saving, and will decrease the cost of getting the data from a massive linear operation to a near-constant one.

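The trigger approach above can be sketched in a few lines. This uses SQLite for brevity rather than a production RDBMS, and the schema (a bare-bones `gl` ledger and `account_balance` table) is illustrative, not a full general-ledger design:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gl (account TEXT, amount REAL);
    CREATE TABLE account_balance (account TEXT PRIMARY KEY, balance REAL);

    -- Spread the aggregation cost over each individual insert into the ledger.
    CREATE TRIGGER gl_after_insert AFTER INSERT ON gl
    BEGIN
        INSERT OR IGNORE INTO account_balance VALUES (NEW.account, 0);
        UPDATE account_balance
           SET balance = balance + NEW.amount
         WHERE account = NEW.account;
    END;
""")

conn.executemany("INSERT INTO gl VALUES (?, ?)",
                 [("cash", 100.0), ("cash", -30.0), ("sales", 250.0)])

# Near-constant-time lookup instead of summing the whole ledger.
balance = conn.execute(
    "SELECT balance FROM account_balance WHERE account = 'cash'").fetchone()[0]
print(balance)
```

A real version would key balances by period as well as account, as the answer describes, but the shape of the trigger is the same.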

#5


0  

For that data volume you shouldn't have to move off Postgres.


I'd look at tuning first - 10-15 minutes seems pretty excessive for 'a few million rows'. This ought to take just a few seconds. Note that the out-of-the-box config settings for Postgres don't (or at least didn't) allocate much disk buffer memory. You might look at that as well.
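For reference, the settings in question live in postgresql.conf. The values below are only illustrative starting points for a dedicated database machine, not recommendations for any particular setup:

```
# postgresql.conf -- illustrative values, tune for your hardware
shared_buffers = 256MB          # shared disk buffer cache; the stock default is tiny
work_mem = 32MB                 # per-operation sort/hash memory; helps large aggregations
effective_cache_size = 2GB      # planner hint about how much the OS is caching
```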

More complex solutions involve implementing some sort of data mart or an OLAP front-end such as Mondrian over the database. The latter does pre-calculate aggregates and caches them.


#6


0  

If you have a set of common aggregates, you can calculate them beforehand (say, once a week) in a separate table and/or columns, and users get them fast.

But I'd pursue the tuning route too - revise your indexing strategy. As your database is read-only, you don't need to worry about index-update overhead.
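The indexing point is easy to demonstrate: an index on the foreign-key column lets the aggregate touch only the matching rows instead of scanning the whole table. A small sketch using SQLite (the `facts`/`fk_id` schema is made up here to stand in for the question's table):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (fk_id INTEGER, value REAL)")
conn.executemany("INSERT INTO facts VALUES (?, ?)",
                 [(i % 100, 1.0) for i in range(10_000)])

# Read-only data, so there is no index-maintenance cost after the build.
conn.execute("CREATE INDEX idx_facts_fk ON facts (fk_id)")

total = conn.execute(
    "SELECT SUM(value) FROM facts WHERE fk_id = 42").fetchone()[0]
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT SUM(value) FROM facts WHERE fk_id = 42").fetchall()
print(total)
print(plan)  # the plan should show a search using idx_facts_fk
```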

Revise your database configuration as well; maybe you can squeeze some performance out of it - default configurations are normally aimed at easing the life of first-time users and quickly fall short on large databases.

Once you have revised your indexing and database configuration, some denormalization may speed things up further if you still need more performance - but try it as a last resort.

#7


0  

Oracle supports a concept called Query Rewrite. The idea is this:


When you want a lookup (WHERE ID = val) to go faster, you add an index. You don't have to tell the optimizer to use the index - it just does. You don't have to change the query to read FROM the index... you hit the same table you always did, but now, instead of reading every block in the table, it reads a few index blocks and knows where to go in the table.

Imagine if you could add something like that for aggregation. Something that the optimizer would just 'use' without being told to change. Let's say you have a table called DAILY_SALES for the last ten years. Some sales managers want monthly sales, some want quarterly, some want yearly.


You could maintain a bunch of extra tables that hold those aggregations and tell the users to change their queries to use a different table. In Oracle, you'd instead build those as materialized views. You do no work except defining the MV and an MV log on the source table. Then if a user queries DAILY_SALES for a sum by month, Oracle will rewrite the query to use an appropriate level of aggregation. The key is that this happens WITHOUT changing the query at all.
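In Oracle terms the setup is roughly the following sketch. The table and column names are illustrative, and a fast-refresh, rewrite-enabled MV has additional requirements (an MV log, extra COUNT columns) not shown here:

```sql
-- Monthly rollup of the source table; with query rewrite enabled, the
-- optimizer may substitute it transparently for matching queries
-- against DAILY_SALES.
CREATE MATERIALIZED VIEW monthly_sales
    BUILD IMMEDIATE
    REFRESH COMPLETE ON DEMAND
    ENABLE QUERY REWRITE
AS
SELECT TRUNC(sale_date, 'MM') AS month, SUM(amount) AS total_sales
  FROM daily_sales
 GROUP BY TRUNC(sale_date, 'MM');
```

Query rewrite also has to be switched on (the QUERY_REWRITE_ENABLED parameter) for the optimizer to consider the MV.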

Maybe other DBs support that... but this is clearly what you are looking for.
