Strategies for handling large database tables

Time: 2021-06-02 23:31:27

I'm looking at building a Rails application which will have some pretty large tables with upwards of 500 million rows. To keep things snappy I'm currently looking into how a large table can be split into more manageable chunks. I see that as of MySQL 5.1 there is a partitioning option, and that's a possibility, but I don't like that the column which determines the partitioning has to be part of the table's primary key.

What I'd really like to do is split the table that an AR model writes to based upon the values written, but as far as I am aware there is no way to do this - does anyone have any suggestions as to how I might implement this, or any alternative strategies?

Thanks

Arfon

3 Answers

#1 (score: 5)

Partition columns in MySQL are not limited to primary keys. In fact, a partition column does not have to be a key at all (though one will be created for it transparently). You can partition by RANGE, HASH, KEY and LIST (which is similar to RANGE except that it uses a set of discrete values). Read the MySQL manual for an overview of partitioning types.

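As a rough sketch of the syntax (the table and column names here are made up for illustration), a log-style table with no primary or unique key can be partitioned by RANGE on a date column that is not a key at all:

```sql
-- Hypothetical table: no primary/unique key, so MySQL places no
-- restriction on which column the partitioning expression may use.
CREATE TABLE page_views (
  user_id   INT NOT NULL,
  url       VARCHAR(255),
  viewed_on DATE NOT NULL
) ENGINE=InnoDB
PARTITION BY RANGE (TO_DAYS(viewed_on)) (
  PARTITION p2021h1 VALUES LESS THAN (TO_DAYS('2021-07-01')),
  PARTITION p2021h2 VALUES LESS THAN (TO_DAYS('2022-01-01')),
  PARTITION pmax    VALUES LESS THAN MAXVALUE
);
```

(If the table does have a primary or unique key, MySQL requires the partition column to be part of it, which is the restriction mentioned in the question.)
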
There are alternative solutions such as HScale - a middleware plug-in that transparently partitions tables based on certain criteria. HiveDB is an open-source framework for horizontal partitioning of MySQL.

In addition to sharding and partitioning you should employ some sort of clustering. The simplest setup is a replication-based one that helps you spread the load over several physical servers. You should also consider more advanced clustering solutions such as MySQL Cluster (probably not an option due to the size of your database) and clustering middleware such as Sequoia.

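As a minimal, hedged sketch of the replication side (host names, credentials, and binlog coordinates below are placeholders): on MySQL 5.x you enable the binary log and a unique server-id on the primary, then point each replica at it:

```sql
-- Run on a replica; all connection details here are illustrative only.
-- The primary must have log-bin enabled and a distinct server-id set.
CHANGE MASTER TO
  MASTER_HOST     = 'primary.example.com',
  MASTER_USER     = 'repl',
  MASTER_PASSWORD = 'secret',
  MASTER_LOG_FILE = 'mysql-bin.000001',
  MASTER_LOG_POS  = 4;
START SLAVE;
```

The application then sends writes to the primary and spreads reads across the replicas.
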
I actually asked a relevant question regarding scaling with MySQL here on Stack Overflow some time ago, which I ended up answering myself several days later after collecting a lot of information on the subject. It might be relevant for you as well.

#2 (score: 1)

If you want to split your data by time, the following solution may fit your needs: you can probably use MERGE tables.

Let's assume your table is called MyTable and that you need one table per week:

  1. Your app always logs to the same table.
  2. A weekly job atomically renames your table and recreates an empty one: MyTable is renamed to MyTable-Year-WeekNumber, and a fresh empty MyTable is created (see the sketch after this list).
  3. The MERGE tables are dropped and recreated.

If you want to get all the data of the past three months, you create a merge table which includes only the tables from the last 3 months. Create as many merge tables as you need for different periods. If you can avoid including the table into which data is currently being inserted (MyTable in our example), you'll be even happier, as you won't have any read/write concurrency.

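For example, a hedged sketch of a read-only MERGE table over a few of the weekly tables (the column layout is assumed; the underlying tables must all be MyISAM with identical definitions):

```sql
CREATE TABLE MyTable_recent (
  id         BIGINT NOT NULL,
  payload    VARCHAR(255),
  created_at DATETIME NOT NULL
) ENGINE=MERGE
  UNION = (MyTable_2021_19, MyTable_2021_20, MyTable_2021_21)
  INSERT_METHOD = NO;   -- read-only view over the listed tables
```

Dropping and recreating this definition each week (step 3 above) keeps the window sliding.
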
#3 (score: 1)

You can handle this entirely in Active Record using DataFabric.

It's not that complicated to implement similar behavior yourself if that's not suitable. Google "sharding" for a lot of discussion on the architectural pattern of handling table partitioning within the app tier. It has the advantage of avoiding middleware and not depending on DB-vendor-specific features. On the other hand, it's more code in your app that you're responsible for.
