MySQL performance: multiple tables vs. index on a single table and partitions

Time: 2022-09-16 11:55:44

I am wondering what is more efficient and faster in performance:
Having an index on one big table or multiple smaller tables without indexes?

Since this is a pretty abstract problem let me make it more practical:
I have one table with statistics about users (20,000 users and about 30 million rows overall). The table has about 10 columns including the user_id, actions, timestamps, etc.
Most common applications are: Inserting data by user_id and retrieving data by user_id (SELECT statements never include multiple user_id's).

So far I have an INDEX on user_id, and the query looks something like this:

SELECT * FROM statistics WHERE user_id = 1

Now, with more and more rows the table gets slower and slower. INSERT statements slow down because the INDEX gets bigger and bigger; SELECT statements slow down, well, because there are more rows to search through.

Now I was wondering why not have one statistics table for each user and change the query syntax to something like this instead:

SELECT * FROM statistics_1

where 1 represents the user_id obviously.
This way, no INDEX is needed and there is far less data in each table, so INSERT and SELECT statements should be much faster.

Now my questions again:
Are there any real-world disadvantages to handling so many tables (in my case 20,000) instead of using one table with an INDEX?
Would my approach actually speed things up, or might the lookup for the table eventually slow things down more than anything else?

4 Answers

#1


75  

Creating 20,000 tables is a bad idea. You'll need 40,000 tables before long, and then more.

I called this syndrome Metadata Tribbles in my book SQL Antipatterns. You see this happen every time you plan to create a "table per X" or a "column per X".

This does cause real performance problems when you have tens of thousands of tables. Each table requires MySQL to maintain internal data structures, file descriptors, a data dictionary, etc.

There are also practical operational consequences. Do you really want to create a system that requires you to create a new table every time a new user signs up?

Instead, I'd recommend you use MySQL Partitioning.

Here's an example of partitioning the table:

CREATE TABLE statistics (
  id INT AUTO_INCREMENT NOT NULL,
  user_id INT NOT NULL,
  PRIMARY KEY (id, user_id)
) PARTITION BY HASH(user_id) PARTITIONS 101;

This gives you the benefit of defining one logical table, while also dividing the table into many physical tables for faster access when you query for a specific value of the partition key.

For example, when you run a query like yours, MySQL accesses only the partition containing the specific user_id:

mysql> EXPLAIN PARTITIONS SELECT * FROM statistics WHERE user_id = 1\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: statistics
   partitions: p1    <--- this shows it touches only one partition 
         type: index
possible_keys: NULL
          key: PRIMARY
      key_len: 8
          ref: NULL
         rows: 2
        Extra: Using where; Using index

The HASH method of partitioning means that the rows are placed in a partition by a modulus of the integer partition key. This does mean that many user_id's map to the same partition, but each partition would have only 1/Nth as many rows on average (where N is the number of partitions). And you define the table with a constant number of partitions, so you don't have to expand it every time you get a new user.
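
To make the modulus rule concrete, here is a minimal Python sketch of the mapping described above. It assumes plain HASH partitioning (not LINEAR HASH or KEY), non-negative integer ids, and MySQL's default partition names p0, p1, and so on. The helper name `partition_for` is hypothetical and only for illustration; it is not part of MySQL.

```python
# Sketch of the HASH rule: partition number = partition key modulo N.
NUM_PARTITIONS = 101  # same value as PARTITIONS 101 in the CREATE TABLE above


def partition_for(user_id: int, num_partitions: int = NUM_PARTITIONS) -> str:
    """Return the default partition name (p0, p1, ...) for a non-negative user_id."""
    return f"p{user_id % num_partitions}"


# Several different users can land in the same partition.
for uid in (1, 102, 203, 20000):
    print(uid, "->", partition_for(uid))
```

Here user_id 1, 102, and 203 all map to p1, which matches the `partitions: p1` line in the EXPLAIN output above, while user_id 20000 maps to p2.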

You can choose any number of partitions up to 1024 (or 8192 in MySQL 5.6), but some people have reported performance problems when they go that high.

It is recommended to use a prime number of partitions. In case your user_id values follow a pattern (like using only even numbers), using a prime number of partitions helps distribute the data more evenly.
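
The effect of a prime partition count can be checked with a small simulation. The sketch below assumes a made-up population of 20,000 users whose ids are all even, and compares 100 partitions against 101 using the same modulus rule as HASH partitioning. The function name `partition_counts` is hypothetical.

```python
from collections import Counter


def partition_counts(user_ids, num_partitions):
    """Count how many user ids fall into each partition (id modulo N)."""
    return Counter(uid % num_partitions for uid in user_ids)


# Assumption for illustration: 20,000 users, even ids only (2, 4, ..., 40000).
even_ids = range(2, 40001, 2)

for n in (100, 101):
    counts = partition_counts(even_ids, n)
    print(n, "partitions:", len(counts), "in use, largest holds",
          max(counts.values()), "users")
```

With 100 partitions only the 50 even-numbered partitions receive any users (400 each), so half the partitions stay empty. With 101 partitions all 101 are used and the largest holds 199 users.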

Re your questions in comment:

How could I determine a reasonable number of partitions?

For HASH partitioning, if you use 101 partitions like I show in the example above, then any given partition has about 1% of your rows on average. You said your statistics table has 30 million rows, so if you use this partitioning, you would have only 300k rows per partition. That is much easier for MySQL to read through. You can (and should) use indexes as well -- each partition will have its own index, and it will be only 1% as large as the index on the whole unpartitioned table would be.

So the answer to how can you determine a reasonable number of partitions is: how big is your whole table, and how big do you want the partitions to be on average?
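
That sizing question can be turned into simple arithmetic. The following Python sketch is one possible heuristic, not an official rule: divide the total row count by the average partition size you are aiming for, then round up to the next prime. The names `is_prime` and `suggest_partitions` are hypothetical, and the target sizes are assumptions chosen for illustration.

```python
import math


def is_prime(n: int) -> bool:
    """Trial-division primality test; fine for small partition counts."""
    if n < 2:
        return False
    return all(n % d for d in range(2, math.isqrt(n) + 1))


def suggest_partitions(total_rows: int, target_rows_per_partition: int) -> int:
    """Smallest prime N such that total_rows / N is at most the target size."""
    n = max(2, math.ceil(total_rows / target_rows_per_partition))
    while not is_prime(n):
        n += 1
    return n


total_rows = 30_000_000  # row count given in the question
n = suggest_partitions(total_rows, 300_000)
print(n, "partitions, about", total_rows // n, "rows each")
```

For 30 million rows and a target of 300,000 rows per partition this suggests 101 partitions with roughly 297,029 rows each. A target of 75,000 rows per partition gives 401. Any result still has to stay within the partition limit of your MySQL version.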

Shouldn't the amount of partitions grow over time? If so: How can I automate that?

The number of partitions doesn't necessarily need to grow if you use HASH partitioning. Eventually you may have 30 billion rows total, but I have found that when your data volume grows by orders of magnitude, that demands a new architecture anyway. If your data grow that large, you probably need sharding over multiple servers as well as partitioning into multiple tables.

That said, you can re-partition a table with ALTER TABLE:

ALTER TABLE statistics PARTITION BY HASH(user_id) PARTITIONS 401;

This has to restructure the table (like most ALTER TABLE changes), so expect it to take a while.

You may want to monitor the size of data and indexes in partitions:

SELECT table_schema, table_name, partition_name, table_rows, data_length, index_length
FROM INFORMATION_SCHEMA.PARTITIONS
WHERE partition_method IS NOT NULL;

As with any table, you want the total size of active indexes to fit in your buffer pool, because if MySQL has to swap parts of indexes in and out of the buffer pool during SELECT queries, performance suffers.

If you use RANGE or LIST partitioning, then adding, dropping, merging, and splitting partitions is much more common. See http://dev.mysql.com/doc/refman/5.6/en/partitioning-management-range-list.html

I encourage you to read the manual section on partitioning, and also check out this nice presentation: Boost Performance With MySQL 5.1 Partitions.

#2


4  

It probably depends on the type of queries you plan on making often, and the best way to know for sure is to just implement a prototype of both and do some performance tests.

With that said, I would expect a single (large) table with an index to do better overall, because most DBMSs are heavily optimized for exactly this situation: finding and inserting data in large tables. If you make many little tables in the hope of improving performance, you're kind of fighting the optimizer (which usually does a better job).

Also, keep in mind that one table is probably more practical for the future. What if you want to get some aggregate statistics over all users? Having 20 000 tables would make this very hard and inefficient to execute. It's worth considering the flexibility of these schemas as well. If you partition your tables like that, you might be designing yourself into a corner for the future.

#3


1  

There is little to add to Bill Karwin's answer. But one hint: check whether all of a user's data is really needed in complete detail for the whole time range.

If you want to show usage statistics, visit counts, and the like, you usually don't need the granularity of single actions and seconds for, say, the year 2009 when looking back from today. So you could build aggregation tables and an archive table (not the ARCHIVE engine, of course), keeping recent data at the per-action level and only an overview of the older actions.

Old actions don't change, I think.

And you can still drill down from an aggregate into the details, for example via a week_id in the archive table.
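
As a rough illustration of the aggregation idea, the Python sketch below rolls hypothetical per-action rows up into one count per user and week. Everything in it is an assumption made for the example: the row layout, the sample data, and the choice of an ISO-week-based `week_id` encoded as year * 100 + week. In practice the rollup would more likely be done in SQL on the server.

```python
from collections import Counter
from datetime import datetime

# Hypothetical detail rows: (user_id, action, timestamp).
rows = [
    (1, "login", datetime(2009, 1, 5, 9, 30)),
    (1, "click", datetime(2009, 1, 6, 14, 2)),
    (1, "login", datetime(2009, 1, 13, 8, 15)),
    (2, "login", datetime(2009, 1, 6, 11, 45)),
]


def week_id(ts: datetime) -> int:
    """Encode the ISO year and ISO week number as one integer, e.g. 200902."""
    iso = ts.isocalendar()
    return iso[0] * 100 + iso[1]


def aggregate_weekly(detail_rows):
    """Count actions per (user_id, week_id); one entry per aggregate row."""
    return Counter((user_id, week_id(ts)) for user_id, _action, ts in detail_rows)


weekly = aggregate_weekly(rows)
for (user_id, wk), n_actions in sorted(weekly.items()):
    print(user_id, wk, n_actions)
```

The four detail rows collapse into three aggregate rows, and the (user_id, week_id) pair is the key you would use to look up the matching detail rows in the archive table.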

#4


0  

Instead of going from 1 table to 1 table per user, you can use partitioning to land on a number of tables and a table size somewhere in the middle.

You can also keep stats on users to try to move 'active' users into 1 table to reduce the number of tables that you have to access over time.

The bottom line is that there is a lot you can do, but largely you have to build prototypes and tests and just evaluate the performance impacts of various changes you are making.
