How do I optimize a large table in MySQL, and when can I benefit from partitioning?

Date: 2022-09-16 11:21:38

In summary, date range partitioning and memory configuration achieved my goal.

I needed to increase the memory allocated to innodb_buffer_pool_size, as the default 8M was far too low. Rick James recommends 70% of RAM for this setting, and his site has a lot of great information.

Edlerd was correct with both suggestions :-)

I split my data into monthly partitions and then ran a 6,000-row query which originally took between 6 and 12 seconds. It now completes in less than a second (.984/.031). I ran this using the default InnoDB buffer size (innodb_buffer_pool_size = 8M) to make sure it wasn't just the memory increase.

I then set innodb_buffer_pool_size = 4G and ran the query with an even better response of .062/.032.

I'd also like to mention that increasing the memory has also improved the overall speed of my web application and the service which receives and writes messages to this table; I am astounded at how much of a difference this configuration setting has made. The Time To First Byte (TTFB) from my web server, which at times would reach 20 seconds, is now almost on par with MySQL Workbench.

I also found that the slow query log file was an excellent tool for identifying issues; it was there that I saw the suggestion that my innodb_buffer_pool_size was low, and it highlighted all the poorly performing queries. This also identified areas where I needed to index other tables.
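
For reference, the slow query log can be switched on at runtime; the threshold and options below are illustrative, not the only sensible values:

```sql
-- Enable the slow query log at runtime (illustrative settings):
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;                    -- log statements slower than 1 second
SET GLOBAL log_queries_not_using_indexes = 'ON';   -- also flag unindexed queries
```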

EDIT 2016-11-12 SOLUTION

I am in the process of refactoring a large table that logs telemetry data. It has been running for about 4-5 months and has generated approx. 54 million records with an average row size of approx. 380 bytes.

I have started to see some performance lag on one of my raw data queries that returns all logs for a device over a 24 hour period.

Initially I thought it was indexing, but I now think it is the amount of I/O that needs to be processed by MySQL. A typical 24-hour query would contain 3k to 9k records, and I'd actually like to support an export of about 7 days.

I am not experienced in database performance tuning so still just learning the ropes. I am considering a few strategies.

  1. Tweak compound indexes according to the raw-data query, although I think my indexes are OK as the explain plan is showing a 100% hit rate.
  2. Consider creating a covering index to include all columns needed.
  3. Implement range partitioning by date: a) keep monthly partitions, e.g. the last 6 months; b) move anything older to an archive table.
  4. Create a separate table (vertical partitioning) with the raw data and join it with the IDs of the primary query table. Not sure this is my problem, as my indexes are working.
  5. Change my queries to pull data in batches with limits: order by created date with LIMIT X and carry on until no more records are returned.
  6. Review server configuration.
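
For the batching strategy, one common pattern is keyset pagination on the clustered key. A sketch, where @LAST_SEEN_ID is a hypothetical session variable (0 on the first request):

```sql
-- Fetch the next batch after the last row already received:
SELECT ml.db_id, ml.created, ml.device_id   -- plus whatever other columns are needed
FROM message_log ml
WHERE ml.device_id = @IMEI
  AND ml.created BETWEEN @STARTDATE AND DATE_ADD(@STARTDATE, INTERVAL 24 HOUR)
  AND ml.db_id > @LAST_SEEN_ID
ORDER BY ml.db_id
LIMIT 1000;
```

The client repeats the query with @LAST_SEEN_ID set to the last db_id returned, until a batch comes back short.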

1,2 (INDEXES): I’ll rework my indexes with my queries, but I think I am good here as Explain is showing 100% hit, unless I am reading this wrong.

I'll try a covering index when they are rebuilt, but how do I determine the knock-on effects of a bad choice, e.g. compromised insert speeds?

How would I best monitor the performance of my table in a live environment?

EDIT: I've just started using the slow query log file, which looks like a good tool for finding issues, and I suppose a query on the performance_schema might be another option?
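
As a sketch, the statement digest summary in performance_schema (available since MySQL 5.6) can surface the most expensive query shapes; timer values are in picoseconds:

```sql
-- Top 10 statement digests by total time spent:
SELECT DIGEST_TEXT,
       COUNT_STAR            AS executions,
       SUM_TIMER_WAIT / 1e12 AS total_seconds
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 10;
```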

3 (PARTITIONING): I have read a bit about partitions and am not sure whether the size of my data would make much of a difference.

Rick James suggests >1M records; I'm at 54M and would like to keep around 300M prior to archiving. Is my table complex enough to benefit?

I have to test this out myself, as I do not have experience with any of this stuff and it is all theoretical to me. I just don't want to go down this path if it isn't suitable for my needs.

4 (Vertical partitioning via ‘joined’ detail table): I don't think I am having table scan issues, and I need all rows, so I am not sure this technique would be of benefit.

5 (Use limits and fetch again): Would this free up the server if I used less of its time in a single request? Would I see better I/O throughput at the cost of more commands on the same connection?

6 (Review Config): The other piece would be to review the default non-developer configuration that is used when you install MySQL; perhaps there are some settings that can be adjusted? :-)

Thanks for reading, keen to hear any and all suggestions.

The following FYI:

TABLE:

CREATE TABLE `message_log` (
    `db_id` int(10) unsigned NOT NULL AUTO_INCREMENT,
    `db_created` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
    `created` datetime DEFAULT NULL,
    `device_id` int(10) unsigned NOT NULL,
    `display_name` varchar(50) DEFAULT NULL,
    `ignition` binary(1) DEFAULT NULL COMMENT 'This is actually IO8 from the falcom device',
    `sensor_a` float DEFAULT NULL,
    `sensor_b` float DEFAULT NULL,
    `lat` double DEFAULT NULL COMMENT 'default GPRMC format ddmm.mmmm \n',
    `lon` double DEFAULT NULL COMMENT 'default GPRMC longitude format dddmm.mmmm ',
    `heading` float DEFAULT NULL,
    `speed` float DEFAULT NULL,
    `pos_validity` char(1) DEFAULT NULL,
    `device_temp` float DEFAULT NULL,
    `device_volts` float DEFAULT NULL,
    `satellites` smallint(6) DEFAULT NULL, /* TINYINT will suffice */
    `navdist` double DEFAULT NULL,
    `navdist2` double DEFAULT NULL,
    `IO0` binary(1) DEFAULT NULL COMMENT 'Duress',
    `IO1` binary(1) DEFAULT NULL COMMENT 'Fridge On/Off',
    `IO2` binary(1) DEFAULT NULL COMMENT 'Not mapped',
    `msg_name` varchar(20) DEFAULT NULL, /* Will be removed */
    `msg_type` varchar(16) DEFAULT NULL, /* Will be removed */
    `msg_id` smallint(6) DEFAULT NULL,
    `raw` text, /* Not needed in primary query, considering adding to single table mapped to this ID or a UUID correlation ID to save on @ROWID query */
PRIMARY KEY (`db_id`),
KEY `Name` (`display_name`),
KEY `Created` (`created`),
KEY `DeviceID_AND_Created` (`device_id`,`created`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

DeviceID_AND_Created is the main index. I need the PK clustered index because I am using the record ID in a summary table that keeps track of the last message for a given device. Created would be the partition column, so I guess that would also be added to the PK cluster?

QUERY:

SELECT 
    ml.db_id, ml.db_created, ml.created, ml.device_id, ml.display_name, bin(ml.ignition) as `ignition`, 
    bin(ml.IO0) as `duress`, bin(ml.IO1) as `fridge`,ml.sensor_a, ml.sensor_b, ml.lat, ml.lon, ml.heading, 
    ml.speed,ml.pos_validity, ml.satellites, ml.navdist2, ml.navdist,ml.device_temp, ml.device_volts,ml.msg_id
FROM message_log ml 
WHERE ml.device_id = @IMEI
AND ml.created BETWEEN @STARTDATE AND DATE_ADD(@STARTDATE,INTERVAL 24 hour) 
ORDER BY ml.db_id;

This returns all logs for a given 24-hour period, which at the moment is approx. 3k to 9k rows with an average row size of 381 bytes, and will be reduced once I remove one of the TEXT fields (raw).

3 Answers

#1

Implement ranged partitioning by date: a) Keep monthly partitions. E.g. last 6 months b) Move anything older to archive table.

This is a very good idea. I guess all the writes will be in the newest partition and you will query recent data only. You always want a situation where your data and indexes fit in memory, so there is no disk I/O on reads.

Depending on your use case it might even be wise to have one partition per week. Then you only have to keep max two weeks of data in memory for reading the last 7 days.
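
A weekly scheme might look like the following sketch; the partition names and Monday boundaries are assumptions, not part of the original answer:

```sql
ALTER TABLE message_log
PARTITION BY RANGE (TO_DAYS(created)) (
    PARTITION w2016w45 VALUES LESS THAN (TO_DAYS('2016-11-14')),
    PARTITION w2016w46 VALUES LESS THAN (TO_DAYS('2016-11-21')),
    PARTITION future   VALUES LESS THAN (MAXVALUE)
);
```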

You might also want to tune your buffer sizes: innodb_buffer_pool_size if you are using InnoDB as the engine, or the key cache (key_buffer_size) when using MyISAM.

Adding RAM to the DB machine usually helps too, as the OS can then keep the data files in memory.

If you have heavy writes, you can also tune other options (e.g. innodb_log_buffer_size, which affects how often writes are persisted to disk). This lets dirty pages stay in memory longer, avoiding writing them back to disk too often.

#2

For those who are curious, the following is what I used to create my partition and configure memory.

Creating the partitions

  1. Updated PK to include the range column used in partition

    ALTER TABLE message_log 
    CHANGE COLUMN created created DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (db_id, created);
    
  2. Added the partitions using ALTER TABLE.

In hindsight, I should have created each partition with its own ALTER statement, using REORGANIZE PARTITION for the subsequent ones, as doing it all in one hit consumed a lot of resources and time.

ALTER TABLE message_log 
PARTITION BY RANGE (TO_DAYS(created)) (
    PARTITION invalid    VALUES LESS THAN (0),
    PARTITION from201607 VALUES LESS THAN (TO_DAYS('2016-08-01')),
    PARTITION from201608 VALUES LESS THAN (TO_DAYS('2016-09-01')),
    PARTITION from201609 VALUES LESS THAN (TO_DAYS('2016-10-01')),
    PARTITION from201610 VALUES LESS THAN (TO_DAYS('2016-11-01')),
    PARTITION from201611 VALUES LESS THAN (TO_DAYS('2016-12-01')),
    PARTITION from201612 VALUES LESS THAN (TO_DAYS('2017-01-01')),
    PARTITION from201701 VALUES LESS THAN (TO_DAYS('2017-02-01')),
    PARTITION from201702 VALUES LESS THAN (TO_DAYS('2017-03-01')),
    PARTITION from201703 VALUES LESS THAN (TO_DAYS('2017-04-01')),
    PARTITION from201704 VALUES LESS THAN (TO_DAYS('2017-05-01')),
    PARTITION future     VALUES LESS THAN (MAXVALUE) 
);
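
Going forward, a new monthly partition can be carved out of the catch-all future partition with REORGANIZE PARTITION. A sketch (the month shown is hypothetical):

```sql
ALTER TABLE message_log REORGANIZE PARTITION future INTO (
    PARTITION from201705 VALUES LESS THAN (TO_DAYS('2017-06-01')),
    PARTITION future VALUES LESS THAN (MAXVALUE)
);
```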

NOTE: I am not sure if using to_days() or the raw column makes much difference, but I've seen it used in most examples so I've taken it on as an assumed best practice.

Setting the buffer pool size

To change the value of innodb_buffer_pool_size, see MySQL InnoDB Buffer Pool Resize and Rick James's page on memory.

You can also do it in MySQL Workbench via the options file menu, under the InnoDB tab. Any changes you make there are written to the config file, but you'll need to restart MySQL for the configuration to be read; alternatively, you can set the global value to apply it live.
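
Setting it live might look like this; note that resizing the buffer pool without a restart requires MySQL 5.7.5 or later:

```sql
SET GLOBAL innodb_buffer_pool_size = 4 * 1024 * 1024 * 1024;  -- 4G
```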

#3

Such a deal! I get 4 mentions, even without writing a comment or answer. I'm writing an answer because I may have some further improvements...

Yes, PARTITION BY RANGE(TO_DAYS(...)) is the right way to go. (There may be a small number of alternatives.)

70% of 4GB of RAM is tight. Be sure there is no swapping.

You mentioned one query. If it is the main one of concern, then this would be slightly better:

PRIMARY KEY(device_id, created, db_id),  -- desired rows will be clustered
INDEX(db_id)  -- to keep AUTO_INCREMENT happy

If you are not purging old data, then the above key suggestion provides just as much efficiency even without partitioning.
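
A sketch of that key change in one statement (it assumes created has already been made NOT NULL, as a primary key column must be; test on a copy of the table first):

```sql
ALTER TABLE message_log
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (device_id, created, db_id),
    ADD INDEX (db_id);   -- keeps the AUTO_INCREMENT column indexed
```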

For the lat/lon representation, DOUBLE is overkill.
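
For example, if the values were stored as decimal degrees, DECIMAL would give roughly 16 cm precision in 9 bytes per pair instead of 16. This is a sketch only; the column comments say the table actually stores GPRMC ddmm.mmmm values, which would need wider precision:

```sql
-- Sketch: assumes lat/lon hold decimal degrees, not GPRMC ddmm.mmmm.
ALTER TABLE message_log
    MODIFY lat DECIMAL(8,6),
    MODIFY lon DECIMAL(9,6);
```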

Beware of the inefficiency of UUID, especially for huge tables.
