
Time: 2021-09-09 05:57:52

I have a fairly large InnoDB table which contains about 10 million rows (and counting, it is expected to become 20 times that size). Each row is not that large (131 B on average), but from time to time I have to delete a chunk of them, and that is taking ages. This is the table structure:


 CREATE TABLE `problematic_table` (
    `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
    `taxid` int(10) unsigned NOT NULL,
    `blastdb_path` varchar(255) NOT NULL,
    `query` char(32) NOT NULL,
    `target` int(10) unsigned NOT NULL,
    `score` double NOT NULL,
    `evalue` varchar(100) NOT NULL,
    `log_evalue` double NOT NULL DEFAULT '-999',
    `start` int(10) unsigned DEFAULT NULL,
    `end` int(10) unsigned DEFAULT NULL,
    PRIMARY KEY (`id`),
    KEY `taxid` (`taxid`),
    KEY `query` (`query`),
    KEY `target` (`target`),
    KEY `log_evalue` (`log_evalue`)
 ) ENGINE=InnoDB;

Queries that delete large chunks from the table are simply like this:


DELETE FROM problematic_table WHERE problematic_table.taxid = '57';

A query like this just took almost an hour to finish. I can imagine that the index rewriting overhead makes these queries very slow.


I am developing an application that will run on pre-existing databases. I most likely have no control over server variables unless I make changes to them mandatory (which I would prefer not to), so I'm afraid suggestions that change those are of little value.


I have tried using INSERT ... SELECT to copy the rows that I want to keep into a temporary table and then dropping the rest, but as the ratio of to-delete vs. to-keep shifts towards to-keep, this is no longer a useful solution.
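
For reference, the copy-and-swap approach described above might look roughly like the following. This is only a sketch: the names problematic_table_new and problematic_table_old are made up for illustration, and it uses a regular table plus RENAME TABLE rather than a true TEMPORARY table, which is an assumption about how the swap would be done.

```sql
-- Sketch only: problematic_table_new / problematic_table_old are hypothetical names.

-- Empty copy with the same columns and indexes.
CREATE TABLE problematic_table_new LIKE problematic_table;

-- Copy only the rows that should survive.
INSERT INTO problematic_table_new
SELECT * FROM problematic_table
WHERE taxid <> 57;

-- Swap the two tables in a single atomic statement.
RENAME TABLE problematic_table     TO problematic_table_old,
             problematic_table_new TO problematic_table;

-- Drop everything that was not copied.
DROP TABLE problematic_table_old;
```

The copy cost grows with the number of rows kept, which matches the observation that this stops paying off once most rows have to be kept. Rows inserted into the original table between the copy and the rename would be lost unless writes are paused.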


This is a table that may see frequent INSERTs and SELECTs in the future, but no UPDATEs. Basically, it's a logging and reference table that needs to drop parts of its content from time to time.


Could I improve my indexes on this table by limiting their length? Would switching to MyISAM help, which supports DISABLE KEYS during transactions? What else could I try to improve DELETE performance?


Edit: One such deletion would be on the order of one million rows.


3 Solutions



This solution can provide better performance once completed, but the process may take some time to implement.


A new BIT column can be added and defaulted to TRUE for "active" and FALSE for "inactive". If that's not enough states, you could use TINYINT with 256 possible values.


Adding this new column will probably take a long time, but once it's done, your updates should be much faster, as long as you drive them off the PRIMARY key and don't index the new column.


The reason InnoDB takes so long to DELETE from a table as large as yours has to do with how it stores and indexes rows. The table is physically organized by its clustered index: the PRIMARY key, or failing that the first UNIQUE index with only NOT NULL columns, or failing that a hidden row ID. A DELETE does not reorder the whole table, but for every row it has to delete-mark the record in the clustered index and the matching entry in each secondary index (four of them here), write undo information, and leave the actual removal to the background purge. So it's not locating the rows that takes so long; it's the index and logging work that follows each removed row.


When you create a fixed-width column and update that instead of deleting, the row is modified in place: its size does not change, and as long as the new column is not indexed, no secondary index has to be touched.


During off hours, a single DELETE can be used to remove the unnecessary rows. This operation will still be slow but collectively much faster than deleting individual rows.
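
A minimal sketch of the flag-column idea, assuming the column is called `active` (the name is not from the answer) and using the taxid filter from the question:

```sql
-- Sketch only: the column name `active` is an assumption.
-- One-time schema change; expect this to take a while on a large table.
ALTER TABLE problematic_table
    ADD COLUMN `active` BIT(1) NOT NULL DEFAULT b'1';

-- "Delete" a chunk by flagging it instead of removing it.
UPDATE problematic_table
SET `active` = b'0'
WHERE taxid = 57;

-- Readers have to skip flagged rows explicitly.
SELECT id, query, target, score
FROM problematic_table
WHERE taxid = 57
  AND `active` = b'1';

-- During off hours: physically remove everything that was flagged.
-- `active` is not indexed, so this scans the whole table.
DELETE FROM problematic_table
WHERE `active` = b'0';
```

The trade-off is that every existing SELECT needs the extra `active` condition, and the flagged rows keep occupying space until the off-hours DELETE runs.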




I had a similar scenario: a table with 2 million rows and a DELETE statement that should remove around 100 thousand of them. It took around 10 minutes.


After I checked the configuration, I found that MySQL Server was running with default innodb_buffer_pool_size = 8 MB (!).


After restarting with innodb_buffer_pool_size = 1.5 GB, the same scenario took 10 seconds.


So it looks like performance depends heavily on whether the data and index pages touched by the delete fit in the buffer pool.
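
To check whether this applies to a given server, something like the following could be used. It is a rough sketch: the table name is taken from the question, and the sizes reported by information_schema are estimates for InnoDB.

```sql
-- Current buffer pool size, in MB.
SELECT @@innodb_buffer_pool_size / 1024 / 1024 AS buffer_pool_mb;

-- Approximate data and index size of the table, in MB.
SELECT table_name,
       ROUND(data_length  / 1024 / 1024) AS data_mb,
       ROUND(index_length / 1024 / 1024) AS index_mb
FROM information_schema.TABLES
WHERE table_schema = DATABASE()
  AND table_name = 'problematic_table';

-- Raising the value is a server setting, for example in my.cnf under [mysqld]:
--   innodb_buffer_pool_size = 1536M
```

If data_mb plus index_mb is far larger than buffer_pool_mb, a large DELETE is likely to be dominated by disk reads. Changing the setting requires control over the server configuration, which the question says may not be available.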




I solved a similar problem by using a stored procedure, thereby improving performance by a factor of several thousand.


My table had 33M rows and several indexes and I wanted to delete 10K rows. My DB was in Azure with no control over innodb_buffer_pool_size.


For simplicity I created a table tmp_id with only a primary id field:


CREATE TABLE `tmp_id` (
    `id` bigint(20) NOT NULL DEFAULT '0',
    PRIMARY KEY (`id`)
);

I inserted the set of ids I wanted to delete into tmp_id and ran delete from my_table where id in (select id from tmp_id); this did not complete in 12 hours, so I tried again with only a single id in tmp_id, and it still took 25 minutes. Running delete from my_table where id = 1234 completed in a few milliseconds, so I decided to try doing that in a procedure instead:


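-- DELIMITER is a mysql client command; it keeps the semicolons in the body from ending the statement early
DELIMITER $$

CREATE PROCEDURE `delete_ids_in_tmp`()
BEGIN
    -- set to 1 by the handler once the cursor has no more rows
    declare finished integer default 0;
    declare v_id bigint(20);
    declare cur1 cursor for select id from tmp_id;
    declare continue handler for not found set finished=1;
    open cur1;
    igmLoop: loop
        fetch cur1 into v_id;
        if finished = 1 then leave igmLoop; end if;
        -- single-row delete by primary key
        delete from problematic_table where id = v_id;
    end loop igmLoop;
    close cur1;
END$$

DELIMITER ;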

Calling delete_ids_in_tmp() then deleted all 10K rows in less than a minute.
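
Applied to the table from the question, usage might look like the sketch below. The taxid filter comes from the question. The multi-table DELETE at the end is not part of the answer above; it is an untested alternative, based on the assumption that the slowness of the IN (subquery) form comes from the subquery being evaluated once per row, which a join may avoid.

```sql
-- Collect the ids to delete.
INSERT INTO tmp_id (id)
SELECT id FROM problematic_table WHERE taxid = 57;

-- Option A: row-by-row delete through the procedure.
CALL delete_ids_in_tmp();

-- Option B (alternative to the CALL, untested assumption):
-- join against tmp_id instead of using IN (subquery).
DELETE t
FROM problematic_table AS t
JOIN tmp_id AS d ON d.id = t.id;

-- Empty the helper table for the next run.
TRUNCATE TABLE tmp_id;
```

Only one of the two options is needed; if both are run, the second simply finds nothing left to delete.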



