从另一个大mysql db表中更新列(700万行)

时间:2022-10-06 12:42:17

Description

I have 2 tables with the following structure (irrelevant columns removed):

我有两个表,结构如下(不相关的列被删除):

mysql> explain parts;
+-------------+--------------+------+-----+---------+-------+
| Field       | Type         | Null | Key | Default | Extra |
+-------------+--------------+------+-----+---------+-------+
| code        | varchar(32)  | NO   | PRI | NULL    |       |
| slug        | varchar(255) | YES  |     | NULL    |       |
| title       | varchar(64)  | YES  |     | NULL    |       |
+-------------+--------------+------+-----+---------+-------+
4 rows in set (0.00 sec)

and

mysql> explain details;
+-------------------+--------------+------+-----+---------+-------+
| Field             | Type         | Null | Key | Default | Extra |
+-------------------+--------------+------+-----+---------+-------+
| sku               | varchar(32)  | NO   | PRI | NULL    |       |
| description       | varchar(700) | YES  |     | NULL    |       |
| part_code         | varchar(32)  | NO   | PRI |         |       |
+-------------------+--------------+------+-----+---------+-------+
3 rows in set (0.00 sec)

Table parts contains 184147 rows, and details contains 7278870 rows. The part_code column from details represents the code column from the parts table. Since these columns are varchar, I want to add the column id int(11) to parts, and part_id int(11) to details. I tried this:

表部件包含184147行,细节包含7278870行。details中的part_code列表示parts表中的代码列。由于这些列是varchar,所以我想将列id int(11)添加到part中,并将part_id int(11)添加到details中。我试着这样的:

mysql> alter table parts drop primary key;
Query OK, 184147 rows affected (0.66 sec)
Records: 184147  Duplicates: 0  Warnings: 0

mysql> alter table parts add column
       id int(11) not null auto_increment primary key first;
Query OK, 184147 rows affected (0.55 sec)
Records: 184147  Duplicates: 0  Warnings: 0

mysql> select id, code from parts limit 5;
+----+-------------------------+
| id | code                    |
+----+-------------------------+
|  1 | Yhk0KqSMeLcfH1KEfykihQ2 |
|  2 | IMl4iweZdmrBGvSUCtMCJA2 |
|  3 | rAKZUDj1WOnbkX_8S8mNbw2 |
|  4 | rV09rJ3X33-MPiNRcPTAwA2 |
|  5 | LPyIa_M_TOZ8655u1Ls5mA2 |
+----+-------------------------+
5 rows in set (0.00 sec)

So now I have the id column with correct data in parts table. After adding part_id column to details table:

现在我有了id列,在part表中有正确的数据。将part_id列添加到details表后:

mysql> alter table details add column part_id int(11) not null after part_code;
Query OK, 7278870 rows affected (1 min 17.74 sec)
Records: 7278870  Duplicates: 0  Warnings: 0

Now the big problem is how to update part_id accordingly? The following query:

现在最大的问题是如何相应地更新part_id ?以下查询:

mysql> update details d
       join parts p on d.part_code = p.code
       set d.part_id = p.id;

was running for about 30 hours until I killed it.

我跑了大约30个小时直到我杀了它。

Note that both tables are MyISAM:

注意,这两个表都是MyISAM:

mysql> select engine from information_schema.tables where table_schema = 'db_name' and (table_name = 'parts' or table_name = 'details');
+--------+
| ENGINE |
+--------+
| MyISAM |
| MyISAM |
+--------+
2 rows in set (0.01 sec)

I just now realized that one of the problems was that dropping the key on parts table I dropped the index on the code column. On the other side, I have the following indexes on details table (some irrelevant columns are omitted):

我刚刚意识到其中的一个问题是我把关键字掉在了零件表上,我把索引掉在了代码列上。另一方面,我在details表上有以下索引(省略了一些无关的列):

mysql> show indexes from details;
+---------+------------+----------+--------------+-------------+-----------+-------------+------------+
| Table   | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Index_type |
+---------+------------+----------+--------------+-------------+-----------+-------------+------------+
| details |          0 | PRIMARY  |            1 | sku         | A         |        NULL | BTREE      |
| details |          0 | PRIMARY  |            3 | part_code   | A         |     7278870 | BTREE      |
+---------+------------+----------+--------------+-------------+-----------+-------------+------------+
2 rows in set (0.00 sec)

My questions are:

  1. Is the update query OK or it can be optimized somehow?
  2. 更新查询是否可以,或者可以对其进行优化?
  3. I will add the index on the code column in parts table, will the query run in a reasonable time, or it will run for days again?
  4. 我将在parts表的code列中添加索引,查询会在合理的时间内运行,还是会再运行几天?
  5. How can I make a (sql/bash/php) script so I can see the progress of the query execution?
  6. 如何创建一个(sql/bash/php)脚本,以便查看查询执行的进度?

Thank you very much!

非常感谢!

3 个解决方案

#1


3  

As I mentioned in the question, I forgot about the dropped indexes on the parts table, so I added them:

正如我在问题中提到的,我忘记了零件表上掉下来的索引,所以我添加了:

alter table parts add key code (code);

Inspired by Puggan Se's answer, I tried to use a LIMIT on UPDATE in a PHP script, but LIMIT can't be used with an UPDATE with JOIN in MySQL. To limit the query I added a new column to the details table:

受到Puggan Se答案的启发,我尝试在PHP脚本中对更新使用限制,但是限制不能用于与MySQL中的JOIN进行更新。为了限制查询,我在details表中添加了一个新列:

# drop the primary key,
alter table details drop primary key;
# so I can create an auto_increment column
alter table details add id int not null auto_increment primary key;
# alter the id column and remove the auto_increment
alter table details change id id int not null;
# drop again the primary key
alter table details drop primary key;
# add new indexes
alter table details add primary key ( id, sku, num, part_code );

Now I can use the "limit":

现在我可以用“极限”了:

update details d
join parts p on d.part_code = p.code
set d.part_id = p.id
where d.id between 1 and 5000;

So here's the full PHP script:

下面是完整的PHP脚本:

$started = time();
$i = 0;
$total = 7278870;

echo "Started at " . date('H:i:s', $started) . PHP_EOL;

function timef($s){
    $h = round($s / 3600);
    $h = str_pad($h, 2, '0', STR_PAD_LEFT);
    $s = $s % 3600;
    $m = round( $s / 60);
    $m = str_pad($m, 2, '0', STR_PAD_LEFT);
    $s = $s % 60;
    $s = str_pad($s, 2, '0', STR_PAD_LEFT);
    return "$h:$m:$s";
}

while (1){
    $i++;
    $j = $i * 5000;
    $k = $j + 4999;
    $result = mysql_query("
        update details d
        join parts p on d.part_code = p.code
        set d.part_id = p.id
        where d.id between $j and $k
    ");
    if(!$result) die(mysql_error());
    if(mysql_affected_rows() == 0) die(PHP_EOL . 'Done!');
    $p = round(($i * 5000) / $total, 4) * 100;
    $s = time() - $started;
    $ela = timef($s);
    $eta = timef( (( $s / $p ) * 100) - $s );
    $eq = floor($p/10);
    $show_gt = ($p == 100);
    $spaces = $show_gt ? 9 - $eq : 10 - $eq;
    echo "\r {$p}% | [" . str_repeat('=', $eq) . ( $show_gt ? '' : '>' ) . str_repeat(' ', $spaces) . "] | Elapsed: ${ela} | ETA: ${eta}";
}

And here's a screenshot:

这是一个截图:

从另一个大mysql db表中更新列(700万行)

As you can see, the whole thing took less than 5 minutes :) Thank you all!

正如你所看到的,整个过程只用了不到5分钟:)谢谢大家!

P.S.: There's still a bug because I found later 4999 rows left with part_id = 0, but I did that manually already.

注::仍然有一个bug,因为我后来找到了part_id = 0的4999行,但是我已经手动完成了。

#2


1  

  1. You may want to add a where and a limit, so you can update it in chunks

    您可能想要添加一个where和一个limit,以便您可以以块的形式更新它

    update details d
    join parts p on d.part_code = p.code
    set d.part_id = p.id
    WHERE d.part_id =0
    LIMIT 5000;
    
  2. it will be alot faster whit index, and if you do one query as sugesten in '1' above, you can se how long 5000 rows takes to handle

    它将比whit索引快得多,如果您在上面的“1”中做一个sugesten查询,您可以知道要处理5000行需要多长时间

  3. loop above query

    以上循环查询

    while(TRUE)
    {
        $result = mysql_query($query);
        if(!$result) die('Failed: ' . mysql_error());
        if(mysql_affected_rows() == 0) die('Done');
        echo '.';
    }
    

EDIT 1 rewrite the query do to limit error on joins

编辑1重写查询do以限制连接上的错误

You can use a subquery to avoid the multiple tables update:

可以使用子查询来避免多个表的更新:

UPDATE details
SET part_id = (SELECT id FROM parts WHERE parts.code = details.part_code)
WHERE part_id = 0
LIMIT 5000;

#3


0  

You can try to remove the indexes form the table you're trying to update. MySQL recreates the indexes on each row update. It won't be blazing fast for 7M records.

您可以尝试从要更新的表中删除索引。MySQL在每一行更新中重新创建索引。对于700万项记录来说,这不会很快。

#1


3  

As I mentioned in the question, I forgot about the dropped indexes on the parts table, so I added them:

正如我在问题中提到的,我忘记了零件表上掉下来的索引,所以我添加了:

alter table parts add key code (code);

Inspired by Puggan Se's answer, I tried to use a LIMIT on UPDATE in a PHP script, but LIMIT can't be used with an UPDATE with JOIN in MySQL. To limit the query I added a new column to the details table:

受到Puggan Se答案的启发,我尝试在PHP脚本中对更新使用限制,但是限制不能用于与MySQL中的JOIN进行更新。为了限制查询,我在details表中添加了一个新列:

# drop the primary key,
alter table details drop primary key;
# so I can create an auto_increment column
alter table details add id int not null auto_increment primary key;
# alter the id column and remove the auto_increment
alter table details change id id int not null;
# drop again the primary key
alter table details drop primary key;
# add new indexes
alter table details add primary key ( id, sku, num, part_code );

Now I can use the "limit":

现在我可以用“极限”了:

update details d
join parts p on d.part_code = p.code
set d.part_id = p.id
where d.id between 1 and 5000;

So here's the full PHP script:

下面是完整的PHP脚本:

$started = time();
$i = 0;
$total = 7278870;

echo "Started at " . date('H:i:s', $started) . PHP_EOL;

function timef($s){
    $h = round($s / 3600);
    $h = str_pad($h, 2, '0', STR_PAD_LEFT);
    $s = $s % 3600;
    $m = round( $s / 60);
    $m = str_pad($m, 2, '0', STR_PAD_LEFT);
    $s = $s % 60;
    $s = str_pad($s, 2, '0', STR_PAD_LEFT);
    return "$h:$m:$s";
}

while (1){
    $i++;
    $j = $i * 5000;
    $k = $j + 4999;
    $result = mysql_query("
        update details d
        join parts p on d.part_code = p.code
        set d.part_id = p.id
        where d.id between $j and $k
    ");
    if(!$result) die(mysql_error());
    if(mysql_affected_rows() == 0) die(PHP_EOL . 'Done!');
    $p = round(($i * 5000) / $total, 4) * 100;
    $s = time() - $started;
    $ela = timef($s);
    $eta = timef( (( $s / $p ) * 100) - $s );
    $eq = floor($p/10);
    $show_gt = ($p == 100);
    $spaces = $show_gt ? 9 - $eq : 10 - $eq;
    echo "\r {$p}% | [" . str_repeat('=', $eq) . ( $show_gt ? '' : '>' ) . str_repeat(' ', $spaces) . "] | Elapsed: ${ela} | ETA: ${eta}";
}

And here's a screenshot:

这是一个截图:

从另一个大mysql db表中更新列(700万行)

As you can see, the whole thing took less than 5 minutes :) Thank you all!

正如你所看到的,整个过程只用了不到5分钟:)谢谢大家!

P.S.: There's still a bug because I found later 4999 rows left with part_id = 0, but I did that manually already.

注::仍然有一个bug,因为我后来找到了part_id = 0的4999行,但是我已经手动完成了。

#2


1  

  1. You may want to add a where and a limit, so you can update it in chunks

    您可能想要添加一个where和一个limit,以便您可以以块的形式更新它

    update details d
    join parts p on d.part_code = p.code
    set d.part_id = p.id
    WHERE d.part_id =0
    LIMIT 5000;
    
  2. it will be alot faster whit index, and if you do one query as sugesten in '1' above, you can se how long 5000 rows takes to handle

    它将比whit索引快得多,如果您在上面的“1”中做一个sugesten查询,您可以知道要处理5000行需要多长时间

  3. loop above query

    以上循环查询

    while(TRUE)
    {
        $result = mysql_query($query);
        if(!$result) die('Failed: ' . mysql_error());
        if(mysql_affected_rows() == 0) die('Done');
        echo '.';
    }
    

EDIT 1 rewrite the query do to limit error on joins

编辑1重写查询do以限制连接上的错误

You can use a subquery to avoid the multiple tables update:

可以使用子查询来避免多个表的更新:

UPDATE details
SET part_id = (SELECT id FROM parts WHERE parts.code = details.part_code)
WHERE part_id = 0
LIMIT 5000;

#3


0  

You can try to remove the indexes form the table you're trying to update. MySQL recreates the indexes on each row update. It won't be blazing fast for 7M records.

您可以尝试从要更新的表中删除索引。MySQL在每一行更新中重新创建索引。对于700万项记录来说,这不会很快。