Slow simple UPDATE query on a PostgreSQL database with 3 million rows

Date: 2023-02-03 23:06:26

I am trying a simple UPDATE table SET column1 = 0 on a table with ~3 million rows on Postgres 8.4, but it is taking forever to finish. It has been running for more than 10 minutes now in my last attempt.


Before that, I tried running VACUUM and ANALYZE on the table, and I also tried creating some indexes (although I doubt this will make any difference in this case), but none of it seems to help.


Any other ideas?


Thanks, Ricardo


Update:


This is the table structure:


CREATE TABLE myTable
(
  id bigserial NOT NULL,
  title text,
  description text,
  link text,
  "type" character varying(255),
  generalFreq real,
  generalWeight real,
  author_id bigint,
  status_id bigint,
  CONSTRAINT resources_pkey PRIMARY KEY (id),
  CONSTRAINT author_pkey FOREIGN KEY (author_id)
      REFERENCES users (id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE NO ACTION,
  CONSTRAINT c_unique_status_id UNIQUE (status_id)
);

I am trying to run UPDATE myTable SET generalFreq = 0;


8 Answers

#1


10  

Take a look at this answer: PostgreSQL slow on a large table with arrays and lots of updates


First, start with a better FILLFACTOR, do a VACUUM FULL to force a table rewrite, and check the HOT updates after your UPDATE query:


SELECT n_tup_hot_upd, * FROM pg_stat_user_tables WHERE relname = 'myTable';

HOT updates are much faster when you have a lot of records to update. More information about HOT can be found in this article.


P.S. You need version 8.3 or newer.

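A hedged sketch of that sequence on the question's table (the FILLFACTOR value is illustrative; lower values leave more free space per page, which is what makes HOT updates possible):

```sql
-- Leave ~30% free space in each page so updated row versions
-- can stay on the same page (a prerequisite for HOT updates).
ALTER TABLE myTable SET (FILLFACTOR = 70);

-- Rewrite the table so the new fillfactor takes effect.
VACUUM FULL myTable;

-- Run the update, then check how many of the updates were HOT.
UPDATE myTable SET generalFreq = 0;
SELECT n_tup_upd, n_tup_hot_upd
FROM pg_stat_user_tables
WHERE relname = 'mytable';  -- unquoted names are stored lower-case
```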

#2


23  

I have to update tables of 1 or 2 billion rows, with various values for each row. Each run makes ~100 million changes (10%). My first try was to group them into transactions of 300K updates directly on a specific partition, since PostgreSQL does not always optimize prepared queries if you use partitions.


  1. Transactions of a bunch of "UPDATE myTable SET myField=value WHERE myId=id":
     gives 1,500 updates/sec, which means each run would take at least 18 hours.
  2. The HOT-updates solution as described here, with FILLFACTOR=50: gives 1,600 updates/sec. I use SSDs, so it's a costly improvement, as it doubles the storage size.
  3. Insert into a temporary table of updated values and merge them afterwards with UPDATE...FROM: gives 18,000 updates/sec if I do a VACUUM for each partition; 100,000 updates/sec otherwise. Cooool.
     Here is the sequence of operations:

CREATE TEMP TABLE tempTable (id BIGINT NOT NULL, field(s) to be updated,
CONSTRAINT tempTable_pkey PRIMARY KEY (id));

Accumulate a bunch of updates in a buffer, sized according to the available RAM. When the buffer is full, when you need to switch to another table/partition, or when you are done, run:


COPY tempTable FROM buffer;
UPDATE myTable a SET field(s)=value(s) FROM tempTable b WHERE a.id=b.id;
COMMIT;
TRUNCATE TABLE tempTable;
VACUUM FULL ANALYZE myTable;

That means a run now takes 1.5 h instead of 18 h for 100 million updates, vacuum included.

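Applied to the question's single-column update, the same pattern might look like the following sketch (the CSV path is hypothetical; in practice the batch usually arrives via COPY FROM STDIN from the client application):

```sql
-- Staging table holding one batch of (id, new value) pairs.
CREATE TEMP TABLE tempTable (
  id BIGINT NOT NULL,
  generalFreq REAL,
  CONSTRAINT tempTable_pkey PRIMARY KEY (id)
);

-- Bulk-load the batch (hypothetical file prepared by the client).
COPY tempTable FROM '/path/to/batch.csv' WITH (FORMAT csv);

-- Merge the whole batch into the target table in one statement.
UPDATE myTable a
SET generalFreq = b.generalFreq
FROM tempTable b
WHERE a.id = b.id;

-- Empty the staging table for the next batch.
TRUNCATE TABLE tempTable;
```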

#3


7  

After waiting 35 min. for my UPDATE query to finish (and it still hadn't), I decided to try something different. So what I did was run this command:


CREATE TABLE table2 AS 
SELECT 
  all the fields of table1 except the one I wanted to update, 0 as theFieldToUpdate
from myTable

Then I added the indexes, dropped the old table, and renamed the new one to take its place. That took only 1.7 min. to process, plus some extra time to recreate the indexes and constraints. But it did help! :)


Of course, that only worked because nobody else was using the database. I would need to lock the table first if this were in a production environment.

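A hedged sketch of that approach for the question's table (the new table name is arbitrary; only the primary key is re-added here):

```sql
BEGIN;
-- Copy every column except generalFreq, which is rewritten as 0.
CREATE TABLE myTable_new AS
SELECT id, title, description, link, "type",
       0::real AS generalFreq,
       generalWeight, author_id, status_id
FROM myTable;

-- Recreate constraints on the copy, then swap the tables.
ALTER TABLE myTable_new ADD CONSTRAINT resources_pkey PRIMARY KEY (id);
DROP TABLE myTable;
ALTER TABLE myTable_new RENAME TO myTable;
COMMIT;
```

Note that CREATE TABLE AS does not carry over the bigserial default, the foreign key, or the unique constraint; those have to be re-added by hand, as the answer says.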

#4


3  

Today I've spent many hours on a similar issue. I've found a solution: drop all the constraints/indices before the update. No matter whether the column being updated is indexed or not, it seems like PostgreSQL updates all the indices for all the updated rows. After the update is finished, add the constraints/indices back.

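A minimal sketch of that workflow, assuming a single hypothetical secondary index on author_id:

```sql
-- Drop secondary indexes before the bulk update...
DROP INDEX IF EXISTS idx_mytable_author_id;

UPDATE myTable SET generalFreq = 0;

-- ...and rebuild them afterwards.
CREATE INDEX idx_mytable_author_id ON myTable (author_id);
```

Rebuilding an index once over the whole table is typically far cheaper than maintaining it incrementally across millions of row updates.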

#5


2  

Try this (note that generalFreq starts as type REAL, and stays the same):


ALTER TABLE myTable ALTER COLUMN generalFreq TYPE REAL USING 0;

This will rewrite the table, similar to a DROP + CREATE, and rebuild all indices, all in one command. It is much faster (about 2x), and you don't have to deal with dependencies, recreating indexes, and other stuff, though it does lock the table (ACCESS EXCLUSIVE, i.e. a full lock) for the duration. Or maybe that's what you want, if you want everything else to queue up behind it. If you aren't updating "too many" rows, this way is slower than a plain UPDATE.


#6


0  

How are you running it? If you are looping over each row and executing an update statement per row, you are potentially running millions of individual updates, which is why it performs incredibly slowly.


If you run a single UPDATE statement covering all records, it will run a lot faster; and if that is still slow, then it's probably down to your hardware more than anything else. 3 million is a lot of records.


#7


0  

The first thing I'd suggest (from https://dba.stackexchange.com/questions/118178/does-updating-a-row-with-the-same-value-actually-update-the-row) is to only update rows that "need" it, ex:


 UPDATE myTable SET generalFreq = 0 where generalFreq != 0;

(You might also need an index on generalFreq.) Then you'll update fewer rows. That won't help if the values are all non-zero already, but updating fewer rows "can help", since otherwise Postgres rewrites them, and all their indexes, regardless of whether the value changed or not.


Another option: if the stars align in terms of defaults and not-null constraints, you can drop the old column and create another one by just adjusting metadata, in instant time.

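A hedged sketch of that metadata-only variant (note: the column moves to the end of the column list, and the version caveats below matter):

```sql
-- Dropping a column only adjusts metadata; the data is not rewritten.
ALTER TABLE myTable DROP COLUMN generalFreq;

-- Adding it back with no DEFAULT is also metadata-only,
-- but existing rows will then read the column as NULL, not 0.
ALTER TABLE myTable ADD COLUMN generalFreq real;

-- With a DEFAULT, this is metadata-only on PostgreSQL 11+;
-- older versions (including the question's 8.4) rewrite the table:
-- ALTER TABLE myTable ADD COLUMN generalFreq real DEFAULT 0;
```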

#8


-2  

try


UPDATE myTable SET generalFreq = 0.0;

Maybe it is a casting issue.

