Why does a missing primary/unique key cause deadlocks on upsert?

Date: 2022-10-04 00:04:07

I came across a schema and an upsert stored procedure that was causing deadlock issues. I have a general idea of why it deadlocks and how to fix it, and I can reproduce it, but I don't have a clear understanding of the sequence of steps that causes it. It would be great if someone could explain clearly why this causes a deadlock.

Here is the schema and the stored procedures. This code is being executed on PostgreSQL 9.2.2.

CREATE TABLE counters (
  count_type INTEGER NOT NULL,
  count_id   INTEGER NOT NULL,
  count      INTEGER NOT NULL
);


CREATE TABLE primary_relation (
  id INTEGER PRIMARY KEY,
  a_counter INTEGER NOT NULL DEFAULT 0
);

INSERT INTO primary_relation
SELECT i FROM generate_series(1,5) AS i;

CREATE OR REPLACE FUNCTION increment_count(ctype integer, cid integer, i integer) RETURNS VOID
AS $$
BEGIN
    LOOP
        -- Try the UPDATE first; FOUND is true if at least one row was updated.
        UPDATE counters
        SET count = count + i
        WHERE count_type = ctype AND count_id = cid;
        IF FOUND THEN
            RETURN;
        END IF;
        -- No row yet: try to INSERT it; swallow any error from the INSERT
        -- (e.g. a unique_violation from a concurrent insert) and retry the loop.
        BEGIN
            INSERT INTO counters (count_type, count_id, count)
            VALUES (ctype, cid, i);
            RETURN;
        EXCEPTION WHEN OTHERS THEN
        END;
    END LOOP;
END;
$$
LANGUAGE PLPGSQL;

CREATE OR REPLACE FUNCTION update_primary_a_count(ctype integer) RETURNS VOID
AS $$
  WITH deleted_counts_cte AS (
      DELETE
      FROM counters
      WHERE count_type = ctype
      RETURNING *
  ), rollup_cte AS (
      SELECT count_id, SUM(count) AS count
      FROM deleted_counts_cte
      GROUP BY count_id
      HAVING SUM(count) <> 0
  )
  UPDATE primary_relation
  SET a_counter = a_counter + rollup_cte.count
  FROM rollup_cte
  WHERE primary_relation.id = rollup_cte.count_id
$$ LANGUAGE SQL;

And here is a python script to reproduce the deadlock.

import os
import random
import time
import psycopg2

COUNTERS = 5 
THREADS = 10
ITERATIONS = 500 

def increment():
  outf = open('synctest.out.%d' % os.getpid(), 'w')
  conn = psycopg2.connect(database="test")
  cur = conn.cursor()
  for i in range(0,ITERATIONS):
    time.sleep(random.random())
    start = time.time()
    cur.execute("SELECT increment_count(0, %s, 1)", [random.randint(1,COUNTERS)])
    conn.commit()
    outf.write("%f\n" % (time.time() - start))
  conn.close()
  outf.close()

def update(n):
  outf = open('synctest.update', 'w')
  conn = psycopg2.connect(database="test")
  cur = conn.cursor()
  for i in range(0,n):
    time.sleep(random.random())
    start = time.time()
    cur.execute("SELECT update_primary_a_count(0)")
    conn.commit()
    outf.write("%f\n" % (time.time() - start))
  conn.close()

pids = []
for i in range(THREADS):
  pid = os.fork()
  if pid != 0:
    print 'Process %d spawned' % pid 
    pids.append(pid)
  else:
    print 'Starting child %d' % os.getpid()
    increment()
    print 'Exiting child %d' % os.getpid()
    os._exit(0)

update(ITERATIONS)
for pid in pids:
  print "waiting on %d" % pid 
  os.waitpid(pid, 0)

# cleanup
update(1)

I recognize that one issue with this is that the upsert can produce duplicate rows (with multiple writers), which will likely result in some double counting. But why does this result in deadlock?
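
For concreteness, here is a minimal sketch of the interleaving that I believe produces those duplicates; the session labels A and B are just illustrative, and no row for (0, 1) exists yet:

-- A: SELECT increment_count(0, 1, 1);
--      -- the UPDATE matches nothing, so A takes the INSERT path
-- B: SELECT increment_count(0, 1, 1);
--      -- A has not committed, so B's UPDATE also matches nothing
--      -- and B inserts a second row for (0, 1)
-- A: COMMIT;
-- B: COMMIT;
--
-- counters now holds two rows for (0, 1); every later UPDATE for that key
-- increments both of them, which is where the double counting comes from.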

The error I get from PostgreSQL is something like the following:

process 91924 detected deadlock while waiting for ShareLock on transaction 4683083 after 100.559 ms",,,,,"SQL statement ""UPDATE counters

And the client spews something like this:

psycopg2.extensions.TransactionRollbackError: deadlock detected
DETAIL:  Process 91924 waits for ShareLock on transaction 4683083; blocked by process 91933.
Process 91933 waits for ShareLock on transaction 4683079; blocked by process 91924.
HINT:  See server log for query details.CONTEXT:  SQL statement "UPDATE counters
         SET count = count + i
         WHERE count_type = ctype AND count_id = cid"
PL/pgSQL function increment_count(integer,integer,integer) line 4 at SQL statement

To fix the issue, you need to add a primary key like so:

ALTER TABLE counters ADD PRIMARY KEY (count_type, count_id);
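
As an aside: this same unique constraint is what would let the retry loop in increment_count be replaced with a native upsert on PostgreSQL 9.5 or later (it is not available on the 9.2 server used here); shown only for comparison:

-- Requires PostgreSQL 9.5+; equivalent in spirit to increment_count(0, 1, 1).
INSERT INTO counters (count_type, count_id, count)
VALUES (0, 1, 1)
ON CONFLICT (count_type, count_id)
DO UPDATE SET count = counters.count + EXCLUDED.count;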

Any insight would be greatly appreciated. Thanks!

3 Answers

#1

Because of the primary key, the number of rows in this table stays small (it can never exceed the number of distinct (count_type, count_id) pairs), and the primary key ensures that no row is repeated.

When you remove the primary key, some of the threads lag behind, the number of rows grows, and rows for the same counter get repeated. Once rows are repeated, each update takes longer and two or more threads end up trying to update the same row(s).
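
A minimal sketch of one interleaving that produces exactly the error above; the row labels r1/r2 and the session labels A/B are illustrative, and the second session could equally be running the DELETE inside update_primary_a_count():

-- counters currently holds two rows, r1 and r2, both for (count_type, count_id) = (0, 1);
-- earlier updates have scattered the live row versions, so two scans can visit
-- them in different orders.
--
-- A: UPDATE counters SET count = count + 1
--    WHERE count_type = 0 AND count_id = 1;     -- locks r1, then blocks on r2
-- B: DELETE FROM counters WHERE count_type = 0; -- locked r2 first, now blocks on r1
--
-- A waits for B's transaction and B waits for A's: the same pair of
-- "ShareLock on transaction" waits reported in the error message.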

Open a new terminal and type:

watch --interval 1 "psql -tc \"select count(*) from counters\" test"

Try this with and without the primary key. When you get the first deadlock, look at the results of the query above. In my case this is what I am left with in the table counters:

test=# select * from counters order by 2;
 count_type | count_id | count 
------------+----------+-------
          0 |        1 |   735
          0 |        1 |   733
          0 |        1 |   735
          0 |        1 |   735
          0 |        2 |   916
          0 |        2 |   914
          0 |        2 |   914
          0 |        3 |   882
          0 |        4 |   999
          0 |        5 |   691
          0 |        5 |   692
(11 rows)
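
It can also help to watch which backends are blocked while the test runs. This is a simplified sketch of such a query, restricted to the transaction-id waits shown in the error above (the column aliases are just for readability):

-- Show which backend is waiting on which other backend's transaction.
SELECT waiting.pid  AS waiting_pid,
       waiting.mode AS requested_mode,
       holder.pid   AS holding_pid
FROM pg_locks AS waiting
JOIN pg_locks AS holder
  ON  holder.locktype = waiting.locktype
  AND holder.transactionid = waiting.transactionid
  AND holder.granted
  AND holder.pid <> waiting.pid
WHERE NOT waiting.granted
  AND waiting.locktype = 'transactionid';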

#2

Your code is the perfect recipe for a race condition (multiple threads, random sleeps). The problem is most probably a locking issue. Since you don't mention the locking mode, I'm going to assume a page-based lock, which gives you the following scenario:

  1. Thread 1 starts and begins to insert records; say it locks page 1 and next needs to lock page 2.
  2. Thread 2 starts at the same time, but it locks page 2 first and next needs to lock page 1.
  3. Both threads are now waiting on each other to finish, so you have a deadlock.

Now, why does a PK fix it?

Because locking goes through the index first, the race condition is mitigated: the PK is unique, so on inserts all threads wait on the index entry, and updates reach the row via the index, so the record is locked based on its PK.
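
A minimal sketch of how the retry loop in increment_count behaves once the primary key exists; the session labels A and B are illustrative, and no row for (0, 1) exists yet:

-- A: SELECT increment_count(0, 1, 1);  -- UPDATE finds nothing, INSERT succeeds
-- B: SELECT increment_count(0, 1, 1);  -- UPDATE finds nothing; the INSERT blocks on
--                                      -- the unique index entry until A commits, then
--                                      -- raises unique_violation, which the EXCEPTION
--                                      -- block swallows; the loop retries and the
--                                      -- UPDATE now locks the single existing row
--
-- A writer never waits while already holding a second row lock for the same key,
-- so a two-row lock cycle like the one sketched above can no longer form.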

#3

At some point one user is waiting on a lock another user has, while the first user owns a lock that the second user wants. This is what causes a deadlock.

At a guess, it's because without a primary key (or in fact any key), the UPDATE on counters in your increment procedure has to read the whole table, and the same goes for the primary_relation table. This is going to leave locks strewn about and open the way for a deadlock. I'm not a Postgres user, so I don't know exactly when it places locks, but I'm pretty sure this is what is happening.

Putting a PK on counters makes it possible for the DB to target exactly the rows it reads and take the minimum number of locks. You should really have a PK on primary_relation too!
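
One way to check that claim (the exact plan will vary with table size and statistics; this only shows how to look, not a guaranteed output):

-- Compare the plan before and after adding the primary key. Without an index on
-- (count_type, count_id) this is normally a Seq Scan over all of counters; with
-- the primary key it can use the index to touch only the matching row.
EXPLAIN
UPDATE counters
SET count = count + 1
WHERE count_type = 0 AND count_id = 1;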
