Delete duplicate rows from a table that has no unique key

Date: 2023-01-20 19:19:19

How do I delete duplicate rows in a Postgres 9 table? The rows are complete duplicates on every field, and there is no individual field that could be used as a unique key, so I can't just GROUP BY columns and use a NOT IN statement.

I'm looking for a single SQL statement, not a solution that requires me to create a temporary table and insert records into it. I know how to do that, but it requires more work to fit into my automated process.

Table definition:

jthinksearch=> \d releases_labels;
Unlogged table "discogs.releases_labels"
   Column   |  Type   | Modifiers
------------+---------+-----------
 label      | text    |
 release_id | integer |
 catno      | text    |
Indexes:
    "releases_labels_catno_idx" btree (catno)
    "releases_labels_name_idx" btree (label)
Foreign-key constraints:
    "foreign_did" FOREIGN KEY (release_id) REFERENCES release(id)

Sample data:

jthinksearch=> select * from releases_labels  where release_id=6155;
    label     | release_id |   catno
--------------+------------+------------
 Warp Records |       6155 | WAP 39 CDR
 Warp Records |       6155 | WAP 39 CDR

5 solutions

#1


7  

If you can afford to rewrite the whole table, this is probably the simplest approach:

WITH Deleted AS (
  DELETE FROM discogs.releases_labels
  RETURNING *
)
INSERT INTO discogs.releases_labels
SELECT DISTINCT * FROM Deleted
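
Before rewriting the whole table, it may help to see how many duplicates there actually are. The following is a minimal sketch that uses only the table and columns from the question; GROUP BY treats NULLs as equal, so the groups should match what SELECT DISTINCT collapses:

```sql
-- List each group of fully identical rows and how many copies it has.
SELECT label, release_id, catno, count(*) AS copies
FROM   discogs.releases_labels
GROUP  BY label, release_id, catno
HAVING count(*) > 1
ORDER  BY copies DESC;
```

Running the same query again after the rewrite should return no rows.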

If you need to specifically target the duplicated records, you can make use of the internal ctid field, which uniquely identifies a row:

DELETE FROM discogs.releases_labels
WHERE ctid NOT IN (
  SELECT MIN(ctid)
  FROM discogs.releases_labels
  GROUP BY label, release_id, catno
)

Be very careful with ctid; it changes over time. But you can rely on it staying the same within the scope of a single statement.
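
To see why ctid should not be stored or reused across statements, here is a small sketch. The table ctid_demo is a hypothetical scratch table, not part of the question's schema:

```sql
-- Hypothetical scratch table used only for this demonstration.
CREATE TEMP TABLE ctid_demo (val text);
INSERT INTO ctid_demo VALUES ('a');

SELECT ctid, val FROM ctid_demo;   -- note the ctid of the row

-- An UPDATE writes a new row version, which normally lives at a new ctid.
UPDATE ctid_demo SET val = 'b';

SELECT ctid, val FROM ctid_demo;   -- typically a different ctid for the same logical row
```

Operations such as VACUUM FULL can also move rows, so a ctid is only a safe handle inside one statement.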

#2


4  

Single SQL statement

Here is a solution that deletes duplicates in place:

DELETE FROM releases_labels r
WHERE  EXISTS (
   SELECT 1
   FROM   releases_labels r1
   WHERE  r1 = r
   AND    r1.ctid < r.ctid
   );

Since there is no unique key I am (ab)using the tuple ID ctid for the purpose. The physically first row survives in each set of dupes.

ctid is a system column that is not part of the associated row type, so when referencing the whole row with table aliases in the expression r1 = r, only visible columns are compared (not the ctid or others). That's why the whole row can be equal and one ctid is still smaller than the other.
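
One way to check this before deleting anything is to run the same condition as a SELECT. This is only a sketch of a dry run; it reuses the table and aliases from the statement above and adds nothing new:

```sql
-- Preview the rows the DELETE would remove: every row that has an
-- identical twin with a smaller ctid.
SELECT r.ctid, r.*
FROM   releases_labels r
WHERE  EXISTS (
   SELECT 1
   FROM   releases_labels r1
   WHERE  r1 = r               -- whole-row comparison, ctid not included
   AND    r1.ctid < r.ctid     -- keep the physically first copy
   );
```

For the sample data in the question, this should list one of the two "Warp Records" rows and leave the other.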

With only a few duplicates, this is also the fastest of all solutions.
With lots of duplicates, other solutions are faster.

Afterwards, I suggest adding a surrogate primary key so that every row can be identified uniquely from now on:

ALTER TABLE discogs.releases_labels ADD COLUMN releases_labels_id serial PRIMARY KEY;

Why does it work with NULL values?

This is somewhat surprising. The reason is explained in the chapter Composite Type Comparison in the manual:

The SQL specification requires row-wise comparison to return NULL if the result depends on comparing two NULL values or a NULL and a non-NULL. PostgreSQL does this only when comparing the results of two row constructors (as in Section 9.23.5) or comparing a row constructor to the output of a subquery (as in Section 9.22). In other contexts where two composite-type values are compared, two NULL field values are considered equal, and a NULL is considered larger than a non-NULL. This is necessary in order to have consistent sorting and indexing behavior for composite types.

Bold emphasis mine.
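
The difference described in the quoted passage can be tried out with a small sketch. The table null_demo is a hypothetical scratch table introduced only for illustration:

```sql
-- Hypothetical scratch table with two identical rows that contain a NULL.
CREATE TEMP TABLE null_demo (a integer, b text);
INSERT INTO null_demo VALUES (1, NULL), (1, NULL);

-- Composite-type comparison through table aliases: NULL fields count as equal.
SELECT x = y AS whole_row_equal
FROM   null_demo x, null_demo y
WHERE  x.ctid < y.ctid;

-- Row-constructor comparison follows the SQL standard instead.
SELECT ROW(x.a, x.b) = ROW(y.a, y.b) AS row_constructor_equal
FROM   null_demo x, null_demo y
WHERE  x.ctid < y.ctid;
```

Per the manual passage, the first query should report true and the second NULL, which is why the whole-row form finds duplicates even when fields are NULL.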

Alternatives with second table

I removed that section, because the solution with a data-modifying CTE provided by @Nick is better.

#3


0  

You can try like this:

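-- Build the deduplicated copy in the same schema, then swap it in.
CREATE TABLE discogs.temp AS SELECT DISTINCT * FROM discogs.releases_labels;
DROP TABLE discogs.releases_labels;
-- RENAME TO takes an unqualified name; the table stays in schema discogs.
ALTER TABLE discogs.temp RENAME TO releases_labels;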
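
Note that a table created from a SELECT does not inherit indexes or constraints, so they would have to be recreated after the swap. A sketch, assuming the index and constraint names shown in the question's table definition and that release(id) resolves through the current search_path:

```sql
-- Recreate the two indexes listed in the original \d output.
CREATE INDEX releases_labels_catno_idx ON discogs.releases_labels (catno);
CREATE INDEX releases_labels_name_idx  ON discogs.releases_labels (label);

-- Recreate the foreign key listed in the original \d output.
ALTER TABLE discogs.releases_labels
  ADD CONSTRAINT foreign_did FOREIGN KEY (release_id) REFERENCES release(id);
```

The original table was UNLOGGED; a plain CREATE TABLE ... AS produces a regular logged table, so add the UNLOGGED keyword there if that property matters.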

#4


0  

As you have no primary key, there is no easy way to distinguish one duplicated line from any other one. That's one of the reasons why it is highly recommended that any table have a primary key (*).

So you are left with only two solutions:

  • use a temporary table as suggested by Rahul (IMHO the simpler and cleaner way) (**)
  • use procedural SQL and a cursor, either from a procedural language such as Python or [put your preferred language here] or with PL/pgSQL. Something like this (beware, untested):

    CREATE OR REPLACE FUNCTION deduplicate() RETURNS integer AS $$
    DECLARE
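     -- FOR UPDATE guarantees the cursor can be used with WHERE CURRENT OF despite the ORDER BY
     curs CURSOR FOR SELECT * FROM releases_labels ORDER BY label, release_id, catno FOR UPDATE;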
     r releases_labels%ROWTYPE;
     old releases_labels%ROWTYPE;
     n integer;
    BEGIN
     n := 0;
     old := NULL;
     FOR rec IN curs LOOP
      r := rec;
      IF r = old THEN
       DELETE FROM releases_labels WHERE CURRENT OF curs;
       n := n + 1;
      END IF;
      old := rec;
     END LOOP;
     RETURN n;
    END;
    $$ LANGUAGE plpgsql;
    
    SELECT deduplicate();
    

    This should delete the duplicate lines and return the number of lines actually deleted. It is not necessarily the most efficient way, but you only touch rows that need to be deleted, so you will not have to lock the whole table.

(*) Fortunately, PostgreSQL offers the ctid pseudo column, which you can use as a key. If your table contains an oid column, you can also use that, as it will never change.

(**) PostgreSQL's WITH allows you to do that in a single SQL statement.

These two points come from Nick Barnes's answer.

#5


0  

Since you also need to avoid duplicates in the future, you could add a surrogate key and a unique constraint while deduplicating:

-- add surrogate key
ALTER TABLE releases_labels
        ADD column id SERIAL NOT NULL PRIMARY KEY
        ;

-- verify
SELECT * FROM releases_labels;

DELETE FROM releases_labels dd
WHERE EXISTS (SELECT *
        FROM releases_labels x
        WHERE x.label = dd.label
        AND x.release_id = dd.release_id
        AND x.catno = dd.catno
        AND x.id < dd.id
        );

-- verify
SELECT * FROM releases_labels;

-- add unique constraint for the natural key
ALTER TABLE releases_labels
        ADD UNIQUE (label,release_id,catno)
        ;

-- verify
SELECT * FROM releases_labels;
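
Once the constraint is in place, its effect can be checked with a sketch like the following. It assumes the sample row from the question is still present and that the surrogate id column fills itself in:

```sql
-- Expected to fail with a unique-violation error instead of adding a second copy.
INSERT INTO releases_labels (label, release_id, catno)
VALUES ('Warp Records', 6155, 'WAP 39 CDR');

-- Caveat: a unique constraint treats NULLs as distinct, so rows that differ
-- only by id and contain a NULL key column are still accepted.
BEGIN;
INSERT INTO releases_labels (label, release_id, catno)
VALUES ('Warp Records', 6155, NULL),
       ('Warp Records', 6155, NULL);
ROLLBACK;  -- discard the test rows
```

If NULLs can occur in these columns, the constraint alone will not prevent every duplicate, so declaring the columns NOT NULL may be worth considering.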
