What is the best way to dedupe a table?

Time: 2021-10-21 20:24:18

I've seen a couple of solutions for this, but I'm wondering what the best and most efficient way is to de-dupe a table. You can use code (SQL, etc.) to illustrate your point, but I'm just looking for basic algorithms. I assumed there would already be a question about this on SO, but I wasn't able to find one, so if it already exists just give me a heads up.


(Just to clarify - I'm referring to getting rid of duplicates in a table that has an incremental automatic PK and has some rows that are duplicates in everything but the PK field.)


13 Answers

#1


9  

SELECT DISTINCT <insert all columns but the PK here> FROM foo. Create a temp table using that query (the syntax varies by RDBMS but there's typically a SELECT … INTO or CREATE TABLE AS pattern available), then blow away the old table and pump the data from the temp table back into it.

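A minimal sketch of that approach, assuming SQL Server syntax, where foo has an auto-increment PK and col1, col2, col3 stand in for all the other columns:

SELECT DISTINCT col1, col2, col3      -- every column except the auto-increment PK
INTO #dedup                           -- temp table holding one copy of each tuple
FROM foo;

TRUNCATE TABLE foo;                   -- blow away the old rows

INSERT INTO foo (col1, col2, col3)    -- pump the data back in; the PK regenerates
SELECT col1, col2, col3 FROM #dedup;

DROP TABLE #dedup;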

#2


8  

Using the analytic function ROW_NUMBER():


WITH CTE (col1, col2, dupcnt)
AS
(
SELECT col1, col2,
ROW_NUMBER() OVER (PARTITION BY col1, col2 ORDER BY col1) AS dupcnt
FROM YourTable
)
DELETE
FROM CTE          -- deleting through the CTE deletes the underlying rows in YourTable
WHERE dupcnt > 1  -- row 1 of each (col1, col2) group survives; the rest are removed
GO

#3


7  

Deduping is rarely simple. That's because the records to be deduped often have slightly different values in some of the fields, so choosing which record to keep can be problematic. Further, dups are often people records, and it is hard to tell whether two John Smiths are two people or one person who is duplicated. So spend a lot (50% or more of the whole project) of your time defining what constitutes a dup and how to handle the differences and child records.

How do you know which is the correct value? Further, deduping requires that you handle all child records without orphaning any. What happens when you find that by changing the id on a child record you are suddenly violating one of the unique indexes or constraints? This will happen eventually, and your process needs to handle it. If you have foolishly chosen to apply all your constraints only through the application, you may not even know the constraints are violated. When you have 10,000 records to dedupe, you aren't going to go through the application to dedupe one at a time. If the constraint isn't in the database, good luck maintaining data integrity when you dedupe.

A further complication is that dups don't always match exactly on the name or address. For instance, a sales rep named Joan Martin may be a dup of a sales rep named Joan Martin-Jones, especially if they have the same address and email. Or you could have John or Johnny in the name. Or the same street address, except one record abbreviated it as St. and one spelled out Street. In SQL Server you can use SSIS and fuzzy grouping to identify near matches as well. These are often the most common dups, since the fact that they weren't exact matches is why they got entered as dups in the first place.

For some types of deduping, you may need a user interface, so that the person doing the deduping can choose which of two values to use for a particular field. This is especially true if the person being deduped is in two or more roles. It could be that the data for a particular role is usually better than the data for another role. Or it could be that only the users will know for sure which is the correct value, or they may need to contact people to find out if they are genuinely dups or simply two people with the same name.

#4


5  

Adding the actual code here for future reference


So, there are 3 steps, and therefore 3 SQL statements:


Step 1: Move the non duplicates (unique tuples) into a temporary table


-- MySQL-style: SELECT * with GROUP BY keeps one (arbitrary) row per group
CREATE TABLE new_table AS
SELECT * FROM old_table WHERE 1 GROUP BY [column to remove duplicates by];

Step 2: Delete the old table (or rename it). We no longer need the table with all the duplicate entries, so drop it!

DROP TABLE old_table;

Step 3: rename the new_table to the name of the old_table


RENAME TABLE new_table TO old_table;

And of course, don't forget to fix your buggy code to stop inserting duplicates!

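One way to sketch that last step, assuming column1 and column2 are the columns that should be unique together (the constraint name is just an example):

ALTER TABLE old_table
ADD CONSTRAINT uq_old_table_dedup UNIQUE (column1, column2);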

#5


3  

Here's the method I use if you can get your dupe criteria into a group by statement and your table has an id identity column for uniqueness:


delete t
from tablename t
inner join  
(
    select date_time, min(id) as min_id
    from tablename
    group by date_time
    having count(*) > 1
) t2 on t.date_time = t2.date_time
where t.id > t2.min_id

In this example, date_time is the grouping criterion; if you have more than one column, make sure to join on all of them.

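For instance, with two grouping columns (col_a and col_b here are placeholder names) the same pattern might look like:

delete t
from tablename t
inner join
(
    select col_a, col_b, min(id) as min_id
    from tablename
    group by col_a, col_b
    having count(*) > 1
) t2 on t.col_a = t2.col_a
    and t.col_b = t2.col_b
where t.id > t2.min_id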

#6


1  

For those of you who prefer a quick and dirty approach, just list all the columns that together define a unique record and create a unique index with those columns, like so:


ALTER IGNORE TABLE TABLE_NAME ADD UNIQUE (column1,column2,column3)


You can drop the unique index afterwards.

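Note that ALTER IGNORE is MySQL-specific syntax and was removed in MySQL 5.7, so this trick only applies to older MySQL versions. Dropping the index afterwards might look roughly like this; the index name is an assumption (MySQL usually names it after the first column), so check SHOW INDEX first:

SHOW INDEX FROM TABLE_NAME;                 -- find the name MySQL assigned to the unique index
ALTER TABLE TABLE_NAME DROP INDEX column1;  -- assumed name; use whatever SHOW INDEX reports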

#7


1  

I am taking the one from DShook and providing a dedupe example where you would keep only the record with the highest date.


In this example say I have 3 records all with the same app_id, and I only want to keep the one with the highest date:


DELETE t
FROM @USER_OUTBOX_APPS t
INNER JOIN  
(
    SELECT 
         app_id
        ,max(processed_date) as max_processed_date
    FROM @USER_OUTBOX_APPS
    GROUP BY app_id
    HAVING count(*) > 1
) t2 on 
    t.app_id = t2.app_id
WHERE 
    t.processed_date < t2.max_processed_date

#8


0  

For SQL (MySQL syntax), you can use INSERT IGNORE INTO table SELECT xy FROM unkeyed_table;


For an algorithm: if you can assume that to-be-primary keys may be repeated, but a to-be-primary key uniquely identifies the content of its row, then hash only the to-be-primary key and check for repetition.

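A rough sketch of both ideas in MySQL syntax, where keyed_table, unkeyed_table, key_col and payload are placeholder names:

-- let a unique/primary key on key_col filter the duplicates on insert
INSERT IGNORE INTO keyed_table (key_col, payload)
SELECT key_col, payload FROM unkeyed_table;

-- or just check which to-be-primary keys repeat
SELECT key_col, COUNT(*) AS cnt
FROM unkeyed_table
GROUP BY key_col
HAVING COUNT(*) > 1;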

#9


0  

I think this should require nothing more than grouping by all the columns except the id and choosing one row from every group - for simplicity just the first row, though which row you pick does not actually matter unless you have additional constraints on the id.

Or the other way around to get rid of the rows ... just delete all rows except a single one from each group.

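A minimal sketch of that delete, assuming an auto-increment id and that col1 and col2 are all the other columns:

DELETE FROM YourTable
WHERE id NOT IN (
    SELECT keep_id FROM (
        SELECT MIN(id) AS keep_id      -- one arbitrary survivor per group
        FROM YourTable
        GROUP BY col1, col2            -- every column except the id
    ) AS keepers
);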

#10


0  

You could generate a hash for each row (excluding the PK), store it in a new column (or if you can't add new columns, can you move the table to a temp staging area?), and then look for all other rows with the same hash. Of course, you would have to be able to ensure that your hash function doesn't produce the same code for different rows.

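A sketch of that in SQL Server syntax, where col1, col2, col3 stand in for every non-PK column and the pipe-separated concatenation is just one assumed way to serialize a row:

ALTER TABLE foo ADD row_hash VARBINARY(32);

UPDATE foo
SET row_hash = HASHBYTES('SHA2_256', CONCAT(col1, '|', col2, '|', col3));

SELECT row_hash, COUNT(*) AS cnt       -- hashes that appear more than once mark candidate dups
FROM foo
GROUP BY row_hash
HAVING COUNT(*) > 1;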

If two rows are duplicate, does it matter which you get rid of? Is it possible that other data are dependent on both of the duplicates? If so, you will have to go through a few steps:


  • Find the dupes
  • Choose one of them as dupeA to eliminate
  • Find all data dependent on dupeA
  • Alter that data to refer to dupeB
  • delete dupeA.

This could be easy or complicated, depending on your existing data model.


This whole scenario sounds like a maintenance and redesign project. If so, best of luck!!


#11


0  

This can dedupe the duplicated values in c1:


-- MINUS is Oracle syntax (most other databases use EXCEPT);
-- for each c1, only the row with the smallest c2 survives
select * from foo
minus
select f1.* from foo f1, foo f2
where f1.c1 = f2.c1 and f1.c2 > f2.c2

#12


0  

Here's one I've run into, in real life.


Assume you have a table of external/3rd party logins for users, and you're going to merge two users and want to dedupe on the provider/provider key values.


    ;WITH Logins AS
    (
        SELECT [LoginId],[UserId],[Provider],[ProviderKey]
        FROM [dbo].[UserLogin] 
        WHERE [UserId]=@FromUserID -- is the user we're deleting
              OR [UserId]=@ToUserID -- is the user we're moving data to
    ), Ranked AS 
    (
        SELECT Logins.*
            , [Picker]=ROW_NUMBER() OVER (
                       PARTITION BY [Provider],[ProviderKey]
                       ORDER BY CASE WHEN [UserId]=@FromUserID THEN 1 ELSE 0 END)
        FROM Logins
    )
    MERGE Logins AS T
    USING Ranked AS S
    ON S.[LoginId]=T.[LoginID]
    WHEN MATCHED AND S.[Picker]>1 -- duplicate Provider/ProviderKey
                 AND T.[UserID]=@FromUserID -- safety check 
    THEN DELETE
    WHEN MATCHED AND S.[Picker]=1 -- the only or best one
                 AND T.[UserID]=@FromUserID
    THEN UPDATE SET T.[UserID]=@ToUserID
    OUTPUT $action, DELETED.*, INSERTED.*;

#13


0  

These methods will work, but without an explicit id as a PK, determining which rows to delete could be a problem. Bouncing the rows out into a temp table, deleting from the original, and re-inserting without the dupes seems to be the simplest.
