MySQL:消除重复行,而不破坏外键约束。

时间:2022-05-17 04:37:04

I've got a customer database filled with normalized addresses. There are duplicates.

我有一个客户数据库,里面充满了标准化的地址。有重复的。

Each user created their own record, and entered their own address. So we have a 1-to-1 relationship between the users and the addresses:

每个用户创建自己的记录,并输入自己的地址。所以我们在用户和地址之间有1比1的关系:

CREATE TABLE `users` (
  `UserID` INT UNSIGNED NOT NULL AUTO_INCREMENT,
  `Name` VARCHAR(63),
  `Email` VARCHAR(63),
  `AddressID` INT UNSIGNED,
  PRIMARY KEY (`UserID`) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE `addresses` (
  `AddressID` INT UNSIGNED NOT NULL AUTO_INCREMENT,
  `Duplicate` VARCHAR(1),
  `Address1` VARCHAR(63) DEFAULT NULL,
  `Address2` VARCHAR(63) DEFAULT NULL,
  `City` VARCHAR(63) DEFAULT NULL,
  `State` VARCHAR(2) DEFAULT NULL,
  `ZIP` VARCHAR(10) DEFAULT NULL,
  PRIMARY KEY (`AddressID`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;  

And the data:

和数据:

INSERT INTO `users` VALUES
    (1,  'Michael', 'michael@email.com', 1),
    (2,  'Steve',   'steve@email.com',   2),
    (3,  'Judy',    'judy@email.com',    3),
    (4,  'Kathy',   'kathy@email.com',   4),
    (5,  'Mark',    'mark@email.com',    5),
    (6,  'Robert',  'robert@email.com',  6),
    (7,  'Susan',   'susan@email.com',   7),
    (8,  'Paul',    'paul@email.com',    8),
    (9,  'Patrick', 'patrick@email.com', 9),
    (10, 'Mary',    'mary@email.com',    10),
    (11, 'James',   'james@email.com',   11),
    (12, 'Barbara', 'barbara@email.com', 12),
    (13, 'Peter',   'peter@email.com',   13);


INSERT INTO `addresses` VALUES
    (1,  '',  '1234 Main Street',   '',      'Springfield', 'KS', '54321'),
    (2,  'Y', '1234 Main Street',   '',      'Springfield', 'KS', '54321'),
    (3,  'Y', '1234 Main Street',   '',      'Springfield', 'KS', '54321'),
    (4,  '',  '5678 Sycamore Lane', '',      'Upstate',     'NY', '50000'),
    (5,  '',  '1000 State Street',  'Apt C', 'Sunnydale',   'OH', '54321'),
    (6,  'Y', '1234 Main Street',   '',      'Springfield', 'KS', '54321'),
    (7,  'Y', '1000 State Street',  'Apt C', 'Sunnydale',   'OH', '54321'),
    (8,  'Y', '1234 Main Street',   '',      'Springfield', 'KS', '54321'),
    (9,  '',  '1000 State Street',  'Apt A', 'Sunnydale',   'OH', '54321'),
    (10, 'Y', '1234 Main Street',   '',      'Springfield', 'KS', '54321'),
    (11, 'Y', '5678 Sycamore Lane', '',      'Upstate',     'NY', '50000'),
    (12, 'Y', '1000 Main Street',   'Apt A', 'Sunnydale',   'OH', '54321'),
    (13, '',  '9999 Valleyview',    '',      'Springfield', 'KS', '54321');

Oh yes, let me add in that foreign key relationship:

哦,是的,让我补充一下外键关系:

ALTER TABLE `users` ADD CONSTRAINT `AddressID` 
FOREIGN KEY `AddressID` (`AddressID`)
REFERENCES `addresses` (`AddressID`);

We had our address list scrubbed by a 3rd-party service that normalized the data and indicated where we had duplicates. This is where the Duplicate column came from. If there is a 'Y', it is a duplicate of another address. The primary address is NOT marked as a duplicate, as shown in the sample data.

我们的地址列表被一个第三方服务删除了,该服务将数据规范化,并指出我们的副本在哪里。这就是复制列的来源。如果有一个“Y”,它是另一个地址的复制品。主地址不标记为重复地址,如示例数据所示。

I obviously want to remove all of the duplicate records, but there are user records that point to them. I need them to point to the version of the address that is NOT a duplicate.

显然,我想删除所有重复的记录,但是有指向它们的用户记录。我需要他们指出不是重复地址的版本。

So how can I update the AddressID in users to match the non-duplicate addresses?

那么,如何更新用户的地址以匹配非重复地址呢?

The only way I can think to do it is by iterating through all of the data using a high-level language, but I'm fairly sure that MySQL has all the tools required to do something like this in a better way.

我能想到的唯一方法是使用高级语言遍历所有的数据,但是我很确定MySQL拥有以更好的方式做这些事情所需的所有工具。

Here's what I've tried:

这是我试过:

SELECT COUNT(*) as cnt, GROUP_CONCAT(AddressID ORDER BY AddressID) AS ids
FROM addresses
GROUP BY Address1, Address2, City, State, ZIP
HAVING cnt > 1;

+-----+--------------+
| cnt | ids          |
+-----+--------------+
|   2 | 5,7          |
|   6 | 1,2,3,6,8,10 |
|   2 | 4,11         |
+-----+--------------+
3 rows in set (0.00 sec)

From there, I could loop through each result row and do this:

从这里开始,我可以循环遍历每个结果行,然后这样做:

UPDATE `users` SET `AddressID` = 1 WHERE `AddressID` IN (2,3,6,8,10);

But there has got to be a better MySQL-only way, shouldn't there?

但是一定要有更好的mysq -only的方式,不是吗?

When everything is said and done, the data SHOULD look like this:

当一切都说了又做了,数据应该是这样的:

SELECT * FROM `users`;
+--------+---------+-------------------+-----------+
| UserID | Name    | Email             | AddressID |
+--------+---------+-------------------+-----------+
|      1 | Michael | michael@email.com |         1 |
|      2 | Steve   | steve@email.com   |         1 |
|      3 | Judy    | judy@email.com    |         1 |
|      4 | Kathy   | kathy@email.com   |         4 |
|      5 | Mark    | mark@email.com    |         5 |
|      6 | Robert  | robert@email.com  |         1 |
|      7 | Susan   | susan@email.com   |         5 |
|      8 | Paul    | paul@email.com    |         1 |
|      9 | Patrick | patrick@email.com |         9 |
|     10 | Mary    | mary@email.com    |         1 |
|     11 | James   | james@email.com   |         4 |
|     12 | Barbara | barbara@email.com |         1 |
|     13 | Peter   | peter@email.com   |        13 |
+--------+---------+-------------------+-----------+
13 rows in set (0.00 sec)

SELECT * FROM `addresses`;
+-----------+-----------+--------------------+----------+-------------+-------+-------+
| AddressID | Duplicate | Address1           | Address2 | City        | State | ZIP   |
+-----------+-----------+--------------------+----------+-------------+-------+-------+
|         1 |           | 1234 Main Street   |          | Springfield | KS    | 54321 |
|         4 |           | 5678 Sycamore Lane |          | Upstate     | NY    | 50000 |
|         5 |           | 1000 State Street  | Apt C    | Sunnydale   | OH    | 54321 |
|         9 |           | 1000 State Street  | Apt A    | Sunnydale   | OH    | 54321 |
|        13 |           | 9999 Valleyview    |          | Springfield | KS    | 54321 |
+-----------+-----------+--------------------+----------+-------------+-------+-------+
5 rows in set (0.00 sec)

Help?

帮助吗?

2 个解决方案

#1


1  

To select the results you want to see:

选择您想要看到的结果:

SELECT   a.UserID
        ,a.Name
        ,a.Email
        ,(
            SELECT  addressID 
            FROM    addresses c
            WHERE   c.Address1 = b.Address1
            AND     c.Address2 = b.Address2
            AND     c.City = b.City
            AND     c.State = b.State
            AND     c.ZIP = b.ZIP
            AND     DUPLICATE != 'Y'

        ) as AddressID
FROM    users a
JOIN    addresses b
ON      a.AddressID = b.AddressID

This will update the users table to the results shown in the query above.

这会将users表更新为上面查询中显示的结果。

UPDATE  users a
JOIN    addresses b
ON      a.AddressID = b.AddressID
SET     a.addressID  = 
        (
            SELECT  addressID 
            FROM    addresses c
            WHERE   c.Address1 = b.Address1
            AND     c.Address2 = b.Address2
            AND     c.City = b.City
            AND     c.State = b.State
            AND     c.ZIP = b.ZIP
            AND     Duplicate != 'Y'
        )
WHERE Duplicate = 'Y'

Note that with the sample data you provided, #12 Barbara's ID is null in the SELECT query since her address is marked as duplicate when in fact it is unique to the list provided. It does not match address 1 as indicated in the "how it should look" results.

注意,对于您提供的示例数据,#12 Barbara的ID在SELECT查询中为空,因为她的地址被标记为重复,而实际上它对所提供的列表是唯一的。它不匹配地址1,如“它应该如何看起来”结果所示。

Edit

编辑

In order to handle incorrect duplicate flags like #12 Barbara, or maybe other missed duplicates that have not marked as such, you can skip the duplicate flag column check and just use ORDER BY & LIMIT on the sub-query so that it will return the first lowest matching address ID, regardless of the duplicate flag:

为了处理不正确的复制旗帜像# 12芭芭拉,或者其他错过副本没有标记,您可以跳过复制标志列检查和只使用order BY和限制子查询,这样它会返回第一个匹配的最低地址ID,无论重复标志:

UPDATE  users a
JOIN    addresses b
ON      a.AddressID = b.AddressID
SET     a.addressID = 
        (
            SELECT      addressID 
            FROM        addresses c
            WHERE       c.Address1 = b.Address1
            AND         c.Address2 = b.Address2
            AND         c.City = b.City
            AND         c.State = b.State
            AND         c.ZIP = b.ZIP
            ORDER BY    c.addressID ASC
            LIMIT       1
        )

#2


2  

You have a many-to-one relationship between users and addresses (that is multiple users can map to the same address). This seems a bit odd to me, but I suppose it could be useful. Many-to-many would make more sense, i.e. a user can have multiple addresses, but the same address can be shared by multiple users. Generally, a single user has multiple addresses. Updating your schema may help, but I digress.

用户和地址之间有多对一的关系(即多个用户可以映射到同一个地址)。这对我来说似乎有点奇怪,但我想它可能有用。多对多可能更有意义,即用户可以有多个地址,但是相同的地址可以由多个用户共享。通常,一个用户有多个地址。更新您的模式可能有帮助,但是我离题了。

UPDATE users
-- We only care about users mapped to duplicate addresses
JOIN addresses dupe ON (users.AddressID = dupe.AddressID AND dupe.Duplicate='Y')
-- If your normalizer thingy worked right, these will all map to non-duplicates
JOIN addresses nondupe ON (dupe.Address1 = nondupe.Address1
    -- Compare to other columns if you want
    AND nondupe.Duplicate = '')
-- Set to the nondupe ID
SET users.AddressID = nondupe.AddressID;

http://sqlfiddle.com/#!2/5d303/1

http://sqlfiddle.com/ ! 2/5d303/1

#1


1  

To select the results you want to see:

选择您想要看到的结果:

SELECT   a.UserID
        ,a.Name
        ,a.Email
        ,(
            SELECT  addressID 
            FROM    addresses c
            WHERE   c.Address1 = b.Address1
            AND     c.Address2 = b.Address2
            AND     c.City = b.City
            AND     c.State = b.State
            AND     c.ZIP = b.ZIP
            AND     DUPLICATE != 'Y'

        ) as AddressID
FROM    users a
JOIN    addresses b
ON      a.AddressID = b.AddressID

This will update the users table to the results shown in the query above.

这会将users表更新为上面查询中显示的结果。

UPDATE  users a
JOIN    addresses b
ON      a.AddressID = b.AddressID
SET     a.addressID  = 
        (
            SELECT  addressID 
            FROM    addresses c
            WHERE   c.Address1 = b.Address1
            AND     c.Address2 = b.Address2
            AND     c.City = b.City
            AND     c.State = b.State
            AND     c.ZIP = b.ZIP
            AND     Duplicate != 'Y'
        )
WHERE Duplicate = 'Y'

Note that with the sample data you provided, #12 Barbara's ID is null in the SELECT query since her address is marked as duplicate when in fact it is unique to the list provided. It does not match address 1 as indicated in the "how it should look" results.

注意,对于您提供的示例数据,#12 Barbara的ID在SELECT查询中为空,因为她的地址被标记为重复,而实际上它对所提供的列表是唯一的。它不匹配地址1,如“它应该如何看起来”结果所示。

Edit

编辑

In order to handle incorrect duplicate flags like #12 Barbara, or maybe other missed duplicates that have not marked as such, you can skip the duplicate flag column check and just use ORDER BY & LIMIT on the sub-query so that it will return the first lowest matching address ID, regardless of the duplicate flag:

为了处理不正确的复制旗帜像# 12芭芭拉,或者其他错过副本没有标记,您可以跳过复制标志列检查和只使用order BY和限制子查询,这样它会返回第一个匹配的最低地址ID,无论重复标志:

UPDATE  users a
JOIN    addresses b
ON      a.AddressID = b.AddressID
SET     a.addressID = 
        (
            SELECT      addressID 
            FROM        addresses c
            WHERE       c.Address1 = b.Address1
            AND         c.Address2 = b.Address2
            AND         c.City = b.City
            AND         c.State = b.State
            AND         c.ZIP = b.ZIP
            ORDER BY    c.addressID ASC
            LIMIT       1
        )

#2


2  

You have a many-to-one relationship between users and addresses (that is multiple users can map to the same address). This seems a bit odd to me, but I suppose it could be useful. Many-to-many would make more sense, i.e. a user can have multiple addresses, but the same address can be shared by multiple users. Generally, a single user has multiple addresses. Updating your schema may help, but I digress.

用户和地址之间有多对一的关系(即多个用户可以映射到同一个地址)。这对我来说似乎有点奇怪,但我想它可能有用。多对多可能更有意义,即用户可以有多个地址,但是相同的地址可以由多个用户共享。通常,一个用户有多个地址。更新您的模式可能有帮助,但是我离题了。

UPDATE users
-- We only care about users mapped to duplicate addresses
JOIN addresses dupe ON (users.AddressID = dupe.AddressID AND dupe.Duplicate='Y')
-- If your normalizer thingy worked right, these will all map to non-duplicates
JOIN addresses nondupe ON (dupe.Address1 = nondupe.Address1
    -- Compare to other columns if you want
    AND nondupe.Duplicate = '')
-- Set to the nondupe ID
SET users.AddressID = nondupe.AddressID;

http://sqlfiddle.com/#!2/5d303/1

http://sqlfiddle.com/ ! 2/5d303/1