Remove duplicates using only a MySQL query?

Time: 2023-02-05 04:41:03

I have a table with the following columns:

URL_ID    
URL_ADDR    
URL_Time

I want to remove duplicates on the URL_ADDR column using a MySQL query.

Is it possible to do such a thing without using any programming?

7 solutions

#1


31  

Consider the following test case:

CREATE TABLE mytb (url_id int, url_addr varchar(100));

INSERT INTO mytb VALUES (1, 'www.google.com');
INSERT INTO mytb VALUES (2, 'www.microsoft.com');
INSERT INTO mytb VALUES (3, 'www.apple.com');
INSERT INTO mytb VALUES (4, 'www.google.com');
INSERT INTO mytb VALUES (5, 'www.cnn.com');
INSERT INTO mytb VALUES (6, 'www.apple.com');

Where our test table now contains:

SELECT * FROM mytb;
+--------+-------------------+
| url_id | url_addr          |
+--------+-------------------+
|      1 | www.google.com    |
|      2 | www.microsoft.com |
|      3 | www.apple.com     |
|      4 | www.google.com    |
|      5 | www.cnn.com       |
|      6 | www.apple.com     |
+--------+-------------------+
6 rows in set (0.00 sec)

Then we can use the multiple-table DELETE syntax as follows:

DELETE t2
FROM   mytb t1
JOIN   mytb t2 ON (t2.url_addr = t1.url_addr AND t2.url_id > t1.url_id);

... which will delete the duplicate entries, keeping for each url_addr only the row with the lowest url_id:

SELECT * FROM mytb;
+--------+-------------------+
| url_id | url_addr          |
+--------+-------------------+
|      1 | www.google.com    |
|      2 | www.microsoft.com |
|      3 | www.apple.com     |
|      5 | www.cnn.com       |
+--------+-------------------+
4 rows in set (0.00 sec)
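
Before running the DELETE against real data, it may be worth previewing which rows the join would remove. A minimal sketch, assuming the mytb test table above: the same self-join written as a SELECT. DISTINCT is there because a row with several earlier copies would otherwise be listed once per copy.

```sql
-- Preview of the rows the multiple-table DELETE would remove.
-- Run this before the DELETE; it does not change any data.
SELECT DISTINCT t2.url_id, t2.url_addr
FROM   mytb t1
JOIN   mytb t2 ON (t2.url_addr = t1.url_addr AND t2.url_id > t1.url_id)
ORDER  BY t2.url_id;
```

On the six test rows this lists url_id 4 and 6, the two rows that are missing from the result above.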

UPDATE - Further to the new comments above:

If the duplicate URLs are not all written in the same format, you may want to apply the REPLACE() function to strip the www. or http:// parts before comparing. For example, to ignore www.:

DELETE t2
FROM   mytb t1
JOIN   mytb t2 ON (REPLACE(t2.url_addr, 'www.', '') = 
                   REPLACE(t1.url_addr, 'www.', '') AND 
                   t2.url_id > t1.url_id);
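
The query above only strips www. To also ignore an http:// prefix, the REPLACE() calls can be nested. This is only a sketch on the same mytb test table, and the choice of which parts to strip is an assumption about your data. Keep in mind that REPLACE() removes every occurrence of the substring, not just a leading one, so a host that happens to contain 'www.' in the middle would be altered too.

```sql
-- Compare URLs after removing both 'http://' and 'www.'.
-- Sketch only: adjust the stripped substrings to match your own data.
DELETE t2
FROM   mytb t1
JOIN   mytb t2
  ON   REPLACE(REPLACE(t2.url_addr, 'http://', ''), 'www.', '')
     = REPLACE(REPLACE(t1.url_addr, 'http://', ''), 'www.', '')
 AND   t2.url_id > t1.url_id;
```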

#2


8  

You may want to try the method mentioned at http://labs.creativecommons.org/2010/01/12/removing-duplicate-rows-in-mysql/. Note that the IGNORE clause of ALTER TABLE was removed in MySQL 5.7.4, so this only works on older servers.

ALTER IGNORE TABLE your_table ADD UNIQUE INDEX `tmp_index` (URL_ADDR);
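
On a server where ALTER IGNORE TABLE is not available, a similar effect can probably be had by copying the rows into a new table that already has the unique index. This is a rough sketch, not the method from the linked article: your_table_dedup and your_table_old are hypothetical names, and it assumes URL_ADDR is short enough to be indexed in full and that nothing else writes to the table during the copy.

```sql
-- Empty copy of the original table, with the same columns and indexes.
CREATE TABLE your_table_dedup LIKE your_table;

-- Unique index that rejects a second row with the same URL_ADDR.
ALTER TABLE your_table_dedup ADD UNIQUE INDEX tmp_index (URL_ADDR);

-- INSERT IGNORE skips rows that would violate the unique index.
-- The ORDER BY is intended to make the lowest URL_ID the one that is kept.
INSERT IGNORE INTO your_table_dedup
SELECT * FROM your_table ORDER BY URL_ID;

-- Swap the tables; the original is kept as your_table_old as a backup.
RENAME TABLE your_table TO your_table_old,
             your_table_dedup TO your_table;
```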

#3


5  

This will keep the row with the highest URL_ID for each URL_ADDR:

DELETE FROM your_table
WHERE URL_ID NOT IN 
    (SELECT ID FROM 
       (SELECT MAX(URL_ID) AS ID 
        FROM your_table 
        WHERE URL_ID IS NOT NULL
        GROUP BY URL_ADDR ) X);
/* your_table is a placeholder (TABLE itself is a reserved word).
   Sounds like you would need to GROUP BY a calculated form, e.g. using
   REPLACE to strip out www; see Daniel's answer. */

(The derived table 'X' is to avoid the error "You can't specify target table 'tablename' for update in FROM clause")
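
The "calculated form" mentioned in the comment could look something like the following. It is only a sketch: your_table is a placeholder name and stripping 'www.' is an assumed normalisation rule, borrowed from Daniel's answer.

```sql
-- Keep the highest URL_ID per normalised address, delete the rest.
-- The derived table X is kept so the DELETE target is not read directly.
DELETE FROM your_table
WHERE URL_ID NOT IN
    (SELECT ID FROM
       (SELECT MAX(URL_ID) AS ID
        FROM your_table
        WHERE URL_ID IS NOT NULL
        GROUP BY REPLACE(URL_ADDR, 'www.', '')) X);
```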

#4


3  

Well, you could always:

  1. create a temporary table;
  2. INSERT INTO ... SELECT DISTINCT into the temp table from the original table;
  3. clear the original table;
  4. INSERT INTO ... SELECT into the original table from the temp table;
  5. drop the temp table.

It's clumsy and awkward, and requires several queries (not to mention privileges), but it will do the trick if you don't find another solution.
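
The steps above could be sketched roughly as follows, assuming the mytb(url_id, url_addr) test table from answer #1; mytb_dedup is a hypothetical name for the temporary table. One deviation from the list: a plain SELECT DISTINCT over all columns would keep every row here, because url_id differs between the copies, so this sketch groups on url_addr and keeps the lowest url_id instead.

```sql
-- 1. Create a temporary table with the same structure.
CREATE TEMPORARY TABLE mytb_dedup LIKE mytb;

-- 2. Copy one row per url_addr into it (lowest url_id wins).
INSERT INTO mytb_dedup (url_id, url_addr)
SELECT MIN(url_id), url_addr
FROM   mytb
GROUP  BY url_addr;

-- 3. Clear the original table.
DELETE FROM mytb;

-- 4. Copy the de-duplicated rows back.
INSERT INTO mytb (url_id, url_addr)
SELECT url_id, url_addr
FROM   mytb_dedup;

-- 5. Drop the temporary table.
DROP TEMPORARY TABLE mytb_dedup;
```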

#5


1  

@Daniel Vassallo, how would this work for multiple columns?

DELETE t2
FROM   directory1 t1
JOIN   directory1 t2 ON (t2.page = t1.page AND
                         t2.parentTopic = t1.parentTopic AND
                         t2.title = t1.title AND
                         t2.description = t1.description AND
                         t2.linktype = t1.linktype AND
                         t2.priority = t1.priority AND
                         t2.linkID > t1.linkID);

Maybe like this?

#6


0  

You can GROUP BY the URL_ADDR column, which will effectively give you one row per distinct URL_ADDR value.

select 
 URL_ID,
 URL_ADDR,
 URL_Time
from
 some_table
group by
 URL_ADDR

Enjoy!

#7


0  

This will work provided that your URL_ID column is unique.

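DELETE a
FROM url a INNER JOIN (
    SELECT URL_ADDR, MAX(URL_ID) MaxURLId 
    FROM url
    GROUP BY URL_ADDR
    HAVING COUNT(*) > 1) b ON a.URL_ID <> b.MaxURLId AND a.URL_ADDR = b.URL_ADDR;
-- Written as a multiple-table DELETE: selecting from url inside
-- WHERE URL_ID IN (...) raises "You can't specify target table 'url'
-- for update in FROM clause".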
