Identifying non-identical duplicates in a MySQL table

Time: 2022-08-04 13:06:43

I have a table set up as follows

id
origin
destination
carrier_id

so a typical row could be:

100: London    Manchester  366

Now each route goes both ways, so there shouldn't be a row like this

233: Manchester    London    366

since that's essentially the same route (for my purposes anyway)

Unfortunately though, I have wound up with a handful of duplicates. I have over 50,000 routes made up of around 2,000 points of origin (or destination, however you want to look at it) in the table. So I'm thinking looping through each point of origin to find duplicates would be insane.

So I don't even know where to start trying to figure out a query to identify them. Any ideas?

3 Answers

#1


I think you just need a self-join; the following will identify all the "duplicate" records joined together.

Here's an example.

Say SELECT * FROM FLIGHTS yielded:

id  origin     destination  carrierid
1   toronto    quebec       1
2   quebec     toronto      2
3   edmonton   calgary      3
4   calgary    edmonton     4
5   hull       vancouver    5
6   vancouver  edmonton     6
7   edmonton   toronto      7
9   edmonton   quebec       8
10  toronto    edmonton     9
11  quebec     edmonton     10
12  calgary    lethbridge   11

So there's a bunch of duplicates (4 of the routes are duplicates of some other route).

select  *
from    flights t1 inner join flights t2 on t1.origin = t2.destination 
        AND t2.origin = t1.destination

would yield just the duplicates:

id  origin    destination  carrierid   id  origin    destination  carrierid
1   toronto   quebec       1           2   quebec    toronto      2
2   quebec    toronto      2           1   toronto   quebec       1
3   edmonton  calgary      3           4   calgary   edmonton     4
4   calgary   edmonton     4           3   edmonton  calgary      3
7   edmonton  toronto      7           10  toronto   edmonton     9
9   edmonton  quebec       8           11  quebec    edmonton     10
10  toronto   edmonton     9           7   edmonton  toronto      7
11  quebec    edmonton     10          9   edmonton  quebec       8
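
Note that each pair shows up twice in that output (once in each direction). If you only want to see each duplicate pair once, a small tweak like the following should work (just a sketch against the same example table; the id comparison returns only one direction of each pair):

select  t1.id, t1.origin, t1.destination, t1.carrierid,
        t2.id as duplicate_id    -- the mirror row's id, aliased for readability
from    flights t1 inner join flights t2 on t1.origin = t2.destination
        AND t2.origin = t1.destination
where   t1.id < t2.id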

At that point you can just delete one row from each pair, for example the one that occurred first (the lower id):

delete from flights
where id in (
    select id from (
        select  t1.id
        from    flights t1 inner join flights t2 on t1.origin = t2.destination
                AND t2.origin = t1.destination
        where   t1.id < t2.id    -- only the lower id of each pair is removed
    ) as dups                    -- derived table works around MySQL's restriction
                                 -- on selecting from the table being deleted from
)
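
Before running the delete, it may be worth sanity-checking how many rows would be removed, e.g. by running the same join as a plain SELECT first:

-- should equal the number of duplicate pairs, not the total number of duplicated rows
select  count(*)
from    flights t1 inner join flights t2 on t1.origin = t2.destination
        AND t2.origin = t1.destination
where   t1.id < t2.id;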

Good luck!

#2


Bummer! Off the top of my head (and in pseudo-SQL):

select * from (
  select id, concat(origin, '_', destination, '_', carrier_id) as route_key from ....
  union
  select id, concat(destination, '_', origin, '_', carrier_id) as route_key from ....
) as both_directions
group by route_key having count(*) > 1;

For the records above, you'd end up with:

100, London_Manchester_366
100, Manchester_London_366
233, Manchester_London_366
233, London_Manchester_366

That's really, really hackish, and doesn't give you exactly what you're after - it only narrows it down. Maybe it'll give you a starting point? Maybe it'll give someone else an idea they can build on to help you too.
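
A less hackish variant of the same idea (normalise the two endpoints, then group) is to use MySQL's LEAST() and GREATEST() instead of string concatenation. A sketch, assuming the routes live in a table called routes (hypothetical name, substitute your own) with the columns from the question:

select  least(origin, destination)    as point_a,
        greatest(origin, destination) as point_b,
        carrier_id,
        group_concat(id)              as ids,          -- all ids sharing this normalised route
        count(*)                      as occurrences
from    routes                                         -- hypothetical table name
group by least(origin, destination), greatest(origin, destination), carrier_id
having  count(*) > 1;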

#3


If you don't mind a little shell scripting, and if you can get a dump of the input in the form you've shown here... and here's my sample input:

100: London Manchester 366
121: London CityA 240
144: Manchester CityA 300
150: CityA CityB 90
233: Manchester London 366

You might be able to do something like this:

$ cat m.txt | awk '{ if ($2 < $3) print $2, $3, $1; else print $3, $2, $1}' | sort
CityA CityB 150:
CityA London 121:
CityA Manchester 144:
London Manchester 100:
London Manchester 233:

So that you at least have the pairs grouped together. Not sure what would be the best move from there.


Okay, here's a beast of a command line:

$ cat m.txt | awk '{ if ($2 < $3) print $2, $3, $1; else print $3, $2, $1}' | (sort; echo "") | awk '{ if (fst == $1 && snd == $2) { printf "%s%s", num, $3 } else { print fst, snd; fst = $1; snd = $2; num = $3} }' | grep "^[0-9]"
150:151:150:255:CityA CityB
100:233:London Manchester

where m.txt has these new contents:

100: London Manchester 366
121: London CityA 240
144: Manchester CityA 300
150: CityA CityB 90
151: CityB CityA 90
233: Manchester London 366
255: CityA CityB 90

Perl probably would have been a better choice than awk, but here goes: First we sort the two city names and put the ID at the end of the string, which I did in the first section. Then we sort those to group pairs together, and we have to tack on an extra line for the awk script to finish up. Then, we loop over each line in the file. If we see a new pair of cities, we print the cities we previously saw, and we store the new cities and the new ID. If we see the same cities we saw last time, then we print out the ID of the previous line and the ID of this line. Finally, we grep only lines beginning with a number so that we discard non-duplicated pairs.

If a pair occurs more than twice, you'll get a duplicate ID, but that's not such a big deal.

Clear as mud?
