Remove unique values from a list and keep only the duplicates

Time: 2022-06-28 23:41:26

I'm looking to run over a list of ids and return a list of any ids that occurred more than once. This is what I set up, and it works:

singles = list(ids)
duplicates = []
while len(singles) > 0:
    elem = singles.pop()
    if elem in singles:
        duplicates.append(elem)

But the ids list is likely to get quite long, and I realistically don't want a while loop predicated on an expensive len call if I can avoid it. (I could go the inelegant route and call len once, then just decrement it every iteration but I'd rather avoid that if I could).

4 solutions

#1


12  

The smart way to do this is to use a data structure that makes it easy and efficient, like Counter:

>>> import random
>>> ids = [random.randrange(100) for _ in range(200)]
>>> from collections import Counter
>>> counts = Counter(ids)
>>> dupids = [id for id in ids if counts[id] > 1]

Building the Counter takes O(N) time, as opposed to O(N log N) time for sorting, or O(N^2) for counting each element from scratch every time.
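Note that a comprehension over ids keeps every occurrence of a repeated id. If each duplicated id should appear only once, a minimal sketch could iterate over the counter itself instead. The helper name duplicated_ids is made up for illustration, and the ids are assumed to be hashable:

```python
from collections import Counter

def duplicated_ids(ids):
    """Return each id that occurs more than once, in first-seen order."""
    counts = Counter(ids)  # one O(N) pass to tally every id
    # Counter is a dict subclass, so it iterates in insertion order (Python 3.7+)
    return [value for value, n in counts.items() if n > 1]

print(duplicated_ids([1, 2, 3, 2, 3, 5, 3]))  # [2, 3]
```

Iterating over counts rather than ids is what removes the repeats from the result.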


As a side note:

But the ids list is likely to get quite long, and I realistically don't want a while loop predicated on an expensive len call if I can avoid it.

len is not expensive. It's constant time, and (at least on builtin types like list) it's about as fast as a function can possibly get in Python short of doing nothing at all.

The part of your code that's expensive is the elem in singles test inside the loop: for every element, you have to compare it against potentially every other element, which means quadratic time.
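One way to avoid the quadratic membership test is to keep the already-seen ids in a set, where a lookup takes constant time on average. The following is only a sketch under the assumption that the ids are hashable; the function name find_duplicates is hypothetical:

```python
def find_duplicates(ids):
    """Return the set of ids that appear more than once in `ids`."""
    seen = set()        # ids encountered so far
    duplicates = set()  # ids encountered at least twice
    for elem in ids:
        if elem in seen:       # average O(1) lookup, unlike `elem in some_list`
            duplicates.add(elem)
        else:
            seen.add(elem)
    return duplicates

print(sorted(find_duplicates([1, 2, 3, 2, 3, 5])))  # [2, 3]
```

The whole loop is a single pass over the input, so the total work grows linearly with the length of the list.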

#2


5  

You could do it like this:

>>> ids = [1,2,3,2,3,5]
>>> set(i for i in ids if ids.count(i) > 1)
{2, 3}

#3


1  

I presume this will work faster:

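occasions = {}
for id in ids:
    try:
        occasions[id] += 1
    except KeyError:
        # first time this id is seen: it has occurred once, not zero times
        occasions[id] = 1
# keep every id whose total count is greater than one
result = [id for id in ids if occasions[id] > 1]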

#4


-1  

If you don't care about the order in which these ids are retrieved, an efficient approach would consist of a sorting step (which is O(N log(N))) followed by keeping the ids that are immediately followed by an equal id (which is O(N)). So this approach is O(N log(N)) overall.
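The sort-then-scan idea described above might look roughly like this. It is a sketch that assumes the ids can be compared with each other (so that sorted works); the function name duplicates_by_sorting is invented for the example:

```python
def duplicates_by_sorting(ids):
    """Return each repeated id once, in sorted order."""
    ordered = sorted(ids)  # O(N log N); equal ids end up next to each other
    result = []
    # compare every element with its successor in the sorted list
    for prev, curr in zip(ordered, ordered[1:]):
        # record a repeated id only once, even if it occurs three or more times
        if prev == curr and (not result or result[-1] != curr):
            result.append(curr)
    return result

print(duplicates_by_sorting([3, 1, 2, 3, 2, 5, 3]))  # [2, 3]
```

Unlike the hash-based approaches, this one does not need the ids to be hashable, only orderable.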
