How do I check if there are duplicates in a flat list?

Time: 2022-06-18 20:00:55

For example, given the list ['one', 'two', 'one'], the algorithm should return True, whereas given ['one', 'two', 'three'] it should return False.

6 solutions

#1


268  

Use set() to remove duplicates if all values are hashable:

>>> your_list = ['one', 'two', 'one']
>>> len(your_list) != len(set(your_list))
True
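
If you need this check repeatedly, here is a minimal sketch of wrapping the same idea in a helper (the name has_duplicates is my own, not from the answer). Note it always builds the full set, so it never stops early at the first duplicate:

def has_duplicates(seq):
    # Builds a set of the whole sequence: O(N) time, O(N) extra space,
    # and no early exit when a duplicate appears early in the list.
    return len(seq) != len(set(seq))

print(has_duplicates(['one', 'two', 'one']))    # True
print(has_duplicates(['one', 'two', 'three']))  # False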

#2


34  

Recommended for short lists only:

any(thelist.count(x) > 1 for x in thelist)

Do not use on a long list -- it can take time proportional to the square of the number of items in the list!

For longer lists with hashable items (strings, numbers, &c):

def anydup(thelist):
    seen = set()
    for x in thelist:
        if x in seen:
            return True
        seen.add(x)
    return False

If your items are not hashable (sublists, dicts, etc.) it gets hairier, though it may still be possible to get O(N log N) if they're at least comparable -- sort first, then compare adjacent elements, as sketched below. But you need to know or test the characteristics of the items (hashable or not, comparable or not) to get the best performance you can: O(N) for hashables, O(N log N) for non-hashable comparables, and otherwise it's down to O(N^2), with nothing one can do about it :-(.
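
A minimal sketch of that O(N log N) sort-and-compare-adjacent approach for comparable but unhashable items (the helper name anydup_comparable is my own, not from the answer):

def anydup_comparable(items):
    # Sorting is O(N log N); any duplicates end up adjacent afterwards.
    s = sorted(items)
    return any(a == b for a, b in zip(s, s[1:]))

print(anydup_comparable([[1], [2], [1]]))  # True -- works on unhashable sublists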

#3


9  

This is old, but the answers here led me to a slightly different solution. If you are up for abusing comprehensions, you can get short-circuiting this way.

xs = [1, 2, 1]
s = set()
any(x in s or s.add(x) for x in xs)
# You can use a similar approach to actually retrieve the duplicates.
s = set()
duplicates = set(x for x in xs if x in s or s.add(x))
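
(This works because set.add returns None, which is falsy: x in s or s.add(x) evaluates to True only when x was already seen, and otherwise adds x to the set as a side effect.)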

#4


8  

If you are fond of the functional programming style, here is a useful function: self-documented and tested code using doctest.

from functools import reduce

def decompose(a_list):
    """Turns a list into a set of all elements and a set of duplicated elements.

    Returns a pair of sets. The first one contains elements
    that are found at least once in the list. The second one
    contains elements that appear more than once.

    >>> decompose([1,2,3,5,3,2,6])
    ({1, 2, 3, 5, 6}, {2, 3})
    """
    # u accumulates every element seen so far; d accumulates repeats.
    return reduce(
        lambda ud, o: (ud[0] | {o}, ud[1] | (ud[0] & {o})),
        a_list,
        (set(), set()))

if __name__ == "__main__":
    import doctest
    doctest.testmod()

From there you can test uniqueness by checking whether the second element of the returned pair is empty:

def is_set(l):
    """Test if there is no duplicate element in l.

    >>> is_set([1,2,3])
    True
    >>> is_set([1,2,1])
    False
    >>> is_set([])
    True
    """
    return not decompose(l)[1]

Note that this is not efficient, since you are explicitly constructing the whole decomposition. But along the same line of using reduce, you can come up with something equivalent (though slightly less efficient) to answer #5:

from functools import reduce

def is_set(l):
    try:
        def func(s, o):
            if o in s:
                raise ValueError("duplicate found")
            return s.union([o])
        reduce(func, l, set())
        return True
    except ValueError:
        return False

#5


5  

I recently answered a related question to establish all the duplicates in a list, using a generator. It has the advantage that if used just to establish 'if there is a duplicate' then you just need to get the first item and the rest can be ignored, which is the ultimate shortcut.

This is an interesting set based approach I adapted straight from moooeeeep:

def getDupes(l):
    seen = set()
    seen_add = seen.add
    for x in l:
        if x in seen or seen_add(x):
            yield x

Accordingly, a full list of dupes would be list(getDupes(etc)). To simply test whether there is a dupe, wrap it as follows:

def hasDupes(l):
    try:
        next(getDupes(l))    # Found a dupe if this doesn't raise
        return True
    except StopIteration:
        return False
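
An alternative sketch without the try/except (my own variation, not from the answer; the sentinel keeps falsy dupes such as 0 or '' from being missed):

def hasDupes(l):
    sentinel = object()
    # Two-argument next() returns the sentinel instead of raising StopIteration.
    return next(getDupes(l), sentinel) is not sentinel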

This scales well and provides consistent operating times wherever the dupe sits in the list -- I tested with lists of up to 1m entries. If you know something about the data -- specifically, that dupes are likely to show up in the first half, or anything else that skews your requirements, like needing the actual dupes -- then there are a couple of alternative dupe locators that might outperform it. The two I recommend are...

Simple dict based approach, very readable:

def getDupes(c):
    d = {}
    for i in c:
        if i in d:
            if d[i]:
                yield i
                d[i] = False
        else:
            d[i] = True
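
For illustration (my example values, not from the answer), each duplicated value is reported only once:

>>> list(getDupes([1, 2, 1, 3, 2, 1]))
[1, 2]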

Leverage itertools (essentially a tee over the sorted list, filtering zipped adjacent pairs) -- very efficient if you are getting all the dupes, though not as quick when you only need the first:

import itertools

def getDupes(c):
    a, b = itertools.tee(sorted(c))
    next(b, None)
    r = None
    # After sorting, duplicates are adjacent; keep pairs where both match.
    for k, g in filter(lambda x: x[0] == x[1], zip(a, b)):
        if k != r:
            yield k
            r = k

These were the top performers among the approaches I tried for the full dupe list, with the first dupe occurring anywhere in a 1m-element list from the start to the middle. It was surprising how little overhead the sort step added. Your mileage may vary, but here are my specific timed results:

Finding FIRST duplicate, with a single dupe placed "n" elements into a 1m-element array

Test set len change :        50 -  . . . . .  -- 0.002
Test in dict        :        50 -  . . . . .  -- 0.002
Test in set         :        50 -  . . . . .  -- 0.002
Test sort/adjacent  :        50 -  . . . . .  -- 0.023
Test sort/groupby   :        50 -  . . . . .  -- 0.026
Test sort/zip       :        50 -  . . . . .  -- 1.102
Test sort/izip      :        50 -  . . . . .  -- 0.035
Test sort/tee/izip  :        50 -  . . . . .  -- 0.024
Test moooeeeep      :        50 -  . . . . .  -- 0.001 *
Test iter*/sorted   :        50 -  . . . . .  -- 0.027

Test set len change :      5000 -  . . . . .  -- 0.017
Test in dict        :      5000 -  . . . . .  -- 0.003 *
Test in set         :      5000 -  . . . . .  -- 0.004
Test sort/adjacent  :      5000 -  . . . . .  -- 0.031
Test sort/groupby   :      5000 -  . . . . .  -- 0.035
Test sort/zip       :      5000 -  . . . . .  -- 1.080
Test sort/izip      :      5000 -  . . . . .  -- 0.043
Test sort/tee/izip  :      5000 -  . . . . .  -- 0.031
Test moooeeeep      :      5000 -  . . . . .  -- 0.003 *
Test iter*/sorted   :      5000 -  . . . . .  -- 0.031

Test set len change :     50000 -  . . . . .  -- 0.035
Test in dict        :     50000 -  . . . . .  -- 0.023
Test in set         :     50000 -  . . . . .  -- 0.023
Test sort/adjacent  :     50000 -  . . . . .  -- 0.036
Test sort/groupby   :     50000 -  . . . . .  -- 0.134
Test sort/zip       :     50000 -  . . . . .  -- 1.121
Test sort/izip      :     50000 -  . . . . .  -- 0.054
Test sort/tee/izip  :     50000 -  . . . . .  -- 0.045
Test moooeeeep      :     50000 -  . . . . .  -- 0.019 *
Test iter*/sorted   :     50000 -  . . . . .  -- 0.055

Test set len change :    500000 -  . . . . .  -- 0.249
Test in dict        :    500000 -  . . . . .  -- 0.145
Test in set         :    500000 -  . . . . .  -- 0.165
Test sort/adjacent  :    500000 -  . . . . .  -- 0.139
Test sort/groupby   :    500000 -  . . . . .  -- 1.138
Test sort/zip       :    500000 -  . . . . .  -- 1.159
Test sort/izip      :    500000 -  . . . . .  -- 0.126
Test sort/tee/izip  :    500000 -  . . . . .  -- 0.120 *
Test moooeeeep      :    500000 -  . . . . .  -- 0.131
Test iter*/sorted   :    500000 -  . . . . .  -- 0.157

#6


1  

Another way of doing this succinctly is with Counter.

To just determine if there are any duplicates in the original list:

from collections import Counter

def has_dupes(l):
    # most_common(1)[0][1] is the count of the most frequent item;
    # greater than 1 means a duplicate exists. Guard against empty lists.
    return bool(l) and Counter(l).most_common(1)[0][1] > 1

Or to get a list of items that have duplicates:

def get_dupes(l):
    return [k for k, v in Counter(l).items() if v > 1]
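
A quick usage sketch (my example values, not from the answer):

>>> xs = ['one', 'two', 'one']
>>> has_dupes(xs)
True
>>> get_dupes(xs)
['one']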
