How do you remove duplicates from a list whilst preserving order?

Time: 2022-11-11 04:41:26

Is there a built-in that removes duplicates from a list in Python, whilst preserving order? I know that I can use a set to remove duplicates, but that destroys the original order. I also know that I can roll my own like this:

def uniq(input):
    output = []
    for x in input:
        if x not in output:
            output.append(x)
    return output

(Thanks to unwind for that code sample.)

But I'd like to avail myself of a built-in or a more Pythonic idiom if possible.

Related question: In Python, what is the fastest algorithm for removing duplicates from a list so that all elements are unique while preserving order?

28 Answers

#1


653  

Here you have some alternatives: http://www.peterbe.com/plog/uniqifiers-benchmark

Fastest one:

def f7(seq):
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]

Why assign seen.add to seen_add instead of just calling seen.add? Python is a dynamic language, and resolving seen.add each iteration is more costly than resolving a local variable. seen.add could have changed between iterations, and the runtime isn't smart enough to rule that out. To play it safe, it has to check the object each time.

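If you want to measure the difference yourself, a minimal timeit sketch along these lines should work (the exact numbers vary by machine and Python version):

import timeit

setup = "seq = list(range(1000)) * 2"

prebound = timeit.timeit(
    "seen = set(); seen_add = seen.add; "
    "[x for x in seq if not (x in seen or seen_add(x))]",
    setup=setup, number=1000)
attribute = timeit.timeit(
    "seen = set(); [x for x in seq if not (x in seen or seen.add(x))]",
    setup=setup, number=1000)
print(prebound, attribute)  # the pre-bound version is usually slightly faster
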
If you plan on using this function a lot on the same dataset, perhaps you would be better off with an ordered set: http://code.activestate.com/recipes/528878/

O(1) insertion, deletion and member-check per operation.

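The recipe lives at an external link; as a rough sketch of the same idea on Python 3.7+ (using dict insertion order, not the linked recipe), an ordered-set-like wrapper could look like this:

class OrderedSetSketch:
    """Minimal ordered-set-like wrapper; relies on dict insertion order (Python 3.7+)."""
    def __init__(self, iterable=()):
        self._d = dict.fromkeys(iterable)
    def add(self, x):
        self._d[x] = None          # O(1) amortized insert, keeps first-insertion order
    def discard(self, x):
        self._d.pop(x, None)       # O(1) delete
    def __contains__(self, x):
        return x in self._d        # O(1) membership check
    def __iter__(self):
        return iter(self._d)
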
#2


297  

Edit 2016

As Raymond pointed out, in Python 3.5+ where OrderedDict is implemented in C, the list comprehension approach will be slower than OrderedDict (unless you actually need the list at the end - and even then, only if the input is very short). So the best solution for 3.5+ is OrderedDict.

Important Edit 2015

As @abarnert notes, the more_itertools library (pip install more_itertools) contains a unique_everseen function that is built to solve this problem without any unreadable (not seen.add) mutations in list comprehensions. It is also the fastest solution:

>>> from more_itertools import unique_everseen
>>> items = [1, 2, 0, 1, 3, 2]
>>> list(unique_everseen(items))
[1, 2, 0, 3]

Just one simple library import and no hacks. This comes from an implementation of the itertools recipe unique_everseen, which looks like:

from itertools import filterfalse  # on Python 2: from itertools import ifilterfalse

def unique_everseen(iterable, key=None):
    "List unique elements, preserving order. Remember all elements ever seen."
    # unique_everseen('AAAABBBCCDAABBB') --> A B C D
    # unique_everseen('ABBCcAD', str.lower) --> A B C D
    seen = set()
    seen_add = seen.add
    if key is None:
        for element in filterfalse(seen.__contains__, iterable):
            seen_add(element)
            yield element
    else:
        for element in iterable:
            k = key(element)
            if k not in seen:
                seen_add(k)
                yield element

In Python 2.7+, the accepted common idiom for this uses collections.OrderedDict (it works but isn't optimized for speed; I would now use unique_everseen):

Runtime: O(N)

>>> from collections import OrderedDict
>>> items = [1, 2, 0, 1, 3, 2]
>>> list(OrderedDict.fromkeys(items))
[1, 2, 0, 3]

This looks much nicer than:

seen = set()
[x for x in seq if x not in seen and not seen.add(x)]

and doesn't utilize the ugly hack:

not seen.add(x)

which relies on the fact that set.add is an in-place method that always returns None so not None evaluates to True.

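A quick interactive check of that behaviour:

>>> seen = set()
>>> print(seen.add(1))   # add() mutates the set in place and returns None
None
>>> not seen.add(2)      # not None evaluates to True, so the filter test passes
True
>>> seen
{1, 2}
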
Note however that the hack solution is faster in raw speed though it has the same runtime complexity O(N).

#3


62  

In Python 2.7, the new way of removing duplicates from an iterable while keeping it in the original order is:

>>> from collections import OrderedDict
>>> list(OrderedDict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']

In Python 3.5, the OrderedDict has a C implementation. My timings show that this is now both the fastest and shortest of the various approaches for Python 3.5.

In Python 3.6, the regular dict became both ordered and compact. (This holds for CPython and PyPy, but may not be present in other implementations.) That gives us a new fastest way of deduping while retaining order:

>>> list(dict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']

In Python 3.7, the regular dict is guaranteed to be ordered across all implementations. So, the shortest and fastest solution is:

>>> list(dict.fromkeys('abracadabra'))
['a', 'b', 'r', 'c', 'd']

Response to @max: Once you move to 3.6 or 3.7 and use the regular dict instead of OrderedDict, you can't really beat the performance in any other way. The dictionary is dense and readily converts to a list with almost no overhead. The target list is pre-sized to len(d), which saves all the resizes that occur in a list comprehension. Also, since the internal key list is dense, copying the pointers is almost as fast as a list copy.

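A rough sketch for checking that claim on your own machine (numbers will vary by interpreter and version):

import timeit

setup = "data = list(range(1000)) * 2"
print(timeit.timeit("list(dict.fromkeys(data))", setup=setup, number=2000))
print(timeit.timeit("list(OrderedDict.fromkeys(data))",
                    setup="from collections import OrderedDict; " + setup,
                    number=2000))
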
#4


39  

sequence = ['1', '2', '3', '3', '6', '4', '5', '6']
unique = []
[unique.append(item) for item in sequence if item not in unique]

unique → ['1', '2', '3', '6', '4', '5']

#5


23  

from itertools import groupby
[key for key, _ in groupby(sortedList)]

The list doesn't even have to be sorted, the sufficient condition is that equal values are grouped together.

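For example, only consecutive duplicates are collapsed:

>>> from itertools import groupby
>>> [key for key, _ in groupby([1, 1, 2, 2, 1])]
[1, 2, 1]
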
Edit: I assumed that "preserving order" implies that the list is actually ordered. If this is not the case, then the solution from MizardX is the right one.

Community edit: This is however the most elegant way to "compress duplicate consecutive elements into a single element".

#6


18  

I think if you want to maintain the order, you can try this:

list1 = ['b','c','d','b','c','a','a']
list2 = list(set(list1))
list2.sort(key=list1.index)
print list2

Or, similarly, you can do this:

list1 = ['b','c','d','b','c','a','a']
list2 = sorted(set(list1), key=list1.index)
print list2

You can also do this:

list1 = ['b','c','d','b','c','a','a']
list2 = []
for i in list1:
    if not i in list2:
        list2.append(i)
print list2

It can also be written like this:

list1 = ['b','c','d','b','c','a','a']
list2 = []
[list2.append(i) for i in list1 if not i in list2]
print list2

#7


11  

For another very late answer to another very old question:

The itertools recipes have a function that does this, using the seen set technique, but:

  • Handles a standard key function.

  • Uses no unseemly hacks.

  • Optimizes the loop by pre-binding seen.add instead of looking it up N times. (f7 also does this, but some versions don't.)

  • Optimizes the loop by using ifilterfalse, so you only have to loop over the unique elements in Python, instead of all of them. (You still iterate over all of them inside ifilterfalse, of course, but that's in C, and much faster.)

Is it actually faster than f7? It depends on your data, so you'll have to test it and see. If you want a list in the end, f7 uses a listcomp, and there's no way to do that here. (You can directly append instead of yielding, or you can feed the generator into the list function, but neither one can be as fast as the LIST_APPEND inside a listcomp.) At any rate, usually, squeezing out a few microseconds is not going to be as important as having an easily-understandable, reusable, already-written function that doesn't require DSU when you want to decorate.

As with all of the recipes, it's also available in more-itertools.

If you just want the no-key case, you can simplify it as:

import itertools

def unique(iterable):
    seen = set()
    seen_add = seen.add
    for element in itertools.ifilterfalse(seen.__contains__, iterable):
        seen_add(element)
        yield element

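For instance, a quick check of this simplified version (on Python 2, where itertools.ifilterfalse exists; Python 3 renamed it to itertools.filterfalse):

>>> list(unique('AAAABBBCCDAABBB'))
['A', 'B', 'C', 'D']
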
#8


10  

Just to add another (very performant) implementation of such a functionality from an external module¹: iteration_utilities.unique_everseen:

>>> from iteration_utilities import unique_everseen
>>> lst = [1, 1, 1, 2, 3, 2, 2, 2, 1, 3, 4]
>>> list(unique_everseen(lst))
[1, 2, 3, 4]

Timings

I did some timings (Python 3.6) and these show that it's faster than all other alternatives I tested, including OrderedDict.fromkeys, f7 and more_itertools.unique_everseen:

%matplotlib notebook
from iteration_utilities import unique_everseen
from collections import OrderedDict
from more_itertools import unique_everseen as mi_unique_everseen

def f7(seq):
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]

def iteration_utilities_unique_everseen(seq):
    return list(unique_everseen(seq))

def more_itertools_unique_everseen(seq):
    return list(mi_unique_everseen(seq))

def odict(seq):
    return list(OrderedDict.fromkeys(seq))

from simple_benchmark import benchmark

b = benchmark([f7, iteration_utilities_unique_everseen, more_itertools_unique_everseen, odict],
              {2**i: list(range(2**i)) for i in range(1, 20)},
              'list size (no duplicates)')
b.plot()

[benchmark plot: list size (no duplicates)]

And just to make sure, I also did a test with more duplicates, to check if it makes a difference:

import random

b = benchmark([f7, iteration_utilities_unique_everseen, more_itertools_unique_everseen, odict],
              {2**i: [random.randint(0, 2**(i-1)) for _ in range(2**i)] for i in range(1, 20)},
              'list size (lots of duplicates)')
b.plot()

[benchmark plot: list size (lots of duplicates)]

And one containing only one value:

b = benchmark([f7, iteration_utilities_unique_everseen, more_itertools_unique_everseen, odict],
              {2**i: [1]*(2**i) for i in range(1, 20)},
              'list size (only duplicates)')
b.plot()

[benchmark plot: list size (only duplicates)]

In all of these cases the iteration_utilities.unique_everseen function is the fastest (on my computer).


This iteration_utilities.unique_everseen function can also handle unhashable values in the input (albeit with O(n*n) performance instead of the O(n) performance available when the values are hashable).

>>> lst = [{1}, {1}, {2}, {1}, {3}]
>>> list(unique_everseen(lst))
[{1}, {2}, {3}]

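If the unhashable values have a hashable representation, passing it as the key argument should restore the O(n) behaviour (a sketch, assuming key works as in the itertools recipe):

>>> list(unique_everseen([[1], [2], [1]], key=tuple))
[[1], [2]]
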
¹ Disclaimer: I'm the author of that package.

#9


10  

Not to kick a dead horse (this question is very old and already has lots of good answers), but here is a solution using pandas that is quite fast in many circumstances and is dead simple to use.

>>> import pandas as pd
>>> my_list = [0, 1, 2, 3, 4, 1, 2, 3, 5]
>>> pd.Series(my_list).drop_duplicates().tolist()
[0, 1, 2, 3, 4, 5]

#10


6  

For non-hashable types (e.g. a list of lists), based on MizardX's:

def f7_noHash(seq):
    seen = set()
    return [x for x in seq if str(x) not in seen and not seen.add(str(x))]

#11


3  

Borrowing the recursive idea used in defining Haskell's nub function for lists, this would be a recursive approach:

def unique(lst):
    return [] if lst == [] else [lst[0]] + unique(filter(lambda x: x != lst[0], lst[1:]))

e.g.:

In [118]: unique([1,5,1,1,4,3,4])
Out[118]: [1, 5, 4, 3]

I tried it for growing data sizes and saw sub-linear time-complexity (not definitive, but suggests this should be fine for normal data).

In [122]: %timeit unique(np.random.randint(5, size=(1)))
10000 loops, best of 3: 25.3 us per loop
In [123]: %timeit unique(np.random.randint(5, size=(10)))
10000 loops, best of 3: 42.9 us per loop
In [124]: %timeit unique(np.random.randint(5, size=(100)))
10000 loops, best of 3: 132 us per loop
In [125]: %timeit unique(np.random.randint(5, size=(1000)))
1000 loops, best of 3: 1.05 ms per loop
In [126]: %timeit unique(np.random.randint(5, size=(10000)))
100 loops, best of 3: 11 ms per loop

I also think it's interesting that this could be readily generalized to uniqueness by other operations. Like this:

import operator

def unique(lst, cmp_op=operator.ne):
    return [] if lst == [] else [lst[0]] + unique(filter(lambda x: cmp_op(x, lst[0]), lst[1:]), cmp_op)

For example, you could pass in a function that uses the notion of rounding to the same integer as if it was "equality" for uniqueness purposes, like this:

def test_round(x, y):
    return round(x) != round(y)

then unique(some_list, test_round) would provide the unique elements of the list where uniqueness no longer meant traditional equality (which is implied by using any sort of set-based or dict-key-based approach to this problem) but instead meant to take only the first element that rounds to K for each possible integer K that the elements might round to, e.g.:

In [6]: unique([1.2, 5, 1.9, 1.1, 4.2, 3, 4.8], test_round)
Out[6]: [1.2, 5, 1.9, 4.2, 3]

#12


3  

A reduce variant that is about 5× faster, but more sophisticated:

>>> l = [5, 6, 6, 1, 1, 2, 2, 3, 4]
>>> reduce(lambda r, v: v in r[1] and r or (r[0].append(v) or r[1].add(v)) or r, l, ([], set()))[0]
[5, 6, 1, 2, 3, 4]

Explanation:

default = (list(), set())
# use list to keep order
# use set to make lookup faster

def reducer(result, item):
    if item not in result[1]:
        result[0].append(item)
        result[1].add(item)
    return result

>>> reduce(reducer, l, default)[0]
[5, 6, 1, 2, 3, 4]

#13


3  

You can reference a list comprehension as it is being built by the symbol '_[1]'.
For example, the following function unique-ifies a list of elements without changing their order, by referencing its own list comprehension. (This relies on an undocumented CPython 2 implementation detail, so it is best treated as a curiosity.)

def unique(my_list):
    return [x for x in my_list if x not in locals()['_[1]']]

Demo:

l1 = [1, 2, 3, 4, 1, 2, 3, 4, 5]
l2 = [x for x in l1 if x not in locals()['_[1]']]
print l2

Output:

[1, 2, 3, 4, 5]

#14


2  

MizardX's answer gives a good collection of multiple approaches.

This is what I came up with while thinking aloud:

mylist = [x for i,x in enumerate(mylist) if x not in mylist[i+1:]]

(Note that this keeps the last occurrence of each duplicate rather than the first.)

#15


1  

You could do a sort of ugly list comprehension hack.

[l[i] for i in range(len(l)) if l.index(l[i]) == i]

#16


1  

A relatively efficient approach for _sorted_ numpy arrays:

import numpy as np

b = np.array([1, 3, 3, 8, 12, 12, 12])
np.hstack([b[0], [x[0] for x in zip(b[1:], b[:-1]) if x[0] != x[1]]])

Outputs:

array([ 1,  3,  8, 12])

#17


1  

l = [1, 2, 2, 3, 3, ...]
n = []
n.extend(ele for ele in l if ele not in set(n))

A generator expression that uses a set membership test to determine whether or not to include an element in the new list. (Note that set(n) is rebuilt from the growing list on every iteration, so each lookup is O(1) only after an O(len(n)) set construction; keeping one seen set around, as in the accepted answer, avoids that cost.)

#18


1  

A simple recursive solution:

def uniquefy_list(a):
    return uniquefy_list(a[1:]) if a[0] in a[1:] else [a[0]] + uniquefy_list(a[1:]) if len(a) > 1 else [a[0]]

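A quick check (note that this variant keeps the last occurrence of each duplicate and assumes a non-empty input list):

>>> uniquefy_list([1, 2, 1, 3])
[2, 1, 3]
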
#19


1  

In Python 3.7 and above, dictionaries are guaranteed to remember their key insertion order. The answer to this question summarizes the current state of affairs.

The OrderedDict solution thus becomes obsolete, and without any import statements we can simply issue:

>>> list(dict.fromkeys([1, 2, 1, 3, 3, 2, 4]).keys())
[1, 2, 3, 4]

#20


0  

If you need a one-liner, then maybe this would help:

reduce(lambda x, y: x + y if y[0] not in x else x, map(lambda x: [x],lst))

... should work, but correct me if I'm wrong.

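One caveat: on Python 3, reduce lives in functools. A runnable sketch of the same expression:

from functools import reduce

lst = [5, 6, 6, 1, 2, 2]
result = reduce(lambda x, y: x + y if y[0] not in x else x,
                map(lambda x: [x], lst))
print(result)  # [5, 6, 1, 2]
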
#21


0  

If you routinely use pandas, and aesthetics is preferred over performance, then consider the built-in function pandas.Series.drop_duplicates:

import pandas as pd
import numpy as np

uniquifier = lambda alist: pd.Series(alist).drop_duplicates().tolist()

# from the chosen answer
def f7(seq):
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]

alist = np.random.randint(low=0, high=1000, size=10000).tolist()
print uniquifier(alist) == f7(alist)  # True

Timing:

In [104]: %timeit f7(alist)
1000 loops, best of 3: 1.3 ms per loop
In [110]: %timeit uniquifier(alist)
100 loops, best of 3: 4.39 ms per loop

#22


0  

This will preserve order and run in O(n) time. Basically, the idea is to create a hole wherever a duplicate is found and sink it down to the bottom. It makes use of a read pointer and a write pointer. Whenever a duplicate is found, only the read pointer advances, and the write pointer stays on the duplicate entry to overwrite it.

def deduplicate(l):
    count = {}
    (read, write) = (0, 0)
    while read < len(l):
        if l[read] in count:
            read += 1
            continue
        count[l[read]] = True
        l[write] = l[read]
        read += 1
        write += 1
    return l[0:write]

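For example (the front of the input list is overwritten in place; the returned slice is the deduplicated prefix):

>>> deduplicate([5, 6, 6, 1, 2, 2, 3])
[5, 6, 1, 2, 3]
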
#23


0  

A solution without using imported modules or sets:

text = "ask not what your country can do for you ask what you can do for your country"sentence = text.split(" ")noduplicates = [(sentence[i]) for i in range (0,len(sentence)) if sentence[i] not in sentence[:i]]print(noduplicates)

Gives output:

['ask', 'not', 'what', 'your', 'country', 'can', 'do', 'for', 'you']

#24


-1  

Because I was looking at a dup and collected some related but different, useful information that isn't part of the other answers, here are two other possible solutions.

.get(True) XOR .setdefault(False)

The first is very much like the accepted seen_add solution, but with explicit side effects, using the dictionary's get(x,<default>) and setdefault(x,<default>):

# Explanation of d.get(x,True) != d.setdefault(x,False)
#
# x in d | d[x]  | A = d.get(x,True) | x in d | B = d.setdefault(x,False) | x in d | d[x]  | A xor B
# False  | None  | True          (1) | False  | False                 (2) | True   | False | True
# True   | False | False         (3) | True   | False                 (4) | True   | False | False
#
# Notes
# (1) x is not in the dictionary, so get(x,<default>) returns True but does __not__ add the value to the dictionary
# (2) x is not in the dictionary, so setdefault(x,<default>) adds {x: False} and returns False
# (3) since x is in the dictionary, the <default> argument is ignored, and the value of the key is returned, which was
#     set to False in (2)
# (4) since the key is already in the dictionary, its value is returned directly and the argument is ignored
#
# A != B is how to do boolean XOR in Python
#
def sort_with_order(s):
    d = dict()
    return [x for x in s if d.get(x,True) != d.setdefault(x,False)]

get(x,<default>) returns <default> if x is not in the dictionary, but does not add the key to the dictionary. set(x,<default>) returns the value if the key is in the dictionary, otherwise sets it to and returns <default>.

Aside: a != b is how to do an XOR in Python.

Overriding __missing__ (inspired by this answer)

The second technique is overriding the __missing__ method that gets called when the key doesn't exist in a dictionary, which is only called when using d[k] notation:

class Tracker(dict):
    # returns True if missing, otherwise sets the value to False
    # so next time d[key] is called, the value False will be returned
    # and __missing__ will not be called again
    def __missing__(self, key):
        self[key] = False
        return True

t = Tracker()
unique_with_order = [x for x in samples if t[x]]

From the docs:

New in version 2.5: If a subclass of dict defines a method __missing__(), if the key key is not present, the d[key] operation calls that method with the key key as argument. The d[key] operation then returns or raises whatever is returned or raised by the __missing__(key) call if the key is not present. No other operations or methods invoke __missing__(). If __missing__() is not defined, KeyError is raised. __missing__() must be a method; it cannot be an instance variable. For an example, see collections.defaultdict.

#25


-1  

Here is my 2 cents on this:

def unique(nums):
    unique = []
    for n in nums:
        if n not in unique:
            unique.append(n)
    return unique

Regards, Yuriy

#26


-1  

This is the smartest way to remove duplicates from a list in Python whilst preserving its order; you can even do it in one line of code:

a_list = ["a", "b", "a", "c"]sorted_list = [x[0] for x in (sorted({x:a_list.index(x) for x in set(a_list)}.items(), key=lambda x: x[1]))]print sorted_list

#27


-1  

My buddy Wes gave me this sweet answer using list comprehensions.

Example Code:

>>> l = [3, 4, 3, 6, 4, 1, 4, 8]
>>> l = [l[i] for i in range(len(l)) if i == l.index(l[i])]
>>> l
[3, 4, 6, 1, 8]

#28


-1  

Just to add another answer I've not seen listed:

>>> a = ['f', 'F', 'F', 'G', 'a', 'b', 'b', 'c', 'd', 'd', 'd', 'f']
>>> [a[i] for i in sorted(set([a.index(elem) for elem in a]))]
['f', 'F', 'G', 'a', 'b', 'c', 'd']

This uses .index to get the first index of every list element, and gets rid of duplicate results (for repeating elements) with set, then sorts, because there is no order in sets. Note that we do not lose order information, because the first index of every new element is always in ascending order, so sorted will always put it right.

I've just considered the easy syntax, not performance.
