大多数pythonic方式计算可迭代的匹配元素

I have an iterable of entries on which I would like to gather some simple statistics, say the count of all numbers divisible by two and the count of all numbers divisible by three.

我有一个可迭代的条目,我想收集一些简单的统计数据,比如可以被2整除的所有数字的数量以及可被3整除的所有数字的数量。

My first alternative, While only iterating through the list once and avoiding the list expansion (and keeping the split loop refactoring in mind), looks rather bloated:

我的第一个选择,虽然只迭代列表一次并避免列表扩展(并保持拆分循环重构),看起来相当臃肿:

(alt 1)

r = xrange(1, 10)

twos = 0
threes = 0

for v in r:
  if v % 2 == 0:
    twos+=1
  if v % 3 == 0:
    threes+=1

print twos
print threes

This looks rather nice, but has the drawback of expanding the expression to a list:

这看起来相当不错,但是有将表达式扩展到列表的缺点:

(alt 2)

r = xrange(1, 10)

print len([1 for v in r if v % 2 == 0])
print len([1 for v in r if v % 3 == 0])

What I would really like is something like a function like this:

我真正喜欢的是像这样的函数:

(alt 3)

def count(iterable):
  n = 0
  for i in iterable:
    n += 1
  return n

r = xrange(1, 10)

print count(1 for v in r if v % 2 == 0)
print count(1 for v in r if v % 3 == 0)

But this looks a lot like something that could be done without a function. The final variant is this:

但这看起来很像没有功能的事情。最后的变体是这样的:

(alt 4)

r = xrange(1, 10)

print sum(1 for v in r if v % 2 == 0)
print sum(1 for v in r if v % 3 == 0)

and while the smallest (and in my book probably the most elegant) it doesn't feel like it expresses the intent very well.

虽然最小(在我的书中可能是最优雅的)但它并不觉得它很好地表达了意图。

So, my question to you is:

所以,我的问题是:

Which alternative do you like best to gather these types of stats? Feel free to supply your own alternative if you have something better.

您最喜欢哪种方式来收集这些类型的统计数据?如果你有更好的东西,请随意提供自己的选择。

To clear up some confusion below:

为了澄清下面的一些混淆:

In reality my filter predicates are more complex than just this simple test.

实际上,我的过滤谓词比这个简单的测试更复杂。

The objects I iterate over are larger and more complex than just numbers

我迭代的对象比数字更大,更复杂

My filter functions are more different and hard to parameterize into one predicate

我的过滤函数更加不同,难以参数化为一个谓词

12 个解决方案

#1

Having to iterate over the list multiple times isn't elegant IMHO.

不得不多次迭代列表是不优雅的恕我直言。

I'd probably create a function that allows doing:

我可能会创建一个允许执行以下操作的函数:

twos, threes = countmatching(xrange(1,10),
                             lambda a: a % 2 == 0,
                             lambda a: a % 3 == 0)

A starting point would be something like this:

一个起点是这样的:

def countmatching(iterable, *predicates):
    v = [0] * len(predicates)
    for e in iterable:
        for i,p in enumerate(predicates):
            if p(e):
                v[i] += 1
    return tuple(v)

Btw, "itertools recipes" has a recipe for doing much like your alt4.

顺便说一句,“itertools食谱”有一个像你的alt4做的很多的配方。

def quantify(seq, pred=None):
    "Count how many times the predicate is true in the sequence"
    return sum(imap(pred, seq))

#2

Alt 4! But maybe you should refactor the code to a function that takes an argument which should contain the divisible number (two and three). And then you could have a better functionname.

Alt 4!但也许你应该将代码重构为一个函数,该函数接受一个应该包含可分数(两个和三个)的参数。然后你可以有一个更好的功能名称。

def methodName(divNumber, r):
  return sum(1 for v in r if v % divNumber == 0)


print methodName(2, xrange(1, 10))
print methodName(3, xrange(1, 10))

#3

You could use the filter function.

您可以使用过滤功能。

It filters a list (or strictly an iterable) producing a new list containing only the items for which the specified function evaluates to true.

它过滤列表(或严格可迭代),生成一个新列表,其中仅包含指定函数计算结果为true的项。

r = xrange(1, 10)

def is_div_two(n):
    return n % 2 == 0

def is_div_three(n):
    return n % 3 == 0

print len(filter(is_div_two,r))
print len(filter(is_div_three,r))

This is good as it allows you keep your statistics logic contained in a function and the intent of the filter should be pretty clear.

这很好,因为它允许您将统计逻辑包含在函数中,并且过滤器的意图应该非常清楚。

#4

I would choose a small variant of your (alt 4):

我会选择你的一个小变种(alt 4):

def count(predicate, list):
    print sum(1 for x in list if predicate(x))

r = xrange(1, 10)

count(lambda x: x % 2 == 0, r)
count(lambda x: x % 3 == 0, r)
# ...

If you want to change what count does, change its implementation in one place.

如果要更改计数,请在一个位置更改其实现。

Note: since your predicates are complex, you'll probably want to define them in functions instead of lambdas. And so you'll probably want to put all this in a class rather than the global namespace.

注意:由于您的谓词很复杂,您可能希望在函数中定义它们而不是lambdas。因此,您可能希望将所有这些放在一个类而不是全局命名空间中。

#5

Well you could do one list comprehension/expression to get a set of tuples with that stat test in them and then reduce that down to get the sums.

那么你可以做一个列表理解/表达式来获得一组具有该stat测试的元组,然后将其降低以获得总和。


r=xrange(10)
s=( (v % 2 == 0, v % 3 == 0) for v in r )
def add_tuples(t1,t2):
     return tuple(x+y for x,y in zip(t1, t2))
sums=reduce(add_tuples, s, (0,0)) # (0,0) is starting amount

print sums[0] # sum of numbers divisible by 2
print sums[1] # sum of numbers divisible by 3

Using generator expression etc should mean you'll only run through the iterator once (unless reduce does anything odd?). Basically you'd be doing map/reduce...

使用生成器表达式等应该意味着你只会运行一次迭代器(除非reduce做什么奇怪的事情?)。基本上你会做map / reduce ...

#6

True booleans are coerced to unit integers, and false booleans to zero integers. So if you're happy to use scipy or numpy, make an array of integers for each element of your sequence, each array containing one element for each of your tests, and sum over the arrays. E.g.

真正的布尔值被强制转换为单位整数,而假布尔值则强制为零整数。因此,如果您乐意使用scipy或numpy,请为序列的每个元素创建一个整数数组,每个数组包含一个元素用于每个测试,并对数组求和。例如。

>>> sum(scipy.array([c % 2 == 0, c % 3 == 0]) for c in xrange(10))
array([5, 4])

#7

I would definitely be looking at a numpy array instead of an iterable list if you just have numbers. You will almost certainly be able to do what you want with some terse arithmetic on the array.

如果你只有数字,我肯定会看一个numpy数组而不是一个可迭代的列表。几乎可以肯定,你可以通过对阵列进行一些简洁的算术来做你想做的事情。

#8

Not as terse as you are looking for, but more efficient, it actually works with any iterable, not just iterables you can loop over multiple times, and you can expand the things to check for without complicating it further:

不像你想要的那样简洁,但更高效,它实际上适用于任何可迭代的,而不仅仅是你可以循环多次的迭代,你可以扩展要检查的东西而不会使它进一步复杂化:

r = xrange(1, 10)

counts = {
   2: 0,
   3: 0,
}

for v in r:
    for q in counts:
        if not v % q:
            counts[q] += 1
        # Or, more obscure:
        #counts[q] += not v % q

for q in counts:
    print "%s's: %s" % (q, counts[q])

#9

from itertools import groupby
from collections import defaultdict

def multiples(v):
    return 2 if v%2==0 else 3 if v%3==0 else None
d = defaultdict(list)

for k, values in groupby(range(10), multiples):
    if k is not None:
        d[k].extend(values)

#10

Inspired by the OO-stab above, I had to try my hands on one as well (although this is way overkill for the problem I'm trying to solve :)

受到上面的OO-stab的启发,我不得不尝试一下(虽然这对我正在努力解决的问题有点过分杀戮:)

class Stat(object):
  def update(self, n):
    raise NotImplementedError

  def get(self):
    raise NotImplementedError


class TwoStat(Stat):
  def __init__(self):
    self._twos = 0

  def update(self, n):
    if n % 2 == 0: self._twos += 1

  def get(self):
    return self._twos


class ThreeStat(Stat):
  def __init__(self):
    self._threes = 0

  def update(self, n):
    if n % 3 == 0: self._threes += 1

  def get(self):
    return self._threes


class StatCalculator(object):
  def __init__(self, stats):
    self._stats = stats

  def calculate(self, r):
    for v in r:
      for stat in self._stats:
        stat.update(v)
    return tuple(stat.get() for stat in self._stats)


s = StatCalculator([TwoStat(), ThreeStat()])

r = xrange(1, 10)
print s.calculate(r)

#11

Alt 3, for the reason that it doesn't use memory proportional to the number of "hits". Given a pathological case like xrange(one_trillion), many of the other offered solutions would fail badly.

Alt 3,因为它不使用与“命中”数量成比例的内存。鉴于像xrange(one_trillion)这样的病态案例,许多其他提供的解决方案都会失败。

#12

The idea here is to use reduction to avoid repeated iterations. Also, this does not create any extra data structures, if memory is an issue for you. You start with a dictionary with your counters ({'div2': 0, 'div3': 0}) and increment them along the iteration.

这里的想法是使用简化来避免重复迭代。此外,如果内存对您来说是个问题,这不会创建任何额外的数据结构。你从一个包含计数器的字典开始({'div2':0,'div3':0})并沿着迭代递增它们。

def increment_stats(stats, n):
    if n % 2 == 0: stats['div2'] += 1
    if n % 3 == 0: stats['div3'] += 1
    return stats

r = xrange(1, 10)
stats = reduce(increment_stats, r, {'div2': 0, 'div3': 0})
print stats

If you want to count anything more complicated than divisors, it would be appropriate to use a more object-oriented approach (with the same advantages), encapsulating the logic for stats extraction.

如果你想计算比除数更复杂的东西,那么使用更加面向对象的方法(具有相同的优点),封装统计提取的逻辑是合适的。

class Stats:

    def __init__(self, div2=0, div3=0):
        self.div2 = div2
        self.div3 = div3

    def increment(self, n):
        if n % 2 == 0: self.div2 += 1
        if n % 3 == 0: self.div3 += 1
        return self

    def __repr__(self):
        return 'Stats(%d, %d)' % (self.div2, self.div3)

r = xrange(1, 10)
stats = reduce(lambda stats, n: stats.increment(n), r, Stats())
print stats

Please point out any mistakes.

请指出任何错误。

@Henrik: I think the first approach is less maintainable since you have to control initialization of the dictionary in one place and update in another, as well as having to use strings to refer to each stat (instead of having attributes). And I do not think OO is overkill in this case, for you said the predicates and objects will be complex in your application. In fact if the predicates were really simple, I wouldn't even bother to use a dictionary, a single fixed size list would be just fine. Cheers :)

@Henrik:我认为第一种方法不易维护,因为你必须在一个地方控制字典的初始化并在另一个地方更新,以及必须使用字符串来引用每个stat(而不是具有属性)。在这种情况下,我不认为OO是矫枉过正的,因为你说谓词和对象在你的应用程序中会很复杂。事实上,如果谓词非常简单,我甚至不愿意使用字典,单个固定大小的列表就可以了。干杯:)

#1