How to efficiently sum the rows of a 2D NumPy array grouped by id?

Time: 2021-08-16 21:29:15

I have a 2D array a and a 1D array b. I want to compute the sum of the rows of a, grouped by each id in b. For example:

import numpy as np

a = np.array([[1, 2, 3], [2, 3, 4], [4, 5, 6]])
b = np.array([0, 1, 0])
count = len(b)
ls = list(set(b))
res = np.zeros((len(ls), a.shape[1]))
for i in ls:
    res[i] = np.array([a[x] for x in range(count) if b[x] == i]).sum(axis=0)
print(res)

I got the printed result as:

[[ 5.  7.  9.]
 [ 2.  3.  4.]]

What I want to do is: since the 1st and 3rd elements of b are 0, I compute a[0] + a[2], which gives [5, 7, 9] as one row of the result. Similarly, since the 2nd element of b is 1, a[1], i.e. [2, 3, 4], becomes another row of the result.

But my implementation seems quite slow for large arrays. Is there any better implementation?

I know there is a bincount function in NumPy, but it seems to support only 1D arrays. Thank you all for helping me!

2 Answers

#1

Approach #1

You can use np.add.at, which works for ndarrays of generic dimensions, unlike np.bincount, which expects only 1D arrays -

np.add.at(res, b, a)

Sample run -

In [40]: a
Out[40]: 
array([[1, 2, 3],
       [2, 3, 4],
       [4, 5, 6]])

In [41]: b
Out[41]: array([0, 1, 0])

In [45]: res = np.zeros((b.max()+1, a.shape[1]), dtype=a.dtype)

In [46]: np.add.at(res, b, a)

In [47]: res
Out[47]: 
array([[5, 7, 9],
       [2, 3, 4]])

To compute mean values, we need np.bincount to get the counts per label/tag and then divide each row by those counts, like so -

In [49]: res/np.bincount(b)[:,None].astype(float)
Out[49]: 
array([[ 2.5,  3.5,  4.5],
       [ 2. ,  3. ,  4. ]])
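Incidentally, np.bincount itself can also produce the per-id sums, one column at a time, through its weights argument; a minimal sketch on the question's data:

```python
import numpy as np

a = np.array([[1, 2, 3], [2, 3, 4], [4, 5, 6]])
b = np.array([0, 1, 0])

# bincount is 1D-only, but passing each column of a as weights performs the
# grouped sum for that column; stacking the columns recovers the 2D result.
res = np.column_stack([np.bincount(b, weights=a[:, j])
                       for j in range(a.shape[1])])
# res is float, because bincount with weights returns float64
```

This loops over columns in Python, so it suits arrays with few columns and many rows.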

To generalize to b whose ids are not necessarily a 0-based sequence, we can wrap this in a small function that handles sums and averages in a cleaner way, like so -

def groupby_addat(a, b, out="sum"):
    unqb, tags, counts = np.unique(b, return_inverse=True, return_counts=True)
    res = np.zeros((tags.max()+1, a.shape[1]), dtype=a.dtype)
    np.add.at(res, tags, a)

    if out == "mean":
        return unqb, res/counts[:,None].astype(float)
    elif out == "sum":
        return unqb, res
    else:
        print("Invalid output")
        return None

Sample run -

In [201]: a
Out[201]: 
array([[1, 2, 3],
       [2, 3, 4],
       [4, 5, 6]])

In [202]: b
Out[202]: array([ 5, 10,  5])

In [204]: b_ids, means = groupby_addat(a, b, out="mean")

In [205]: b_ids
Out[205]: array([ 5, 10])

In [206]: means
Out[206]: 
array([[ 2.5,  3.5,  4.5],
       [ 2. ,  3. ,  4. ]])

Approach #2

We could also make use of np.add.reduceat, which might be more performant -

def groupby_addreduceat(a, b, out="sum"):
    sidx = b.argsort()
    sb = b[sidx]
    spt_idx = np.concatenate(([0], np.flatnonzero(sb[1:] != sb[:-1])+1, [sb.size]))
    sums = np.add.reduceat(a[sidx], spt_idx[:-1])

    if out == "mean":
        counts = spt_idx[1:] - spt_idx[:-1]
        return sb[spt_idx[:-1]], sums/counts[:,None].astype(float)
    elif out == "sum":
        return sb[spt_idx[:-1]], sums
    else:
        print("Invalid output")
        return None

Sample run -

In [201]: a
Out[201]: 
array([[1, 2, 3],
       [2, 3, 4],
       [4, 5, 6]])

In [202]: b
Out[202]: array([ 5, 10,  5])

In [207]: b_ids, means = groupby_addreduceat(a, b, out="mean")

In [208]: b_ids
Out[208]: array([ 5, 10])

In [209]: means
Out[209]: 
array([[ 2.5,  3.5,  4.5],
       [ 2. ,  3. ,  4. ]])
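The two approaches can be checked against each other; a quick consistency sketch on made-up random data (the sizes and seed here are arbitrary, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(0, 10, size=(1000, 5))
b = rng.integers(0, 50, size=1000)

# Approach #1: scatter-add rows into a zero-initialized table indexed by id
res_at = np.zeros((b.max() + 1, a.shape[1]), dtype=a.dtype)
np.add.at(res_at, b, a)

# Approach #2: sort by id, then segment-sum at each group boundary
sidx = b.argsort()
sb = b[sidx]
spt_idx = np.concatenate(([0], np.flatnonzero(sb[1:] != sb[:-1]) + 1))
res_red = np.add.reduceat(a[sidx], spt_idx)

# res_at has a row for every id in 0..b.max(); res_red has one row per id
# actually present, in sorted order, so restrict res_at before comparing.
assert np.array_equal(res_at[np.unique(b)], res_red)
```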

#2

The numpy_indexed package (disclaimer: I am its author) was made to address problems exactly of this kind in an efficiently vectorized and general manner:

import numpy_indexed as npi
unique_b, mean_a = npi.group_by(b).mean(a)

Note that this solution is general in the sense that it provides a rich set of standard reduction functions (sum, min, mean, median, argmin, and so on), axis keywords if you need to work along different axes, and the ability to group by more complicated keys than just positive integer arrays, such as the elements of multidimensional arrays of arbitrary dtype.

import numpy_indexed as npi
# this caches the complicated O(NlogN) part of the operations
groups = npi.group_by(b)
# all these subsequent operations have the same low vectorized O(N) cost
unique_b, mean_a = groups.mean(a)
unique_b, sum_a = groups.sum(a)
unique_b, min_a = groups.min(a)
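If installing the package is not an option, the grouped mean above can be approximated in plain NumPy; a rough sketch (the library itself offers far more):

```python
import numpy as np

a = np.array([[1, 2, 3], [2, 3, 4], [4, 5, 6]])
b = np.array([5, 10, 5])

# Map each id to a dense 0-based group index, scatter-add the rows per
# group, then divide by the group sizes to get per-group means.
unique_b, inv = np.unique(b, return_inverse=True)
sums = np.zeros((unique_b.size, a.shape[1]))
np.add.at(sums, inv, a)
mean_a = sums / np.bincount(inv)[:, None]
```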
