How to efficiently compute the sum and mean of a 2D NumPy array by id?

Time: 2021-08-16 21:29:15

I have a 2D array a and a 1D array b. I want to sum the rows of a, grouped by the corresponding id in b. For example:


import numpy as np

a = np.array([[1,2,3],[2,3,4],[4,5,6]])
b = np.array([0,1,0])
count = len(b)
ls = list(set(b))
res = np.zeros((len(ls),a.shape[1]))
for i in ls:
    res[i] = np.array([a[x] for x in range(0,count) if b[x] == i]).sum(axis=0)
print(res)

This prints:


[[ 5.  7.  9.]
 [ 2.  3.  4.]]

Since the 1st and 3rd elements of b are 0, I compute a[0]+a[2], which gives [5, 7, 9] as one row of the result. Similarly, the 2nd element of b is 1, so a[1], which is [2, 3, 4], becomes the other row of the result.


However, my implementation is quite slow for large arrays. Is there a better way to do this?


I know NumPy has a bincount function, but it seems to support only 1D arrays. Thank you all for your help!


2 Solutions



Approach #1

You can use np.add.at, which works on ndarrays of any number of dimensions, unlike np.bincount, which expects only 1D arrays -

np.add.at(res, b, a)

Sample run -


In [40]: a
Out[40]:
array([[1, 2, 3],
       [2, 3, 4],
       [4, 5, 6]])

In [41]: b
Out[41]: array([0, 1, 0])

In [45]: res = np.zeros((b.max()+1, a.shape[1]), dtype=a.dtype)

In [46]: np.add.at(res, b, a)

In [47]: res
Out[47]:
array([[5, 7, 9],
       [2, 3, 4]])
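
As a side note on why np.add.at is used here instead of the more obvious res[b] += a: with repeated ids, fancy-indexed in-place addition is buffered, so each row of res is written once rather than accumulated. The following is a small sketch on the same sample data; the variable names buffered and unbuffered are only illustrative.

```python
import numpy as np

a = np.array([[1, 2, 3], [2, 3, 4], [4, 5, 6]])
b = np.array([0, 1, 0])

# Buffered: res[b] + a is computed first, then assigned back, so for the
# repeated id 0 only one of a[0] / a[2] survives instead of their sum.
buffered = np.zeros((b.max() + 1, a.shape[1]), dtype=a.dtype)
buffered[b] += a

# Unbuffered: every row of a is accumulated into the row selected by b.
unbuffered = np.zeros_like(buffered)
np.add.at(unbuffered, b, a)

print(buffered)    # row 0 is NOT [5, 7, 9]
print(unbuffered)  # [[5 7 9]
                   #  [2 3 4]]
```

Row 1 is the same in both results because id 1 occurs only once; the difference shows up only for ids that repeat.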

To compute mean values, use np.bincount to get the count per label/tag and then divide each row of the sums by its count, like so -


In [49]: res/np.bincount(b)[:,None].astype(float)
Out[49]:
array([[ 2.5,  3.5,  4.5],
       [ 2. ,  3. ,  4. ]])
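
For comparison, np.bincount can also produce the sums themselves if it is called once per column with its weights argument. This is only a sketch of that alternative, not part of the approach above; it assumes the ids in b are non-negative integers, that every id from 0 to b.max() occurs at least once (otherwise the mean divides by zero), and that the number of columns is small enough for a Python loop over them to be acceptable. Note that the result is always float64.

```python
import numpy as np

a = np.array([[1, 2, 3], [2, 3, 4], [4, 5, 6]])
b = np.array([0, 1, 0])
n_groups = b.max() + 1

# One weighted bincount per column: each call sums that column's values per id.
sums = np.stack(
    [np.bincount(b, weights=a[:, j], minlength=n_groups) for j in range(a.shape[1])],
    axis=1,
)

# Plain bincount gives the group sizes; broadcast them down the rows.
means = sums / np.bincount(b, minlength=n_groups)[:, None]

print(sums)   # [[5. 7. 9.]
              #  [2. 3. 4.]]
print(means)  # [[2.5 3.5 4.5]
              #  [2.  3.  4. ]]
```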

To handle b whose ids are not necessarily consecutive integers starting from 0, we can wrap this in a small generic function that handles both sums and averages, like so -


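def groupby_addat(a, b, out="sum"):
    # Map each id in b to a consecutive tag 0..k-1 and count the rows per id
    unqb, tags, counts = np.unique(b, return_inverse=True, return_counts=True)
    res = np.zeros((tags.max()+1, a.shape[1]), dtype=a.dtype)
    # Unbuffered add: rows of a that share a tag accumulate into one row of res
    np.add.at(res, tags, a)

    if out=="mean":
        return unqb, res/counts[:,None].astype(float)
    elif out=="sum":
        return unqb, res
    else:
        print("Invalid output")
        return None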

Sample run -


In [201]: a
Out[201]:
array([[1, 2, 3],
       [2, 3, 4],
       [4, 5, 6]])

In [202]: b
Out[202]: array([ 5, 10,  5])

In [204]: b_ids, means = groupby_addat(a, b, out="mean")

In [205]: b_ids
Out[205]: array([ 5, 10])

In [206]: means
Out[206]:
array([[ 2.5,  3.5,  4.5],
       [ 2. ,  3. ,  4. ]])

Approach #2

We could also use np.add.reduceat, which might be more performant -


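def groupby_addreduceat(a, b, out="sum"):
    # Sort the ids so that rows belonging to the same id become contiguous
    sidx = b.argsort()
    sb = b[sidx]
    # Start index of every run of equal ids, plus the total size as an end marker
    spt_idx =np.concatenate(([0], np.flatnonzero(sb[1:] != sb[:-1])+1, [sb.size]))
    # Sum the sorted rows between consecutive start indices
    sums = np.add.reduceat(a[sidx],spt_idx[:-1])

    if out=="mean":
        counts = spt_idx[1:] - spt_idx[:-1]
        return sb[spt_idx[:-1]], sums/counts[:,None].astype(float)
    elif out=="sum":
        return sb[spt_idx[:-1]], sums
    else:
        print("Invalid output")
        return None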

Sample run -


In [201]: a
Out[201]:
array([[1, 2, 3],
       [2, 3, 4],
       [4, 5, 6]])

In [202]: b
Out[202]: array([ 5, 10,  5])

In [207]: b_ids, means = groupby_addreduceat(a, b, out="mean")

In [208]: b_ids
Out[208]: array([ 5, 10])

In [209]: means
Out[209]:
array([[ 2.5,  3.5,  4.5],
       [ 2. ,  3. ,  4. ]])
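
To make the reduceat approach easier to follow, here is a step-by-step sketch of the intermediate arrays for the same sample data. The only deviation from the function above is kind="stable" in the argsort call, which is an assumption added here so that the order of rows with equal ids is deterministic; the sums do not depend on it.

```python
import numpy as np

a = np.array([[1, 2, 3], [2, 3, 4], [4, 5, 6]])
b = np.array([5, 10, 5])

# 1. Sort the ids so equal ids are adjacent.
sidx = b.argsort(kind="stable")
sb = b[sidx]
print(sidx)    # [0 2 1]
print(sb)      # [ 5  5 10]

# 2. A new group starts at 0 and wherever the sorted id changes.
starts = np.concatenate(([0], np.flatnonzero(sb[1:] != sb[:-1]) + 1))
print(starts)  # [0 2]

# 3. reduceat sums rows starts[i]:starts[i+1] of the sorted array
#    (the last group runs to the end of the array).
sums = np.add.reduceat(a[sidx], starts, axis=0)
print(sums)    # [[5 7 9]
               #  [2 3 4]]
```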



The numpy_indexed package (disclaimer: I am its author) was made to address exactly this kind of problem in an efficiently vectorized and general manner:


import numpy_indexed as npi
unique_b, mean_a = npi.group_by(b).mean(a)

Note that this solution is general in the sense that it provides a rich set of standard reduction functions (sum, min, mean, median, argmin, and so on), axis keywords if you need to work with different axes, and the ability to group by more complicated things than just positive integer arrays, such as the elements of multidimensional arrays of arbitrary dtype.


import numpy_indexed as npi
# this caches the complicated O(NlogN) part of the operations
groups = npi.group_by(b)
# all these subsequent operations have the same low vectorized O(N) cost
unique_b, mean_a = groups.mean(a)
unique_b, sum_a = groups.sum(a)
unique_b, min_a = groups.min(a)
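
The idea of paying for the sort once and reusing it for several reductions can be sketched in plain NumPy. The class below, GroupIndex, is a hypothetical helper written for illustration; it is not numpy_indexed's implementation or API, and it assumes a non-empty 1D array of sortable keys and a 2D array of values.

```python
import numpy as np

class GroupIndex:
    """Hypothetical helper: sort the keys once, reuse the result for many reductions."""

    def __init__(self, keys):
        keys = np.asarray(keys)
        # O(N log N) part, done once: sort order and the start of each group.
        self.order = np.argsort(keys, kind="stable")
        sorted_keys = keys[self.order]
        boundaries = np.flatnonzero(sorted_keys[1:] != sorted_keys[:-1]) + 1
        self.starts = np.concatenate(([0], boundaries))
        self.unique = sorted_keys[self.starts]
        self.counts = np.diff(np.concatenate((self.starts, [keys.size])))

    def reduce(self, values, ufunc=np.add):
        # O(N) part: apply any binary ufunc (np.add, np.minimum, ...) per group.
        return self.unique, ufunc.reduceat(np.asarray(values)[self.order], self.starts, axis=0)

    def mean(self, values):
        unique, sums = self.reduce(values, np.add)
        return unique, sums / self.counts[:, None]

a = np.array([[1, 2, 3], [2, 3, 4], [4, 5, 6]])
b = np.array([5, 10, 5])

groups = GroupIndex(b)
ids, sum_a = groups.reduce(a, np.add)
_, min_a = groups.reduce(a, np.minimum)
_, mean_a = groups.mean(a)
```

Each call after the constructor reuses the stored sort order and group starts, so only the reduction itself is repeated.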


