Counting the non-zero elements in each row and each column of a 2D NumPy array

Time: 2021-10-25 21:22:55

I have a NumPy matrix that contains mostly non-zero values, but occasionally will contain a zero value. I need to be able to:

  1. Count the non-zero values in each row and put that count into a variable that I can use in subsequent operations, perhaps by iterating through row indices and performing the calculations during the iterative process.

  2. Count the non-zero values in each column and put that count into a variable that I can use in subsequent operations, perhaps by iterating through column indices and performing the calculations during the iterative process.

For example, one thing I need to do is to sum each row and then divide each row sum by the number of non-zero values in each row, reporting a separate result for each row index. And then I need to sum each column and then divide the column sum by the number of non-zero values in the column, also reporting a separate result for each column index. I need to do other things as well, but they should be easy after I figure out how to do the things that I am listing here.

The code I am working with is below. You can see that I am creating an array of zeros and then populating it from a csv file. Some of the rows will contain values for all the columns, but other rows will still have some zeros remaining in some of the last columns, thus creating the problem described above.

The last five lines of the code below are from another posting on this forum. These last five lines of code return a printed list of row/column indices for the zeros. However, I do not know how to use that resulting information to create the non-zero row counts and non-zero column counts described above.

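import csv
from numpy import zeros, indices, hsplit

ANOVAInputMatrixValuesArray = zeros([len(TestIDs), 9], float)
for j in range(len(TestIDs)):
    TestID = str(TestIDs[j])
    ReadOrWrite = 'Read'
    fileName = inputFileName
    # Placeholder call: the real arguments are the ones that return the correct path
    directory = GetCurrentDirectory(arguments that return correct directory)
    inputfile = open(directory, 'r')
    reader = csv.reader(inputfile)
    m = 0
    for row in reader:
        # Keep at most 9 values per test and skip the header row
        if m < 9:
            if row[0] != 'TestID':
                # Write to row j (the original index j-1 sent test 0 to the last row)
                ANOVAInputMatrixValuesArray[j, m] = row[2]
                m += 1
    inputfile.close()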

IndicesOfZeros = indices(ANOVAInputMatrixValuesArray.shape) 
locs = IndicesOfZeros[:,ANOVAInputMatrixValuesArray == 0]
pts = hsplit(locs, len(locs[0]))
for pt in pts:
    print(', '.join(str(p[0]) for p in pt))

Can anyone help me with this?

4 solutions

#1


28  

import numpy as np

a = np.array([[1, 0, 1],
              [2, 3, 4],
              [0, 0, 7]])

columns = (a != 0).sum(0)
rows    = (a != 0).sum(1)

The expression (a != 0) is a boolean array with the same shape as the original a; it contains True for every non-zero element.

The .sum(x) method sums the elements along axis x. Because True counts as 1 and False as 0, the sum of a boolean array is the number of True elements.

The variables columns and rows contain the number of non-zero (element != 0) values in each column/row of your original array:

columns = np.array([2, 1, 3])
rows    = np.array([2, 3, 1])
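
As a runnable sketch of the counting step above, the snippet below repeats the mask-and-sum approach and cross-checks it against np.count_nonzero. The axis argument of np.count_nonzero is assumed to be available (reasonably recent NumPy); the names mask, columns_alt and rows_alt are only illustrative.

```python
import numpy as np

a = np.array([[1, 0, 1],
              [2, 3, 4],
              [0, 0, 7]])

# Boolean mask: True where the element is non-zero
mask = a != 0

# Summing booleans counts the True entries along the given axis
columns = mask.sum(axis=0)   # one count per column
rows = mask.sum(axis=1)      # one count per row

# np.count_nonzero gives the same counts without building the mask by hand
columns_alt = np.count_nonzero(a, axis=0)
rows_alt = np.count_nonzero(a, axis=1)

# The counts are ordinary arrays, so they can be indexed inside a loop
for i, n in enumerate(rows):
    print(f"row {i}: {n} non-zero values")
```

Either form leaves the counts in plain arrays, which should cover the "put that count into a variable" part of the question.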

EDIT: The whole code could look like this (with a few simplifications in your original code):

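from numpy import zeros, loadtxt

ANOVAInputMatrixValuesArray = zeros([len(TestIDs), 9], float)
for j, TestID in enumerate(TestIDs):
    ReadOrWrite = 'Read'
    fileName = inputFileName
    directory = GetCurrentDirectory(arguments that return correct directory)
    # use directory or filename to get the CSV file?
    with open(directory, 'r') as csvfile:
        # Skip the 'TestID' header line and read at most 9 values from column 2
        # (adjust delimiter to match the file)
        values = loadtxt(csvfile, comments='TestID', delimiter=';', usecols=(2,), ndmin=1)[:9]
    # Files with fewer than 9 values leave the trailing columns at zero
    ANOVAInputMatrixValuesArray[j, :len(values)] = values

# Number of non-zero values per column and per row
nonZeroCols = (ANOVAInputMatrixValuesArray != 0).sum(0)
nonZeroRows = (ANOVAInputMatrixValuesArray != 0).sum(1)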

EDIT 2:

To get the mean of the non-zero values in each column/row, use the following:

colMean = a.sum(0) / (a != 0).sum(0)
rowMean = a.sum(1) / (a != 0).sum(1)

What do you want to do if there are no non-zero elements in a column/row? Then we can adapt the code to solve such a problem.
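
One possible answer to that question, as a sketch only: report NaN for any row or column that has no non-zero elements. The helper name nonzero_mean is hypothetical (it is not part of NumPy), and NaN is an assumed choice; another fill value may suit the analysis better.

```python
import numpy as np

def nonzero_mean(a, axis):
    """Mean of the non-zero entries along `axis`; NaN where there are none."""
    totals = a.sum(axis=axis)
    counts = np.count_nonzero(a, axis=axis)
    # Start from NaN and only divide where at least one non-zero value exists
    out = np.full(totals.shape, np.nan, dtype=float)
    np.divide(totals, counts, out=out, where=counts > 0)
    return out

a = np.array([[1.0, 0.0, 1.0],
              [0.0, 0.0, 0.0],   # a row with no non-zero values
              [0.0, 0.0, 7.0]])

row_means = nonzero_mean(a, axis=1)
col_means = nonzero_mean(a, axis=0)
print(row_means)
print(col_means)
```

Using the where argument of np.divide avoids the division-by-zero warning that a plain a.sum(...) / counts would raise for the empty row and column.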

#2


9  

A fast way to count nonzero elements per row in a scipy sparse matrix m is:

np.diff(m.tocsr().indptr)

The indptr attribute of a CSR matrix holds the positions in the data array where each row starts and ends. The difference between consecutive entries is therefore the number of non-zero (stored) elements in each row.

Similarly, for the number of nonzero elements in each column, use:

np.diff(m.tocsc().indptr)

If the data is already in the appropriate form, these will run in O(m.shape[0]) and O(m.shape[1]) respectively, rather than O(m.getnnz()) in Marat and Finn's solutions.
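
To make the indptr idea concrete, here is a small sketch on the 3x3 example from the first answer, assuming scipy.sparse is installed. The variable names are illustrative only.

```python
import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([[1, 0, 1],
                  [2, 3, 4],
                  [0, 0, 7]])
m = csr_matrix(dense)

# indptr has one more entry than there are rows; entries k and k+1
# delimit the slice of m.data that belongs to row k
print(m.indptr)

# Differences of consecutive boundaries are the per-row counts
row_nonzeros = np.diff(m.indptr)

# The same trick on the CSC form gives the per-column counts
col_nonzeros = np.diff(m.tocsc().indptr)

print(row_nonzeros, col_nonzeros)
```

Note that these are counts of stored entries, so an explicitly stored zero would be counted as well.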

If you need both row and column nonzero counts, and, say, m is already in CSR format, you might use:

row_nonzeros = np.diff(m.indptr)
col_nonzeros = np.bincount(m.indices, minlength=m.shape[1])

which is not asymptotically faster than first converting to CSC (which is O(m.getnnz())) to get col_nonzeros, but is faster because of implementation details.
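
One detail worth checking when using np.bincount this way: without minlength, the result stops at the highest column index that actually appears, so trailing all-zero columns are silently dropped. A small sketch, assuming scipy.sparse is installed (the names short and col_nonzeros are illustrative):

```python
import numpy as np
from scipy.sparse import csr_matrix

# The last column contains no non-zero values
m = csr_matrix(np.array([[1, 0, 0],
                         [2, 3, 0]]))

# Plain bincount only covers column indices 0..max(m.indices)
short = np.bincount(m.indices)

# minlength pads the result so there is one count per column
col_nonzeros = np.bincount(m.indices, minlength=m.shape[1])

print(len(short), col_nonzeros)
```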

#3


2  

A faster way is to clone your matrix with ones in place of the real values, then sum by rows or columns:

X_clone = X.tocsc().copy()  # copy so that X itself is untouched if it is already CSC
X_clone.data = np.ones(X_clone.data.shape)
NumNonZeroElementsByColumn = X_clone.sum(0)
NumNonZeroElementsByRow = X_clone.sum(1)

That worked 50 times faster for me than Finn Årup Nielsen's solution (1 second against 53).

edit: You may need to convert NumNonZeroElementsByColumn into a 1-dimensional array with:

np.array(NumNonZeroElementsByColumn)[0]
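
A sketch of the whole clone-and-sum approach on a small matrix, assuming scipy.sparse is installed. It flattens both results with np.asarray(...).ravel(), which handles the row sums (a column-shaped result) as well as the column sums, and cross-checks against getnnz(axis=...). The names by_column and by_row are illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix

X = csr_matrix(np.array([[1.0, 0.0, 1.0],
                         [2.0, 3.0, 4.0],
                         [0.0, 0.0, 7.0]]))

# Work on a copy so the stored values of X are left alone
X_clone = X.tocsc().copy()
X_clone.data = np.ones(X_clone.data.shape)

# Sparse sums come back 2-D for the matrix classes; flatten them
by_column = np.asarray(X_clone.sum(0)).ravel()
by_row = np.asarray(X_clone.sum(1)).ravel()

# getnnz with an axis argument reports the stored-entry counts directly
print(by_column, X.getnnz(axis=0))
print(by_row, X.getnnz(axis=1))
```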

#4


0  

(a != 0) does not work for sparse matrices (scipy.sparse.lil_matrix) in my present version of scipy.

For sparse matrices I did:

    (i,j) = X.nonzero()
    column_sums = np.zeros(X.shape[1])
    for n in np.asarray(j).ravel():
        column_sums[n] += 1.

I wonder if there is a more elegant way.
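
One candidate for a more compact version, offered as a sketch: feed the index arrays returned by nonzero() to np.bincount instead of looping in Python. This assumes scipy.sparse is installed; the example matrix and the names column_counts and row_counts are made up for illustration.

```python
import numpy as np
from scipy.sparse import lil_matrix

# Small illustrative matrix with an empty column at index 1 and index 3
X = lil_matrix((3, 4))
X[0, 0] = 1
X[0, 2] = 5
X[1, 2] = 3
X[2, 2] = 7

# Row and column indices of the non-zero entries
i, j = X.nonzero()

# Count how often each index occurs; minlength keeps empty rows/columns
column_counts = np.bincount(j, minlength=X.shape[1])
row_counts = np.bincount(i, minlength=X.shape[0])

print(column_counts, row_counts)
```

The counts come back as integers rather than the floats produced by the loop above.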
