Slicing a NumPy 2d array, or how do I extract an mxm submatrix from an nxn array (n>m)?

Date: 2022-11-20 21:34:19

I want to slice a NumPy nxn array. I want to extract an arbitrary selection of m rows and columns of that array (i.e. without any pattern in the numbers of rows/columns), making it a new, mxm array. For this example let us say the array is 4x4 and I want to extract a 2x2 array from it.

Here is our array:

import numpy as np
x = np.arange(16).reshape(4, 4)

print(x)
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]]

The rows and columns to remove are the same. The easiest case is when I want to extract a 2x2 submatrix that is at the beginning or at the end, i.e.:

In [33]: x[0:2,0:2]
Out[33]: 
array([[0, 1],
       [4, 5]])

In [34]: x[2:,2:]
Out[34]: 
array([[10, 11],
       [14, 15]])

But what if I need to remove another mixture of rows/columns? What if I need to remove the first and third rows and columns, thus extracting the submatrix [[5,7],[13,15]]? There can be any combination of rows/columns. I read somewhere that I just need to index my array using arrays/lists of indices for both rows and columns, but that doesn't seem to work:

In [35]: x[[1,3],[1,3]]
Out[35]: array([ 5, 15])

I found one way, which is:

In [61]: x[[1,3]][:,[1,3]]
Out[61]: 
array([[ 5,  7],
       [13, 15]])

The first issue with this is that it is hardly readable, although I can live with that. If someone has a better solution, I'd certainly like to hear it.

The other thing is that I read on a forum that indexing arrays with arrays forces NumPy to make a copy of the desired array, so this could become a problem when dealing with large arrays. Why is that so / how does this mechanism work?

6 Answers

#1


46  

As Sven mentioned, x[[[0],[2]],[1,3]] will give back the 0 and 2 rows that match with the 1 and 3 columns while x[[0,2],[1,3]] will return the values x[0,1] and x[2,3] in an array.

There is a helpful function for doing the first example I gave, numpy.ix_. You can do the same thing as my first example with x[numpy.ix_([0,2],[1,3])]. This can save you from having to enter in all of those extra brackets.
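
To make the difference concrete, here is a small sketch (not part of the original answer) that puts paired indexing and numpy.ix_ side by side on the 4x4 array from the question. The variable names paired, rows, cols and block are only illustrative.

```python
import numpy as np

x = np.arange(16).reshape(4, 4)

# Paired indexing: picks the two elements x[0, 1] and x[2, 3].
paired = x[[0, 2], [1, 3]]

# np.ix_ builds an open mesh: rows 0 and 2 crossed with columns 1 and 3.
rows, cols = np.ix_([0, 2], [1, 3])
block = x[rows, cols]

print(paired)                  # [ 1 11]
print(rows.shape, cols.shape)  # (2, 1) (1, 2)
print(block)                   # [[ 1  3]
                               #  [ 9 11]]
```

The (2, 1) and (1, 2) shapes are what replace the extra brackets in x[[[0],[2]],[1,3]].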

#2


102  

To answer this question, we have to look at how indexing a multidimensional array works in NumPy. Let's first say you have the array x from your question. The buffer assigned to x will contain 16 ascending integers from 0 to 15. If you access one element, say x[i,j], NumPy has to figure out the memory location of this element relative to the beginning of the buffer. This is done by calculating, in effect, i*x.shape[1]+j (and multiplying by the size of an int to get an actual memory offset).
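
As a rough illustration of that formula (a sketch, assuming x is the freshly created, C-contiguous array from the question), the flattened array can stand in for the buffer; flat and element_offset are names made up for this example.

```python
import numpy as np

x = np.arange(16).reshape(4, 4)
flat = x.ravel()  # for a C-contiguous array: the elements in memory order

i, j = 2, 3
element_offset = i * x.shape[1] + j        # position counted in elements
byte_offset = element_offset * x.itemsize  # position counted in bytes

print(element_offset)                      # 11
print(flat[element_offset] == x[i, j])     # True
```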

If you extract a subarray by basic slicing like y = x[0:2,0:2], the resulting object will share the underlying buffer with x. But what happens if you access y[i,j]? NumPy can't use i*y.shape[1]+j to calculate the offset into the array, because the data belonging to y is not consecutive in memory.

NumPy solves this problem by introducing strides. When calculating the memory offset for accessing x[i,j], what is actually calculated is i*x.strides[0]+j*x.strides[1] (and this already includes the factor for the size of an int, 4 bytes in the output below):

x.strides
(16, 4)

When y is extracted like above, NumPy does not create a new buffer, but it does create a new array object referencing the same buffer (otherwise y would just be equal to x). The new array object will have a different shape than x and maybe a different starting offset into the buffer, but will share the strides with x (in this case at least):

y.shape
(2,2)
y.strides
(16, 4)

This way, computing the memory offset for y[i,j] will yield the correct result.
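
The sketch below (my own illustration, not NumPy's actual implementation) redoes that offset calculation by hand for a view with a non-zero starting offset, y = x[2:,2:]. It reads the data addresses from __array_interface__ and pulls the element back out of a byte copy of x's buffer; start, offset, raw and value are names invented here.

```python
import numpy as np

x = np.arange(16).reshape(4, 4)
y = x[2:, 2:]                      # basic slicing, so y is a view on x's buffer

# Byte distance from the start of x's data to the start of y's data.
x_addr = x.__array_interface__["data"][0]
y_addr = y.__array_interface__["data"][0]
start = y_addr - x_addr

# Locate y[i, j] by hand: starting offset plus one stride per index.
i, j = 1, 1
offset = start + i * y.strides[0] + j * y.strides[1]
raw = x.tobytes()                  # a copy of x's 16 integers as raw bytes
value = np.frombuffer(raw, dtype=x.dtype, count=1, offset=offset)[0]

print(y.strides == x.strides)      # True
print(value == y[i, j])            # True
```

Nothing here depends on the integer size, so it should behave the same whether the default int is 4 or 8 bytes.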

But what should NumPy do for something like z=x[[1,3]]? The strides mechanism won't allow correct indexing if the original buffer is used for z. NumPy theoretically could add some more sophisticated mechanism than the strides, but this would make element access relatively expensive, somehow defying the whole idea of an array. In addition, a view wouldn't be a really lightweight object anymore.
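
A quick way to see the consequence, sketched with np.shares_memory: the slice x[1::2] and the list index x[[1,3]] select the same rows, but only the slice is a view. The names view and z_copy are illustrative.

```python
import numpy as np

x = np.arange(16).reshape(4, 4)

view = x[1::2]      # rows 1 and 3 via basic slicing
z_copy = x[[1, 3]]  # the same rows via a list of indices

print(np.array_equal(view, z_copy))  # True: same values
print(np.shares_memory(x, view))     # True: the slice reuses x's buffer
print(np.shares_memory(x, z_copy))   # False: the list index made a copy

z_copy[0, 0] = -1                    # writing to the copy ...
print(x[1, 0])                       # 4: ... leaves x untouched
```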

This is covered in depth in the NumPy documentation on indexing.

Oh, and nearly forgot about your actual question: Here is how to make the indexing with multiple lists work as expected:

x[[[1],[3]],[1,3]]

This is because the index arrays are broadcast to a common shape. Of course, for this particular example, you can also make do with basic slicing:

x[1::2, 1::2]
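
To see what "broadcast to a common shape" means for the two index lists, one can expand them explicitly with np.broadcast_arrays. This is only a sketch for inspection; indexing does the broadcasting on its own, and the names r, c and sub are made up.

```python
import numpy as np

x = np.arange(16).reshape(4, 4)

rows = np.array([[1], [3]])  # shape (2, 1)
cols = np.array([1, 3])      # shape (2,)

# The common shape is (2, 2); each output position pairs one row with one column.
r, c = np.broadcast_arrays(rows, cols)
print(r)  # [[1 1]
          #  [3 3]]
print(c)  # [[1 3]
          #  [1 3]]

sub = x[rows, cols]
print(sub)                                # [[ 5  7]
                                          #  [13 15]]
print(np.array_equal(sub, x[1::2, 1::2])) # True
```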

#3


11  

I don't think x[[1,3]][:,[1,3]] is that hard to read. If you want to make your intent clearer, you can write:

x[[1,3],:][:,[1,3]]

I am not an expert in slicing, but typically, if you slice into an array and the selected indices are evenly spaced, you get back a view that differs only in its shape, offset and strides.

For example, in your inputs 33 and 34, although you get a 2x2 array, the row stride is still 4 elements, as in the original array. Thus, when you index the next row, the pointer moves to the correct position in memory.

Clearly, this mechanism doesn't carry over well to the case of an array of indices, so NumPy has to make a copy. After all, many other matrix math functions rely on size, strides and contiguous memory allocation.

#4


9  

If you want to skip every other row and every other column, then you can do it with basic slicing:

In [49]: x=np.arange(16).reshape((4,4))
In [50]: x[1:4:2,1:4:2]
Out[50]: 
array([[ 5,  7],
       [13, 15]])

This returns a view, not a copy of your array.

In [51]: y=x[1:4:2,1:4:2]

In [52]: y[0,0]=100

In [53]: x   # <---- Notice x[1,1] has changed
Out[53]: 
array([[  0,   1,   2,   3],
       [  4, 100,   6,   7],
       [  8,   9,  10,  11],
       [ 12,  13,  14,  15]])

In contrast, z=x[(1,3),:][:,(1,3)] uses advanced indexing and thus returns a copy:

In [58]: x=np.arange(16).reshape((4,4))
In [59]: z=x[(1,3),:][:,(1,3)]

In [60]: z
Out[60]: 
array([[ 5,  7],
       [13, 15]])

In [61]: z[0,0]=0

Note that x is unchanged:

In [62]: x
Out[62]: 
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

If you wish to select arbitrary rows and columns, then you can't use basic slicing. You'll have to use advanced indexing, using something like x[rows,:][:,columns], where rows and columns are sequences. This of course is going to give you a copy, not a view, of your original array. This is as one should expect, since a numpy array uses contiguous memory (with constant strides), and there would be no way to generate a view with arbitrary rows and columns (since that would require non-constant strides).
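
For the general case, a small helper along these lines may be convenient. This is a sketch: submatrix is a hypothetical name, and it uses np.ix_ (mentioned in other answers) instead of the two-step x[rows,:][:,columns] form; both return a copy. Since the question is phrased in terms of rows to remove, the last part derives the rows to keep with np.setdiff1d.

```python
import numpy as np

def submatrix(a, rows, cols):
    """Return a copy of `a` restricted to the given rows and columns (hypothetical helper)."""
    return a[np.ix_(rows, cols)]

x = np.arange(16).reshape(4, 4)

sub = submatrix(x, [1, 3], [1, 3])
print(sub)                                            # [[ 5  7]
                                                      #  [13 15]]
print(np.array_equal(sub, x[[1, 3], :][:, [1, 3]]))   # True

# Any order and repetition works, which no single set of strides could express.
print(submatrix(x, [3, 0], [2, 2, 1]))                # [[14 14 13]
                                                      #  [ 2  2  1]]

# Starting from the indices to remove instead of the ones to keep.
keep = np.setdiff1d(np.arange(x.shape[0]), [0, 2])    # array([1, 3])
print(np.array_equal(submatrix(x, keep, keep), sub))  # True
```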

#5


5  

With numpy, you can pass a slice for each component of the index - so, your x[0:2,0:2] example above works.

If you just want to evenly skip columns or rows, you can pass slices with three components (i.e. start, stop, step).

Again, for your example above:

>>> x[1:4:2, 1:4:2]
array([[ 5,  7],
       [13, 15]])

Which is basically: slice in the first dimension, with start at index 1, stop when index is equal or greater than 4, and add 2 to the index in each pass. The same for the second dimension. Again: this only works for constant steps.
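
The start/stop/step description can be checked with Python's built-in slice object, whose indices method resolves a slice against a given length. This is a sketch; s and picked are illustrative names.

```python
import numpy as np

x = np.arange(16).reshape(4, 4)

s = slice(1, 4, 2)                         # what 1:4:2 means inside the brackets
start, stop, step = s.indices(x.shape[0])  # resolved against a dimension of length 4
picked = list(range(start, stop, step))

print(picked)                                  # [1, 3]
print(np.array_equal(x[s, s], x[1:4:2, 1:4:2]))  # True
```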

The syntax you came up with does something quite different internally: what x[[1,3]][:,[1,3]] actually does is create a new array including only rows 1 and 3 from the original array (done with the x[[1,3]] part), and then re-slice that, creating a third array that includes only columns 1 and 3 of the previous array.
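
One practical consequence of those intermediate arrays, shown as a sketch: assigning through the chained form writes into the temporary copy and never reaches x, whereas a single indexing step such as np.ix_ (from the other answers) does. The names step1 and step2 are illustrative.

```python
import numpy as np

x = np.arange(16).reshape(4, 4)

step1 = x[[1, 3]]          # new 2x4 array holding rows 1 and 3
step2 = step1[:, [1, 3]]   # another new 2x2 array holding columns 1 and 3
print(step2)               # [[ 5  7]
                           #  [13 15]]

# Chained assignment: the -1 values land in a temporary copy of rows 1 and 3.
x[[1, 3]][:, [1, 3]] = -1
print(x[1, 1])             # 5, x is unchanged

# A single indexing step assigns into x itself.
x[np.ix_([1, 3], [1, 3])] = -1
print(x[1, 1], x[3, 3])    # -1 -1
```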

#6


2  

I have a similar question here: "Writting in sub-ndarray of a ndarray in the most pythonian way. Python 2".

Applying the solution from that post to your case gives:

columns_to_keep = [1,3] 
rows_to_keep = [1,3]

And using ix_:

x[np.ix_(rows_to_keep, columns_to_keep)] 

Which is:

array([[ 5,  7],
       [13, 15]])
