NumPy: for every element in one array, find the index in another array

Time: 2021-05-29 12:33:26

I have two 1D arrays, x & y, one smaller than the other. I'm trying to find the index of every element of y in x.

I've found two naive ways to do this, the first is slow, and the second memory-intensive.

The slow way

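import numpy as np

indices = []
for iy in y:
    # first position in x whose value equals iy (IndexError if iy is not in x)
    indices.append(np.where(x == iy)[0][0])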

The memory hog

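# Both arrays have shape (len(y), len(x)): rows follow y, columns follow x
xe = np.outer([1,]*len(y), x)
ye = np.outer(y, [1,]*len(x))
# np.where returns (row, column) pairs; the column is the position in x
junk, indices = np.where(np.equal(xe, ye))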

Is there a faster or less memory-intensive approach? Ideally the search would take advantage of the fact that we are searching for not one thing in a list but many things, which makes it somewhat more amenable to parallelization. Bonus points if you don't assume that every element of y is actually in x.

6 answers

#1


18  

As Joe Kington said, searchsorted() can look up elements very quickly. To deal with elements that are not in x, you can compare the values found at the returned positions against the original y and create a masked array:

import numpy as np
x = np.array([3,5,7,1,9,8,6,6])
y = np.array([2,1,5,10,100,6])

index = np.argsort(x)
sorted_x = x[index]
sorted_index = np.searchsorted(sorted_x, y)

yindex = np.take(index, sorted_index, mode="clip")
mask = x[yindex] != y

result = np.ma.array(yindex, mask=mask)
print(result)

The result is:

[-- 3 1 -- -- 6]
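
If a plain integer array is easier to work with than a masked array, the same sort-then-search idea can be wrapped in a small helper that marks missing elements with a sentinel. This is only a sketch: the function name `find_indices` and the `missing=-1` default are made up for illustration and are not part of NumPy.

```python
import numpy as np

def find_indices(x, y, missing=-1):
    """For each element of y, return the index of a match in x, or `missing`."""
    x = np.asarray(x)
    y = np.asarray(y)
    order = np.argsort(x, kind="stable")   # positions that sort x
    pos = np.searchsorted(x[order], y)     # insertion points into sorted x
    pos = np.clip(pos, 0, len(x) - 1)      # len(x) would be out of bounds
    candidates = order[pos]                # map back to positions in x
    # keep a candidate only if the value found there really equals y
    return np.where(x[candidates] == y, candidates, missing)

x = np.array([3, 5, 7, 1, 9, 8, 6, 6])
y = np.array([2, 1, 5, 10, 100, 6])
result = find_indices(x, y)
print(result)  # [-1  3  1 -1 -1  6]
```

With a stable sort and the default left-sided search, a value that occurs several times in x (here 6) resolves to its first occurrence. The sketch assumes x is non-empty.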

#2


20  

I want to suggest a one-line solution:

indices = np.where(np.in1d(x, y))[0]

The result is an array of indices into x, marking the elements of x that also occur in y. The indices come back in the order of x, not in the order of y.

If needed, np.in1d can be used without numpy.where, in which case it gives a boolean mask over x instead of indices.
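
To make the difference from a per-element lookup concrete, here is a small sketch using the example arrays from answer #1. It uses `np.isin`, which behaves like `np.in1d` for 1-D input; that substitution, and the variable names, are choices made for this illustration.

```python
import numpy as np

x = np.array([3, 5, 7, 1, 9, 8, 6, 6])
y = np.array([2, 1, 5, 10, 100, 6])

in_y = np.isin(x, y)              # mask over x: True where x[i] occurs in y
x_positions = np.where(in_y)[0]   # positions in x, in x's own order
print(x_positions)                # [1 3 6 7]

found = np.isin(y, x)             # mask over y: which elements of y occur in x
print(found)                      # [False  True  True False False  True]
```

The output has one entry per matching element of x (both 6s are reported), so it does not line up with y element by element. Swapping the arguments answers the "is every element of y in x?" part of the question.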

#3


12  

How about this?

It does assume that every element of y is in x (and will return results even for elements that aren't!), but it is much faster.

import numpy as np

# Generate some example data...
x = np.arange(1000)
np.random.shuffle(x)
y = np.arange(100)

# Actually perform the operation...
xsorted = np.argsort(x)
ypos = np.searchsorted(x[xsorted], y)
indices = xsorted[ypos]
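
The caveat about missing elements can be seen on a tiny example. The sketch below uses made-up data, not the arrays above: a missing value that falls inside the range of x silently yields the index of a neighbour, and a value larger than everything in x makes the final indexing step fail.

```python
import numpy as np

x = np.array([30, 10, 20])
xsorted = np.argsort(x)                  # [1, 2, 0]; sorted x is [10, 20, 30]

y = np.array([10, 15, 20])               # 15 is not in x
ypos = np.searchsorted(x[xsorted], y)    # [0, 1, 1]
indices = xsorted[ypos]                  # [1, 2, 2]; the entry for 15 is bogus
valid = x[indices] == y                  # [True, False, True] exposes it

# A value above max(x) gets insertion point len(x), which is out of bounds.
try:
    xsorted[np.searchsorted(x[xsorted], np.array([99]))]
    raised = False
except IndexError:
    raised = True
print(indices, valid, raised)
```

Comparing `x[indices]` against `y` afterwards, as answer #1 does, is a cheap way to detect the bogus entries.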

#4


3  

I would just do this:

indices = np.where(y[:, None] == x[None, :])[1]

Unlike your memory-hog approach, this uses broadcasting to generate the 2D boolean array directly, without first creating 2D copies of both x and y.
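
The broadcast comparison still allocates one boolean per (y, x) pair, and `np.where(...)[1]` lists every match while skipping elements of y that have none. If one index per element of y is wanted, the same boolean matrix can be reduced row by row. The following is a sketch under that assumption, with `-1` as an arbitrary marker for "not found":

```python
import numpy as np

x = np.array([3, 5, 7, 1, 9, 8, 6, 6])
y = np.array([2, 1, 5, 10, 100, 6])

matches = y[:, None] == x[None, :]     # shape (len(y), len(x))
all_hits = np.where(matches)[1]        # every matching x position
print(all_hits)                        # [3 1 6 7]

found = matches.any(axis=1)            # does this element of y occur in x?
first = matches.argmax(axis=1)         # first True in each row (0 if none)
indices = np.where(found, first, -1)   # one entry per element of y
print(indices)                         # [-1  3  1 -1 -1  6]
```

Because the intermediate matrix grows as len(x) * len(y), this is mainly attractive when both arrays are small.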

#5


1  

The numpy_indexed package (disclaimer: I am its author) contains a function that does exactly this:

import numpy_indexed as npi
indices = npi.indices(x, y, missing='mask')

It will currently raise a KeyError if not all elements in y are present in x; but perhaps I should add a kwarg so that one can elect to mark such items with a -1 or something.

It should have the same efficiency as the currently accepted answer, since the implementation is along similar lines. numpy_indexed is, however, more flexible; for instance, it also allows searching for the indices of rows of multidimensional arrays.

EDIT: I've changed the handling of missing values; the 'missing' kwarg can now be set to 'raise', 'ignore' or 'mask'. In the last case you get a masked array of the same length as y, on which you can call .compressed() to get the valid indices. Note that there is also npi.contains(x, y) if this is all you need to know.

#6


0  

A more direct solution that doesn't expect the array to be sorted.

import numpy as np
import pandas as pd
A = pd.Series(['amsterdam', 'delhi', 'chromepet', 'tokyo', 'others'])
B = pd.Series(['chromepet', 'tokyo', 'tokyo', 'delhi', 'others'])

# Find index position of B's items in A
B.map(lambda x: np.where(A==x)[0][0]).tolist()

The result is:

[2, 3, 3, 1, 4]
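
The `map` with a lambda rescans A once for every item of B. As a possible alternative (not what this answer does), pandas' `Index.get_indexer` performs the lookup in one call and returns -1 for labels that are absent. The sketch below assumes the values in A are unique, which `get_indexer` requires, and swaps the last item of B for a hypothetical missing value, 'paris', to show the sentinel.

```python
import pandas as pd

A = pd.Series(['amsterdam', 'delhi', 'chromepet', 'tokyo', 'others'])
B = pd.Series(['chromepet', 'tokyo', 'tokyo', 'delhi', 'paris'])

# Treat A's values as index labels and look up each item of B among them.
positions = pd.Index(A).get_indexer(B)
print(positions.tolist())  # [2, 3, 3, 1, -1]
```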
