An element-wise "in" with NumPy: a Pythonic and efficient approach

Date: 2022-02-10 21:37:44

I'm looking for an efficient way to get an array of Booleans: given two arrays a and b of equal size, each element is True if the corresponding element of a appears in the corresponding element of b.

For example, the following program:

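import numpy
a = numpy.array([1, 2, 3, 4])
# dtype=object keeps the ragged rows of b as separate lists
b = numpy.array([[1, 2, 13], [2, 8, 9], [5, 6], [7]], dtype=object)
print(numpy.magic_function(a, b))  # magic_function is a placeholder, not a real NumPy function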

Should print

[True, True, False, False]

Keep in mind this function should be the equivalent of

[x in y for x, y in zip(a, b)]

Only optimized with NumPy for the case where a and b are large and each element of b is reasonably small.

3 Answers

#1


4  

To take advantage of NumPy's broadcasting rules you should first pad b into a rectangular array, which can be done with itertools.izip_longest (named itertools.zip_longest in Python 3):

from itertools import zip_longest  # named izip_longest on Python 2

import numpy as np

c = np.array(list(zip_longest(*b))).astype(float)  # missing entries (None) become nan

resulting in:

array([[  1.,   2.,   5.,   7.],
       [  2.,   8.,   6.,  nan],
       [ 13.,   9.,  nan,  nan]])

Then np.isclose(c, a) gives a 2D array of Booleans showing where each column c[:, i] matches a[i]. By the broadcasting rules a is compared against every row of c, and the nan padding never matches:

array([[ True,  True, False, False],
       [False, False, False, False],
       [False, False, False, False]], dtype=bool)

Reducing that array along its first axis with np.any gives your answer:

np.any(np.isclose(c, a), axis=0)
#array([ True,  True, False, False], dtype=bool)
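
For reference, the steps of this answer can be collected into a single function. This is only a minimal sketch, assuming Python 3 (where the itertools function is named zip_longest). The function name elementwise_in and the explicit fillvalue=np.nan are illustrative choices, not part of the original answer.

```python
from itertools import zip_longest

import numpy as np


def elementwise_in(a, b):
    """Return a Boolean array: out[i] is True when a[i] occurs in b[i]."""
    # Transpose the ragged rows of b into columns, padding short rows with nan.
    c = np.array(list(zip_longest(*b, fillvalue=np.nan)), dtype=float)
    # c has shape (longest_row, len(a)), so a broadcasts across its rows.
    # The nan padding never compares close to a real value.
    return np.isclose(c, a).any(axis=0)


a = np.array([1, 2, 3, 4])
b = [[1, 2, 13], [2, 8, 9], [5, 6], [7]]
print(elementwise_in(a, b))  # [ True  True False False]
```

Note that np.isclose compares within a tolerance. If exact matching is required, for instance with integer data, replacing it with c == a is stricter, and nan padding still never matches.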

#2


3  

Is there an upper limit to the length of the small lists in b? If so, maybe you could make b a matrix of say 1000x5, and use nan to fill the gaps for the sub-arrays that are too short. You can then use numpy.any to get the answer you want, something like this:

In [42]: a = np.array([1, 2, 3, 4])
    ...: b = np.array([[1, 2, 13], [2, 8, 9], [5, 6], [7]], dtype=object)

In [43]: bb = np.full((len(b), max(len(i) for i in b)), np.nan)

In [44]: for irow, row in enumerate(b):
    ...:     bb[irow, :len(row)] = row

In [45]: bb
Out[45]: 
array([[  1.,   2.,  13.],
       [  2.,   8.,   9.],
       [  5.,   6.,  nan],
       [  7.,  nan,  nan]])

In [46]: a[:,np.newaxis] == bb
Out[46]: 
array([[ True, False, False],
       [ True, False, False],
       [False, False, False],
       [False, False, False]], dtype=bool)

In [47]: np.any(a[:,np.newaxis] == bb, axis=1)
Out[47]: array([ True,  True, False, False], dtype=bool)

No idea if this is faster for your data.
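
If the per-row slice assignment turns out to be the bottleneck, the padding can be filled through a Boolean mask in one assignment. The sketch below is only one possible variation, assuming b is non-empty and holds numeric sequences. The name elementwise_in_padded is hypothetical, and whether it is actually faster would need to be measured on your data.

```python
import numpy as np


def elementwise_in_padded(a, b):
    """Return a Boolean array: out[i] is True when a[i] occurs in b[i]."""
    lengths = np.array([len(row) for row in b])
    # mask[i, j] is True where row i really has a j-th element.
    mask = np.arange(lengths.max()) < lengths[:, np.newaxis]
    padded = np.full(mask.shape, np.nan)
    # Fill every real slot at once; row-major mask order matches the
    # order of the concatenated rows.
    padded[mask] = np.concatenate([np.asarray(row, dtype=float) for row in b])
    # Compare each a[i] against its own padded row; nan never equals anything.
    return (padded == np.asarray(a)[:, np.newaxis]).any(axis=1)


a = np.array([1, 2, 3, 4])
b = [[1, 2, 13], [2, 8, 9], [5, 6], [7]]
print(elementwise_in_padded(a, b))  # [ True  True False False]
```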

#3


1  

Summary

Of the approaches posted so far, the one from Sauldo Castro runs fastest. The list comprehension in the original post is second fastest.

Code to generate test data:

import numpy
import random

alength = 100
a = numpy.array([random.randint(1, 6) for _ in range(alength)])
b = []
for i in range(alength):
    # Each element of b is a list of 1 to 5 random integers.
    length = random.randint(1, 5)
    element = [random.randint(1, 6) for _ in range(length)]
    b.append(element)
b = numpy.array(b, dtype=object)  # object dtype keeps the ragged rows as lists
print(a, b)

The options:

from itertools import zip_longest  # named izip_longest on Python 2

def magic_function1(a, b):  # From OP Martin Fixman
    return [x in y for x, y in zip(a, b)]

def magic_function2(a, b):  # What I thought might be better.
    bools = []
    for x, y in zip(a, b):
        found = False
        for j in y:
            if x == j:
                found = True
                break
        bools.append(found)
    return bools

def magic_function3(a, b):  # What I tried first
    bools = []
    for i in range(len(a)):
        found = False
        for j in range(len(b[i])):
            if a[i] == b[i][j]:
                found = True
                break
        bools.append(found)
    return bools

def magic_function4(a, b):  # From Bas Swinkels
    # Pad b with nan into a rectangular array, one row per element of a.
    bb = numpy.full((len(b), max(len(i) for i in b)), numpy.nan)
    for irow, row in enumerate(b):
        bb[irow, :len(row)] = row
    return numpy.any(a[:, numpy.newaxis] == bb, axis=1)

def magic_function5(a, b):  # From Sauldo Castro, revised version
    # Transpose b with nan padding, then compare every row against a.
    c = numpy.array(list(zip_longest(*b))).astype(float)
    return numpy.any(numpy.isclose(c, a), axis=0)

Timing each function over n_executions runs:

import timeit

n_executions = 100
clock = timeit.Timer(stmt="magic_function1(a, b)", setup="from __main__ import magic_function1, a, b")
print(clock.timeit(n_executions), "seconds")
# Repeat with each candidate function

The results:

  • 0.158078225475 seconds for magic_function1
  • 0.181080926835 seconds for magic_function2
  • 0.259621047822 seconds for magic_function3
  • 0.287054750224 seconds for magic_function4
  • 0.0839162196207 seconds for magic_function5
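
The "repeat with each candidate function" step can be automated by passing callables to timeit. The following is a rough sketch rather than the harness used for the numbers above: the helper time_candidates and the two shortened candidate names are hypothetical, it assumes Python 3, and absolute timings will differ by machine and NumPy version.

```python
import random
import timeit
from itertools import zip_longest

import numpy as np


def with_comprehension(a, b):
    return [x in y for x, y in zip(a, b)]


def with_zip_longest(a, b):
    c = np.array(list(zip_longest(*b, fillvalue=np.nan)), dtype=float)
    return np.isclose(c, a).any(axis=0)


def time_candidates(candidates, a, b, n_executions=100):
    """Return {function name: total seconds for n_executions calls}."""
    return {
        func.__name__: timeit.timeit(lambda: func(a, b), number=n_executions)
        for func in candidates
    }


random.seed(0)  # fixed seed so repeated runs time the same data
a = np.array([random.randint(1, 6) for _ in range(100)])
b = [[random.randint(1, 6) for _ in range(random.randint(1, 5))]
     for _ in range(100)]

timings = time_candidates([with_comprehension, with_zip_longest], a, b)
# Print the candidates from fastest to slowest.
for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{name}: {seconds:.4f} seconds")
```

Checking that all candidates return the same result before timing them is a cheap safeguard against benchmarking a function that silently returns the wrong thing.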
