Converting a Python sequence to a NumPy array, filling in missing values

Date: 2022-12-28 12:06:26

The implicit conversion of a Python sequence of variable-length lists into a NumPy array causes the array to be of type object.

v = [[1], [1, 2]]
np.array(v)
>>> array([[1], [1, 2]], dtype=object)

Trying to force another type will cause an exception:

np.array(v, dtype=np.int32)
ValueError: setting an array element with a sequence.

What is the most efficient way to get a dense NumPy array of type int32, by filling the "missing" values with a given placeholder?

From my sample sequence v, I would like to get something like this, with 0 as the placeholder:

array([[1, 0], [1, 2]], dtype=int32)

5 Answers

#1


13  

You can use itertools.zip_longest:

import itertools
np.array(list(itertools.zip_longest(*v, fillvalue=0))).T
Out: 
array([[1, 0],
       [1, 2]])
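The transpose is needed because zip_longest(*v) pairs up the i-th elements of every sublist, so it produces the columns of the result rather than the rows. To also answer the dtype part of the question, both the placeholder and the target dtype can be passed explicitly; a minimal sketch of the same idea:

```python
import itertools

import numpy as np

v = [[1], [1, 2]]

# zip_longest(*v) yields one tuple per column, padding the shorter
# sublists with fillvalue; .T flips the result back into row order.
out = np.array(list(itertools.zip_longest(*v, fillvalue=0)), dtype=np.int32).T
print(out)        # [[1 0]
                  #  [1 2]]
print(out.dtype)  # int32
```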

Note: For Python 2, it is itertools.izip_longest.

#2


11  

Here's an almost* vectorized boolean-indexing based approach that I have used in several other posts -

def boolean_indexing(v):
    lens = np.array([len(item) for item in v])
    mask = lens[:,None] > np.arange(lens.max())
    out = np.zeros(mask.shape,dtype=int)
    out[mask] = np.concatenate(v)
    return out

Sample run

In [27]: v
Out[27]: [[1], [1, 2], [3, 6, 7, 8, 9], [4]]

In [28]: out
Out[28]: 
array([[1, 0, 0, 0, 0],
       [1, 2, 0, 0, 0],
       [3, 6, 7, 8, 9],
       [4, 0, 0, 0, 0]])

*Please note that this is termed almost vectorized because the only explicit looping performed here is at the start, where we get the lengths of the list elements. Since that part is not computationally demanding, it should have minimal effect on the total runtime.

Runtime test

运行时测试

In this section I am timing the DataFrame-based solution by @Alberto Garcia-Raboso and the itertools-based solution by @ayhan, as they seem to scale well, against the boolean-indexing based one from this post, on a relatively larger dataset with three levels of size variation across the list elements.

Case #1 : Larger size variation

In [44]: v = [[1], [1,2,4,8,4],[6,7,3,6,7,8,9,3,6,4,8,3,2,4,5,6,6,8,7,9,3,6,4]]

In [45]: v = v*1000

In [46]: %timeit pd.DataFrame(v).fillna(0).values.astype(np.int32)
100 loops, best of 3: 9.82 ms per loop

In [47]: %timeit np.array(list(itertools.izip_longest(*v, fillvalue=0))).T
100 loops, best of 3: 5.11 ms per loop

In [48]: %timeit boolean_indexing(v)
100 loops, best of 3: 6.88 ms per loop

Case #2 : Lesser size variation

In [49]: v = [[1], [1,2,4,8,4],[6,7,3,6,7,8]]

In [50]: v = v*1000

In [51]: %timeit pd.DataFrame(v).fillna(0).values.astype(np.int32)
100 loops, best of 3: 3.12 ms per loop

In [52]: %timeit np.array(list(itertools.izip_longest(*v, fillvalue=0))).T
1000 loops, best of 3: 1.55 ms per loop

In [53]: %timeit boolean_indexing(v)
100 loops, best of 3: 5 ms per loop

Case #3 : Larger number of elements (100 max) per list element

In [139]: # Setup inputs
     ...: N = 10000 # Number of elems in list
     ...: maxn = 100 # Max. size of a list element
     ...: lens = np.random.randint(0,maxn,(N))
     ...: v = [list(np.random.randint(0,9,(L))) for L in lens]
     ...: 

In [140]: %timeit pd.DataFrame(v).fillna(0).values.astype(np.int32)
1 loops, best of 3: 292 ms per loop

In [141]: %timeit np.array(list(itertools.izip_longest(*v, fillvalue=0))).T
1 loops, best of 3: 264 ms per loop

In [142]: %timeit boolean_indexing(v)
10 loops, best of 3: 95.7 ms per loop

To me, it seems itertools.izip_longest is doing pretty well! There's no clear winner; the choice would have to be made on a case-by-case basis.
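Outside IPython, the same comparison can be sketched with the standard timeit module. The boolean_indexing function is repeated here so the snippet is self-contained, and absolute numbers will of course vary by machine:

```python
import itertools
import timeit

import numpy as np

def boolean_indexing(v):
    # Same approach as above: build a mask of valid positions,
    # then fill it from the concatenated values.
    lens = np.array([len(item) for item in v])
    mask = lens[:, None] > np.arange(lens.max())
    out = np.zeros(mask.shape, dtype=int)
    out[mask] = np.concatenate(v)
    return out

def zip_based(v):
    # Python 3 spelling of the izip_longest one-liner.
    return np.array(list(itertools.zip_longest(*v, fillvalue=0))).T

v = [[1], [1, 2, 4, 8, 4], [6, 7, 3, 6, 7, 8]] * 1000

# Both methods should agree before their speed is compared.
assert np.array_equal(boolean_indexing(v), zip_based(v))

for fn in (boolean_indexing, zip_based):
    t = timeit.timeit(lambda: fn(v), number=10)
    print(f"{fn.__name__}: {t / 10 * 1e3:.2f} ms per call")
```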

#3


11  

Pandas and its DataFrames deal beautifully with missing data.

import numpy as np
import pandas as pd

v = [[1], [1, 2]]
print(pd.DataFrame(v).fillna(0).values.astype(np.int32))

# array([[1, 0],
#        [1, 2]], dtype=int32)
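The same one-liner generalizes to any placeholder, since fillna accepts an arbitrary fill value; a small sketch using -1 instead of 0:

```python
import numpy as np
import pandas as pd

v = [[1], [1, 2]]

# DataFrame aligns the ragged rows and pads the short ones with NaN;
# fillna swaps in the placeholder before casting down to int32.
out = pd.DataFrame(v).fillna(-1).values.astype(np.int32)
print(out)  # [[ 1 -1]
            #  [ 1  2]]
```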

#4


2  

max_len = max(len(sub_list) for sub_list in v)

result = np.array([sub_list + [0] * (max_len - len(sub_list)) for sub_list in v])

>>> result
array([[1, 0],
       [1, 2]])

>>> type(result)
numpy.ndarray
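This list-comprehension approach is easy to wrap into a small helper with a configurable placeholder and dtype (pad_lists is a name introduced here for illustration, not from the answer):

```python
import numpy as np

def pad_lists(v, placeholder=0, dtype=np.int32):
    # Hypothetical helper: pad every sublist to the length of the
    # longest one, then stack into a dense array.
    max_len = max(len(sub) for sub in v)
    return np.array(
        [sub + [placeholder] * (max_len - len(sub)) for sub in v],
        dtype=dtype,
    )

print(pad_lists([[1], [1, 2]]))  # [[1 0]
                                 #  [1 2]]
```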

#5


2  

Here is a general way:

>>> v = [[1], [2, 3, 4], [5, 6], [7, 8, 9, 10], [11, 12]]
>>> max_len = max(len(i) for i in v)
>>> np.hstack([i + [0] * (max_len - len(i)) for i in v]).astype('int32').reshape(len(v), max_len)
array([[ 1,  0,  0,  0],
       [ 2,  3,  4,  0],
       [ 5,  6,  0,  0],
       [ 7,  8,  9, 10],
       [11, 12,  0,  0]], dtype=int32)
