
Time: 2022-12-28 12:06:26

Implicitly converting a Python sequence of variable-length lists into a NumPy array causes the array to be of type object.


v = [[1], [1, 2]]
np.array(v)
>>> array([[1], [1, 2]], dtype=object)

Trying to force another type will cause an exception:


np.array(v, dtype=np.int32)
ValueError: setting an array element with a sequence.

What is the most efficient way to get a dense NumPy array of type int32, by filling the "missing" values with a given placeholder?


From my sample sequence v, I would like to get something like this if 0 is the placeholder:


array([[1, 0], [1, 2]], dtype=int32)

5 solutions



You can use itertools.zip_longest:


import itertools
np.array(list(itertools.zip_longest(*v, fillvalue=0))).T
array([[1, 0],
       [1, 2]])

Note: For Python 2, it is itertools.izip_longest.
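To see why the final transpose is needed, here is a minimal sketch of the same idea for Python 3. The helper name pad_with_zip_longest and its fill and dtype parameters are assumptions made for illustration; they are not part of the answer above.

```python
import itertools

import numpy as np


def pad_with_zip_longest(rows, fill=0, dtype=np.int32):
    """Pad ragged rows to equal length (hypothetical helper for illustration)."""
    # zip_longest(*rows) yields one tuple per column position, so the
    # intermediate array has shape (max_len, n_rows); `fill` stands in for
    # the entries that shorter rows do not have.
    columns = list(itertools.zip_longest(*rows, fillvalue=fill))
    # Transpose back to (n_rows, max_len) and use the requested dtype.
    return np.array(columns, dtype=dtype).T


v = [[1], [1, 2]]
padded = pad_with_zip_longest(v)
print(padded)
```

Note that .T returns a view that is not C-contiguous; if downstream code needs a contiguous buffer, wrapping the result in np.ascontiguousarray is one option.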




Here's an almost* vectorized approach based on boolean indexing that I have used in several other posts:


def boolean_indexing(v):
    lens = np.array([len(item) for item in v])
    mask = lens[:,None] > np.arange(lens.max())
    out = np.zeros(mask.shape,dtype=int)
    out[mask] = np.concatenate(v)
    return out

Sample run


In [27]: v
Out[27]: [[1], [1, 2], [3, 6, 7, 8, 9], [4]]

In [28]: boolean_indexing(v)
Out[28]:
array([[1, 0, 0, 0, 0],
       [1, 2, 0, 0, 0],
       [3, 6, 7, 8, 9],
       [4, 0, 0, 0, 0]])

*Please note that this is described as almost vectorized because the only looping performed here is at the start, where we get the lengths of the list elements. That part is not computationally demanding, so it should have minimal effect on the total runtime.


Runtime test


In this section I time the DataFrame-based solution by @Alberto Garcia-Raboso and the itertools-based solution by @ayhan, as they seem to scale well, along with the boolean-indexing based one from this post. The tests use a relatively large dataset with three levels of size variation across the list elements.


Case #1 : Larger size variation


In [44]: v = [[1], [1,2,4,8,4],[6,7,3,6,7,8,9,3,6,4,8,3,2,4,5,6,6,8,7,9,3,6,4]]

In [45]: v = v*1000

In [46]: %timeit pd.DataFrame(v).fillna(0).values.astype(np.int32)
100 loops, best of 3: 9.82 ms per loop

In [47]: %timeit np.array(list(itertools.izip_longest(*v, fillvalue=0))).T
100 loops, best of 3: 5.11 ms per loop

In [48]: %timeit boolean_indexing(v)
100 loops, best of 3: 6.88 ms per loop

Case #2 : Lesser size variation


In [49]: v = [[1], [1,2,4,8,4],[6,7,3,6,7,8]]

In [50]: v = v*1000

In [51]: %timeit pd.DataFrame(v).fillna(0).values.astype(np.int32)
100 loops, best of 3: 3.12 ms per loop

In [52]: %timeit np.array(list(itertools.izip_longest(*v, fillvalue=0))).T
1000 loops, best of 3: 1.55 ms per loop

In [53]: %timeit boolean_indexing(v)
100 loops, best of 3: 5 ms per loop

Case #3 : Larger number of elements (100 max) per list element


In [139]: # Setup inputs
     ...: N = 10000 # Number of elems in list
     ...: maxn = 100 # Max. size of a list element
     ...: lens = np.random.randint(0,maxn,(N))
     ...: v = [list(np.random.randint(0,9,(L))) for L in lens]

In [140]: %timeit pd.DataFrame(v).fillna(0).values.astype(np.int32)
1 loops, best of 3: 292 ms per loop

In [141]: %timeit np.array(list(itertools.izip_longest(*v, fillvalue=0))).T
1 loops, best of 3: 264 ms per loop

In [142]: %timeit boolean_indexing(v)
10 loops, best of 3: 95.7 ms per loop

To me, it seems itertools.izip_longest is doing pretty well! There is no clear winner, though; the choice has to be made on a case-by-case basis.
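The timings above were taken in a Python 2 session (hence izip_longest), so they are worth re-measuring on your own setup. What follows is a rough sketch of how that could be done on Python 3 with the standard timeit module; the wrapper names via_zip_longest and via_boolean_indexing are assumptions for illustration, and no particular result is implied.

```python
import itertools
import timeit

import numpy as np


def via_zip_longest(v):
    # itertools-based approach (Python 3 spelling of izip_longest).
    return np.array(list(itertools.zip_longest(*v, fillvalue=0))).T


def via_boolean_indexing(v):
    # Boolean-indexing approach from this post.
    lens = np.array([len(item) for item in v])
    mask = lens[:, None] > np.arange(lens.max())
    out = np.zeros(mask.shape, dtype=int)
    out[mask] = np.concatenate(v)
    return out


# Same input shape as Case #2 above.
v = [[1], [1, 2, 4, 8, 4], [6, 7, 3, 6, 7, 8]] * 1000

# Check that both approaches agree before comparing their speed.
same = np.array_equal(via_zip_longest(v), via_boolean_indexing(v))

timings = {}
for func in (via_zip_longest, via_boolean_indexing):
    # Best of 3 repeats of 10 calls each, reported per call.
    best = min(timeit.repeat(lambda: func(v), number=10, repeat=3)) / 10
    timings[func.__name__] = best
    print(f"{func.__name__}: {best * 1e3:.2f} ms per call")
```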




Pandas and its DataFrames deal beautifully with missing data.


import numpy as np
import pandas as pd

v = [[1], [1, 2]]
pd.DataFrame(v).fillna(0).values.astype(np.int32)

# array([[1, 0],
#        [1, 2]], dtype=int32)



max_len = max(len(sub_list) for sub_list in v)

result = np.array([sub_list + [0] * (max_len - len(sub_list)) for sub_list in v])

>>> result
array([[1, 0],
       [1, 2]])

>>> type(result)
<class 'numpy.ndarray'>
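The list-concatenation version above builds a temporary padded list for every row. As an alternative sketch, the rows can be copied into a preallocated array instead, which also lets the placeholder and dtype be chosen up front. The helper name pad_rows and its parameters are hypothetical and not part of the answer.

```python
import numpy as np


def pad_rows(rows, fill=0, dtype=np.int32):
    """Copy ragged rows into a preallocated array (hypothetical helper)."""
    # Longest row decides the number of columns; default=0 covers an empty input.
    max_len = max((len(row) for row in rows), default=0)
    # Start with an array that holds only the placeholder value.
    out = np.full((len(rows), max_len), fill, dtype=dtype)
    for i, row in enumerate(rows):
        # Overwrite the leading slots of row i with the real values.
        out[i, :len(row)] = row
    return out


result = pad_rows([[1], [1, 2]])
print(result)
print(type(result), result.dtype)
```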



Here is a general way:


>>> v = [[1], [2, 3, 4], [5, 6], [7, 8, 9, 10], [11, 12]]
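>>> # np.argmax(v) returned 4 here only by coincidence; use the longest row's length
>>> max_len = max(len(i) for i in v)
>>> # pad each row directly (replaces the np.insert call, which needs an object array of ragged lists)
>>> np.hstack([i + [0]*(max_len-len(i)) for i in v]).astype('int32').reshape(len(v), max_len)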
array([[ 1,  0,  0,  0],
       [ 2,  3,  4,  0],
       [ 5,  6,  0,  0],
       [ 7,  8,  9, 10],
       [11, 12,  0,  0]], dtype=int32)


