How to initialize a NumPy structured array with different default values for each column?

Time: 2021-09-16 12:54:07

I'm trying to initialize a NumPy structured matrix of size (x,y), where x is ~10^3 and y is ~10^6.

The first column of the matrix is an ID (integer), and the rest are triplets (int8), where each member of the triplet should have a different default value.

i.e. assuming the default values are [2,5,9] I'd like to initialize the following matrix:

0 2 5 9 2 5 9 2 5 9 ...
0 2 5 9 2 5 9 2 5 9 ...
0 2 5 9 2 5 9 2 5 9 ...
0 2 5 9 2 5 9 2 5 9 ...
...

The problem here (vs. this similar question) is that each column has a different, unique name that must be recorded.

The fastest way I could think of to initialize the matrix is:

import numpy

default_age       = 2
default_height    = 5
default_shoe_size = 9

columns = ["id",
           "a_age",
           "a_height",
           "a_shoe_size",
           "b_age",
           "b_height",
           "b_shoe_size",
           #...
           ]

y = len(columns)
x = 10**4

# generate matrix
mat = numpy.zeros(shape=x,
                  dtype={"names"   : columns,
                         "formats" : ['i'] + ['int8'] * (len(columns) - 1)})
# fill the triplets with default values (field 0 is the id, so skip it)
for i in range((y - 1) // 3):
    j = i * 3
    mat[mat.dtype.names[j+1]] = default_age
    mat[mat.dtype.names[j+2]] = default_height
    mat[mat.dtype.names[j+3]] = default_shoe_size

What is the fastest way to initialize such a matrix?

Thanks!

4 solutions

#1


This is my tweak of your sample, adjusted so that it runs. Note that I iterate over the columns by field name:

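import numpy as np

# columns and the default_* values are the ones defined in the question
dt = np.dtype({"names": columns, "formats": ['i'] + ['int8'] * (len(columns) - 1)})
mat = np.zeros((10,), dtype=dt)
# step through the triplets: fields 1-3, then fields 4-6
for i in range(1, 7, 3):
    mat[dt.names[i]] = default_age
    mat[dt.names[i+1]] = default_height
    mat[dt.names[i+2]] = default_shoe_size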

producing

array([(0, 2, 5, 9, 2, 5, 9), (0, 2, 5, 9, 2, 5, 9), (0, 2, 5, 9, 2, 5, 9),
       (0, 2, 5, 9, 2, 5, 9), (0, 2, 5, 9, 2, 5, 9), (0, 2, 5, 9, 2, 5, 9),
       (0, 2, 5, 9, 2, 5, 9), (0, 2, 5, 9, 2, 5, 9), (0, 2, 5, 9, 2, 5, 9),
       (0, 2, 5, 9, 2, 5, 9)], 
      dtype=[('id', '<i4'), ('a_age', 'i1'), ('a_height', 'i1'), ('a_shoe_size', 'i1'), ('b_age', 'i1'), ('b_height', 'i1'), ('b_shoe_size', 'i1')])

As long as the number of field names is substantially smaller than the number of rows, I think this will be as fast as, or faster than, any other way.
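
For comparison, here is a minimal sketch of one alternative: build a single template record and broadcast it into every row. The names defaults, n_triplets and template are made up for illustration, and n_rows is kept small. Whether this beats the per-field loop probably depends on the shape, so it is worth timing on real sizes.

```python
import numpy as np

columns = ["id", "a_age", "a_height", "a_shoe_size",
           "b_age", "b_height", "b_shoe_size"]
defaults = (2, 5, 9)  # age, height, shoe_size (illustrative name)
n_rows = 10           # small on purpose; the question uses far more rows

dt = np.dtype({"names": columns,
               "formats": ["i"] + ["int8"] * (len(columns) - 1)})

# One record: id 0 followed by the defaults repeated once per triplet.
n_triplets = (len(columns) - 1) // 3
template = np.array((0,) + defaults * n_triplets, dtype=dt)

# Broadcast the single 0-d record into every row of the array.
mat = np.empty(n_rows, dtype=dt)
mat[...] = template
```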

In my sample x=(10,). Your mat[:,j+1] expression has not been corrected to handle a structured 1d array, which is indexed by field name rather than by column number.

A structured array is probably not the best way to go if you have very many columns (fields) (compared to the number of rows).

If all of your fields are 'int', I'd use a regular 2d array. Structured arrays are most useful when fields have differing types of elements.

Here's a way of initializing a regular 2d array with these values, and optionally casting it to a structured array:

values=np.array([2,5,9])
x, y = 10, 2
mat1=np.repeat(np.repeat(values[None,:],y,0).reshape(1,3*y),x,0)

producing:

array([[2, 5, 9, 2, 5, 9],
       [2, 5, 9, 2, 5, 9],
       ...,
       [2, 5, 9, 2, 5, 9]])
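
As a side note, the nested np.repeat call can probably be written more directly with np.tile. A small sketch that checks the two give the same array (mat1_repeat and mat1_tile are names chosen here for illustration):

```python
import numpy as np

values = np.array([2, 5, 9])
x, y = 10, 2  # x rows, y triplets per row

# The expression used above.
mat1_repeat = np.repeat(np.repeat(values[None, :], y, 0).reshape(1, 3 * y), x, 0)

# np.tile repeats the 3-element pattern y times across and x times down.
mat1_tile = np.tile(values, (x, y))

print(np.array_equal(mat1_repeat, mat1_tile))
```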

Add on the id column:

mat1=np.concatenate([np.zeros((x,1),int),mat1],1)
array([[0, 2, 5, 9, 2, 5, 9],
       [0, 2, 5, 9, 2, 5, 9],
       ...
       [0, 2, 5, 9, 2, 5, 9],
       [0, 2, 5, 9, 2, 5, 9]])

A new dtype - with all plain 'int':

dt1=np.dtype({"names"   : columns, "formats" : ['i'] + ['int'] * (len(columns) - 1)})
mat2=np.empty((x,),dtype=dt1)

If done right, the data for mat1 should be the same size and byte order as for mat2, in which case I can 'copy' it (actually just change pointers). This only holds when 'i' and 'int' have the same width as the elements of mat1, as they did on my platform.

mat2.data=mat1.data
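
Assigning to .data is, I believe, deprecated in newer NumPy releases. A sketch of the same idea using view instead, under the assumption that every field has exactly the element type of the 2d array (int32 here, pinned explicitly so the widths match on any platform; dt_same is a name made up for this example):

```python
import numpy as np

columns = ["id", "a_age", "a_height", "a_shoe_size",
           "b_age", "b_height", "b_shoe_size"]
x = 10

# A plain C-contiguous 2d array whose element type matches every field.
mat1 = np.concatenate(
    [np.zeros((x, 1), np.int32),
     np.tile(np.array([2, 5, 9], np.int32), (x, 2))],
    axis=1)

dt_same = np.dtype({"names": columns, "formats": ["i4"] * len(columns)})

# Each 7-element row is reinterpreted as one record; no data is copied.
# view() gives shape (x, 1), so reshape to a 1d structured array.
mat2 = mat1.view(dt_same).reshape(x)
```

Because mat2 shares memory with mat1, writing to one changes the other.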

mat2 looks just like the earlier mat, except the dtype is a little different (with i4 instead of i1 fields)

array([(0, 2, 5, 9, 2, 5, 9), (0, 2, 5, 9, 2, 5, 9), (0, 2, 5, 9, 2, 5, 9),
       (0, 2, 5, 9, 2, 5, 9), (0, 2, 5, 9, 2, 5, 9), (0, 2, 5, 9, 2, 5, 9),
       (0, 2, 5, 9, 2, 5, 9), (0, 2, 5, 9, 2, 5, 9), (0, 2, 5, 9, 2, 5, 9),
       (0, 2, 5, 9, 2, 5, 9)], 
      dtype=[('id', '<i4'), ('a_age', '<i4'), ('a_height', '<i4'), ('a_shoe_size', '<i4'), ('b_age', '<i4'), ('b_height', '<i4'), ('b_shoe_size', '<i4')])

Another way to use mat1 values to initialize a structured array is with an intermediary list of tuples:

np.array([tuple(row) for row in mat1],dtype=dt)
array([(0, 2, 5, 9, 2, 5, 9), (0, 2, 5, 9, 2, 5, 9), (0, 2, 5, 9, 2, 5, 9),
       (0, 2, 5, 9, 2, 5, 9), (0, 2, 5, 9, 2, 5, 9), (0, 2, 5, 9, 2, 5, 9),
       (0, 2, 5, 9, 2, 5, 9), (0, 2, 5, 9, 2, 5, 9), (0, 2, 5, 9, 2, 5, 9),
       (0, 2, 5, 9, 2, 5, 9)], 
      dtype=[('id', '<i4'), ('a_age', 'i1'), ('a_height', 'i1'), ('a_shoe_size', 'i1'), ('b_age', 'i1'), ('b_height', 'i1'), ('b_shoe_size', 'i1')])
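
If your NumPy is recent enough (1.16 or later, as far as I know), numpy.lib.recfunctions.unstructured_to_structured should do the same conversion without the Python-level list of tuples. A sketch, with mat3 as an illustrative name:

```python
import numpy as np
from numpy.lib import recfunctions as rfn

columns = ["id", "a_age", "a_height", "a_shoe_size",
           "b_age", "b_height", "b_shoe_size"]
dt = np.dtype({"names": columns,
               "formats": ["i"] + ["int8"] * (len(columns) - 1)})
x = 10

# Plain 2d integer array: an id column of zeros followed by two triplets.
mat1 = np.concatenate([np.zeros((x, 1), int),
                       np.tile([2, 5, 9], (x, 2))], axis=1)

# The last axis of mat1 is mapped onto the fields of dt, casting as needed.
mat3 = rfn.unstructured_to_structured(mat1, dtype=dt)
```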

I haven't run time tests, in part because I don't have an idea of what your x,y values are like.
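
If you want to time it yourself, something along these lines might be a starting point. It is only a sketch: make_dtype, fill_by_field and fill_by_tuples are hypothetical helpers, the p0_age style field names are invented, and the sizes are far smaller than yours.

```python
import timeit
import numpy as np

DEFAULTS = (2, 5, 9)  # age, height, shoe_size

def make_dtype(n_triplets):
    # Invented field names: p0_age, p0_height, p0_shoe_size, p1_age, ...
    names = ["id"] + ["p%d_%s" % (k, s)
                      for k in range(n_triplets)
                      for s in ("age", "height", "shoe_size")]
    return np.dtype({"names": names,
                     "formats": ["i"] + ["int8"] * (3 * n_triplets)})

def fill_by_field(n_rows, dt):
    # One assignment per field, as in the loop at the top of this answer.
    mat = np.zeros(n_rows, dtype=dt)
    for i, name in enumerate(dt.names[1:]):
        mat[name] = DEFAULTS[i % 3]
    return mat

def fill_by_tuples(n_rows, dt):
    # Intermediary list of tuples, one tuple per row.
    row = (0,) + DEFAULTS * ((len(dt.names) - 1) // 3)
    return np.array([row] * n_rows, dtype=dt)

dt_big = make_dtype(n_triplets=50)
t_field = timeit.timeit(lambda: fill_by_field(500, dt_big), number=3)
t_tuples = timeit.timeit(lambda: fill_by_tuples(500, dt_big), number=3)
print("by field:  %.4f s" % t_field)
print("by tuples: %.4f s" % t_tuples)
```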

See also "Convert structured array with various numeric data types to regular array".

Alternatively, following the answer in https://*.com/a/21818731/901925, the np.ndarray constructor can be used to create a new array from a preexisting data buffer. It still needs to use dt1, the all-'int' dtype.

np.ndarray((x,), dt1, mat1)

See also "ndarray to structured_array and float to int", which has more on using view vs. astype for this conversion.

#2


You can build up an array using the usual tile and column_stack provided by numpy, then use np.core.records.fromarrays (also exposed as np.rec.fromarrays):

import numpy as np

default_age       = 2
default_height    = 5
default_shoe_size = 9
n_rows = 10

columns = [
    "id", 
    "a_age", 
    "a_height", 
    "a_shoe_size", 
    "b_age", 
    "b_height", 
    "b_shoe_size",
    ]

# generate matrix
dtype = {
    "names": columns,
    "formats": ['i'] + ['int8'] * (len(columns) - 1)
    }

ids = np.zeros(n_rows)
people = np.tile([default_age, default_height, default_shoe_size], (n_rows,2))
data = np.column_stack((ids, people))

mat = np.core.records.fromarrays(list(data.T), dtype=dtype)

Which gives:

>>> mat
rec.array([(0, 2, 5, 9, 2, 5, 9), (0, 2, 5, 9, 2, 5, 9), (0, 2, 5, 9, 2, 5, 9),
       (0, 2, 5, 9, 2, 5, 9), (0, 2, 5, 9, 2, 5, 9), (0, 2, 5, 9, 2, 5, 9),
       (0, 2, 5, 9, 2, 5, 9), (0, 2, 5, 9, 2, 5, 9), (0, 2, 5, 9, 2, 5, 9),
       (0, 2, 5, 9, 2, 5, 9)], 
      dtype=[('id', '<i4'), ('a_age', 'i1'), ('a_height', 'i1'), ('a_shoe_size', 'i1'), ('b_age', 'i1'), ('b_height', 'i1'), ('b_shoe_size', 'i1')])

#3


You could use an enum to represent the column names:

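from enum import IntEnum

# IntEnum members behave like ints, so they can be used directly as indices
class Columns(IntEnum):
    id = 0
    a_age = 1
    a_height = 2
    a_shoe_size = 3
    b_age = 4
    b_height = 5
    b_shoe_size = 6
    ...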

Then use the normal array-of-arrays initialization and access syntax, or whatever object you want to use. In place of the numeric column index you would use, for example, Columns.a_age. For more information on enums, see "How can I represent an 'Enum' in Python?"
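
A short sketch of what that could look like with a regular 2d NumPy array. It assumes IntEnum rather than plain Enum, since a plain Enum member is not accepted as an index; n_rows is illustrative.

```python
from enum import IntEnum
import numpy as np

class Columns(IntEnum):
    id = 0
    a_age = 1
    a_height = 2
    a_shoe_size = 3
    b_age = 4
    b_height = 5
    b_shoe_size = 6

n_rows = 4
# Two (2, 5, 9) triplets per row, with a zero id column in front.
triplets = np.tile(np.array([2, 5, 9], dtype=np.int8), (n_rows, 2))
mat = np.column_stack((np.zeros(n_rows, dtype=np.int8), triplets))

# The enum member stands in for the numeric column index.
mat[:, Columns.a_age] = 3
print(mat[:, Columns.b_height])
```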

#4


You can fill in the default values with a for-loop. For example, if you have the default values in a dictionary:

default_values = {
    "a_age": 3,
    "a_height": 5,
}
for column, value in default_values.items():
    mat[column] = value
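
With very many columns you probably don't want to type that dictionary out by hand. A sketch that looks the default up from the part of the field name after the first underscore, assuming the names follow the prefix_suffix pattern from the question (suffix_defaults is a made-up name):

```python
import numpy as np

columns = ["id", "a_age", "a_height", "a_shoe_size",
           "b_age", "b_height", "b_shoe_size"]
dt = np.dtype({"names": columns,
               "formats": ["i"] + ["int8"] * (len(columns) - 1)})
mat = np.zeros(10, dtype=dt)

# One default per kind of column, keyed by the name's suffix.
suffix_defaults = {"age": 2, "height": 5, "shoe_size": 9}

for name in mat.dtype.names[1:]:        # skip the id field
    suffix = name.split("_", 1)[1]      # "a_shoe_size" -> "shoe_size"
    mat[name] = suffix_defaults[suffix]
```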
