How to convert a pandas Series of lists or tuples into a Series of numpy arrays

Date: 2022-12-28 00:33:05

I have a csv file with x, y, and z columns that represent coordinates in 3-dimensional space. I need to create a distance matrix from each item to all other items.

I can easily read the csv with pandas read_csv function, resulting in a DataFrame like the following:

import pandas as pd
import numpy as np
import scipy.spatial  # needed for the distance_matrix call further down

samples = pd.DataFrame(
    columns=['source', 'name', 'x', 'y', 'z'],
    data = [['a', 'apple', 1.0, 2.0, 3.0],
            ['b', 'pear', 2.0, 3.0, 4.0],
            ['c', 'tomato', 9.0, 8.0, 7.0],
            ['d', 'sandwich', 6.0, 5.0, 4.0]]
)

I can then convert the separate x, y, z columns into a Series of tuples:

samples['coord'] = samples.apply(
    lambda row: (row['x'], row['y'], row['z']),
    axis=1
)

or a Series of lists:

samples['coord'] = samples.apply(
    lambda row: [row['x'], row['y'], row['z']],
    axis=1
)

But I cannot create a Series of arrays:

samples['coord'] = samples.apply(
    lambda row: np.array([row['x'], row['y'], row['z']]),
    axis=1
)

This raises a ValueError: "Shape of passed values is (4,3), indices imply (4,6)"

I'd really like to have the data prepped so that I can simply call scipy's distance_matrix function, which expects two arrays, as follows:

dmat = scipy.spatial.distance_matrix(
    samples['coord'].values,
    samples['coord'].values
)

I am, of course, open to a more pythonic or more efficient way to achieve this goal if my approach is poor.

2 answers

#1


1  

This stores one NumPy array per row in the coord column:

samples['coord'] = list(samples[['x', 'y', 'z']].values)

Now:

>>> samples.coord[0]
array([ 1.,  2.,  3.])
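
A column built this way holds separate length-3 arrays, so a function that expects one 2-D array needs them stacked back together first. The following is a minimal sketch under the assumption that `samples` is the frame from the question; `points`, `diff`, and `dists` are illustrative names, and the pairwise distances are computed with plain NumPy broadcasting as a stand-in for a library routine rather than as the method the answer proposes.

```python
import numpy as np
import pandas as pd

samples = pd.DataFrame(
    columns=['source', 'name', 'x', 'y', 'z'],
    data=[['a', 'apple', 1.0, 2.0, 3.0],
          ['b', 'pear', 2.0, 3.0, 4.0],
          ['c', 'tomato', 9.0, 8.0, 7.0],
          ['d', 'sandwich', 6.0, 5.0, 4.0]]
)

# One length-3 array per row, stored in an object-dtype column.
samples['coord'] = list(samples[['x', 'y', 'z']].to_numpy())

# Stack the per-row arrays back into a single (4, 3) float array.
points = np.stack(samples['coord'].to_numpy())

# Pairwise Euclidean distances via broadcasting (illustrative stand-in):
# diff has shape (4, 4, 3), dists has shape (4, 4).
diff = points[:, None, :] - points[None, :, :]
dists = np.sqrt((diff ** 2).sum(axis=-1))
```

The stacked `points` array has the 2-D shape that scipy's distance_matrix takes for each of its two arguments, which the raw object column does not.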

#2


0  

I figured out that I can just extract a numpy array from the dataframe and use it to get the distance matrix.

# Pull the coordinate columns out as one (n, 3) array
sample_array = np.array(samples[['x', 'y', 'z']])
# Distances between every pair of rows
dmat = scipy.spatial.distance_matrix(sample_array, sample_array)

But I'd still like to have those little arrays embedded in the dataframe, alongside the other data, and I'd upvote and accept an answer that can do that.
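
Both goals can probably be met from a single extraction of the coordinates. The sketch below assumes the `samples` frame from the question and that scipy is installed; `xyz` and `labeled` are hypothetical names introduced for illustration. It computes the distance matrix from the 2-D array, stores the per-row arrays in a `coord` column, and wraps the matrix in a DataFrame keyed by `name` so results can be looked up by label.

```python
import numpy as np
import pandas as pd
import scipy.spatial

samples = pd.DataFrame(
    columns=['source', 'name', 'x', 'y', 'z'],
    data=[['a', 'apple', 1.0, 2.0, 3.0],
          ['b', 'pear', 2.0, 3.0, 4.0],
          ['c', 'tomato', 9.0, 8.0, 7.0],
          ['d', 'sandwich', 6.0, 5.0, 4.0]]
)

# Extract the coordinates once as a (4, 3) float array.
xyz = samples[['x', 'y', 'z']].to_numpy()

# Distance from every row to every other row, shape (4, 4).
dmat = scipy.spatial.distance_matrix(xyz, xyz)

# Keep the small per-row arrays alongside the other columns.
samples['coord'] = list(xyz)

# Optional: label rows and columns of the matrix by item name.
labeled = pd.DataFrame(dmat, index=samples['name'], columns=samples['name'])
```

With this layout, `labeled.loc['apple', 'pear']` gives the distance between those two items, while `samples['coord']` still carries the arrays in the frame.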
