Averaging subsets of rows across multiple pandas columns

Time: 2022-12-13 22:58:31

I have a dataset of geographical data that I am trying to smooth. For each row I find all neighbours within some radius r, select those rows, take the mean of a column over them, and add the result as a new column to the original dataframe. The code to do so is:

import pandas as pd
import numpy as np
import scipy.spatial as spatial

d = {'id': [1,2,3,4,5], 'x': [1,2,3,3,4], 'y': [1,3,2,3,4], 'factor1':[4,5,2,7,4], 'factor2':[6,4,8,3,2]}
df = pd.DataFrame(data=d)

factor = ["factor1", "factor2"]
dist = [2,1.5]

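# Build a k-d tree over the (x, y) coordinates of every row.
X = np.transpose(np.array([df.x, df.y]))
tree = spatial.cKDTree(X)

# For each radius and each factor, average the factor over all rows within that radius.
for i in dist:
    for j in factor:
        df[j + "_Mean_" + str(i)] = df.apply(lambda row: df[j][tree.query_ball_point([row.x, row.y], i)].mean(), axis=1)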

This currently works fine but is slow, because it repeats the neighbour search for every feature it averages. Since the neighbour search is the part that takes the time, there might be some way to select all the neighbouring rows once, average all the columns at the same time and add them to the dataset, but I cannot work out how, or whether, this can be done. I have tried finding the indices of all the nearest neighbours for each row and storing them in the dataset inside the i loop, but this takes up too much memory and crashes.

I just feel that this can be done better

1 Answer

#1


I see a minor (~20%) improvement from using a list comprehension instead of df.apply.

But check how this scales with your full dataset.

import pandas as pd
import numpy as np
import scipy.spatial as spatial

d = {'id': [1,2,3,4,5], 'x': [1,2,3,3,4], 'y': [1,3,2,3,4], 'factor1':[4,5,2,7,4], 'factor2':[6,4,8,3,2]}
df = pd.DataFrame(data=d)

factor = ["factor1", "factor2"]
dist = [2,1.5]

X=np.transpose(np.array([df.x, df.y]))
tree = spatial.cKDTree(X)

def original(df):
    # Baseline from the question: row-wise df.apply, one tree query per row.
    for i in dist:
        for j in factor:
            df[j + "_Mean_" + str(i)] = df.apply(lambda row: df[j][tree.query_ball_point([row.x, row.y], i)].mean(), axis=1)
    return df

def jp(df):
    # Same calculation, but a list comprehension over the coordinates replaces df.apply.
    calc = tree.query_ball_point
    for i in dist:
        for j in factor:
            df_filter = df[j]
            df[j + "_Mean_" + str(i)] = [df_filter[calc([x, y], i)].mean() for x, y in zip(df['x'], df['y'])]
    return df

# %timeit is an IPython magic; both timings are on the 5-row sample frame above.
%timeit original(df)  # 100 loops, best of 3: 13.1 ms per loop
%timeit jp(df)        # 100 loops, best of 3: 10.9 ms per loop
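
On the actual question (searching once and averaging every column together), here is a minimal sketch rather than a tested solution for your full data. It relies on query_ball_point accepting a whole array of points, so each radius needs one batched search that all factor columns share. The function name batched_means and the chunk parameter are my own inventions for illustration; chunk only limits how many neighbour lists are held in memory at once, since you mentioned that storing all the indices crashed. It also returns a copy instead of modifying df in place.

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

d = {'id': [1, 2, 3, 4, 5], 'x': [1, 2, 3, 3, 4], 'y': [1, 3, 2, 3, 4],
     'factor1': [4, 5, 2, 7, 4], 'factor2': [6, 4, 8, 3, 2]}
df = pd.DataFrame(data=d)

factor = ["factor1", "factor2"]
dist = [2, 1.5]

def batched_means(df, factor, dist, chunk=10000):
    """Hypothetical helper: neighbourhood means of all factor columns per radius."""
    coords = df[["x", "y"]].to_numpy()
    values = df[factor].to_numpy(dtype=float)   # shape (n_rows, n_factors)
    tree = cKDTree(coords)
    out = df.copy()
    for r in dist:
        means = np.empty_like(values)
        # Query in blocks of rows so only `chunk` neighbour lists exist at a time.
        for start in range(0, len(coords), chunk):
            block = coords[start:start + chunk]
            neighbours = tree.query_ball_point(block, r)  # one list of row positions per point
            # Each point is its own neighbour (distance 0), so no list is empty.
            means[start:start + len(block)] = np.array(
                [values[idx].mean(axis=0) for idx in neighbours])
        for k, name in enumerate(factor):
            out[name + "_Mean_" + str(r)] = means[:, k]
    return out

result = batched_means(df, factor, dist)
print(result)
```

The column names follow the same factor + "_Mean_" + radius pattern as the code above. There is still a Python-level loop over rows, so, as with the list comprehension, it is worth timing on the full dataset; the saving should grow with the number of factor columns, because the tree is searched once per radius instead of once per radius and factor.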
