
Time: 2022-12-13 22:58:31

I have a dataset of geographical data that I am trying to smooth. To do this, for each row I find all the nearest neighbours within some radius r, select those rows, take their mean, and add it as a column to the original dataframe. The code to do so is:


import pandas as pd
import numpy as np
import scipy.spatial as spatial

d = {'id': [1,2,3,4,5], 'x': [1,2,3,3,4], 'y': [1,3,2,3,4], 'factor1':[4,5,2,7,4], 'factor2':[6,4,8,3,2]}
df = pd.DataFrame(data=d)

factor = ["factor1", "factor2"]
dist = [2,1.5]

X=np.transpose(np.array([df.x, df.y]))
tree = spatial.cKDTree(X)
for i in dist:
    for j in factor:
        df[j + "_Mean_" + str(i)] = df.apply(lambda row: df[j][tree.query_ball_point([row.x, row.y],i)].mean(), axis=1)

This currently works fine but takes time, as it has to loop over each feature to average it. However, as I am already finding the nearest neighbours (the bit that takes time), there might be some way to select all the nearest-neighbour rows, average all the columns at once, and add them to the dataset, but I cannot work out how, or whether, this can be done. I have tried finding all the indices of the nearest neighbours for each row and storing them in the dataset inside the i loop, but this takes up too much memory and crashes.


I just feel that this can be done better


1 answer



I see a minor (~20%) improvement on the 5-row sample by using a list comprehension instead of df.apply.


But check how this scales with your full dataset.


import pandas as pd
import numpy as np
import scipy.spatial as spatial

d = {'id': [1,2,3,4,5], 'x': [1,2,3,3,4], 'y': [1,3,2,3,4], 'factor1':[4,5,2,7,4], 'factor2':[6,4,8,3,2]}
df = pd.DataFrame(data=d)

factor = ["factor1", "factor2"]
dist = [2,1.5]

X=np.transpose(np.array([df.x, df.y]))
tree = spatial.cKDTree(X)

def original(df):
    for i in dist:
        for j in factor:
            df[j + "_Mean_" + str(i)] = df.apply(lambda row: df[j][tree.query_ball_point([row.x, row.y],i)].mean(), axis=1)
    return df

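def jp(df):
    # Bind the method once so the inner loop avoids repeated attribute lookups
    calc = tree.query_ball_point
    for i in dist:
        for j in factor:
            df_filter = df[j]
            # One neighbour query per row; the list comprehension replaces df.apply
            df[j + "_Mean_" + str(i)] = [df_filter[calc([x, y],i)].mean() for x, y in zip(df['x'], df['y'])]
    return df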

%timeit original(df)  # 100 loops, best of 3: 13.1 ms per loop
%timeit jp(df)        # 100 loops, best of 3: 10.9 ms per loop
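
Both versions above still run one neighbour query per row for every factor. As a further option (not part of the answer above, and untimed here), the sketch below queries the tree once per radius for all points and then averages every factor column in a single step. The function name `batched` and the helper array `values` are made up for illustration, and the code assumes the dataframe has a default 0..n-1 index so that tree positions match row positions.

```python
import numpy as np
import pandas as pd
from scipy import spatial

d = {'id': [1, 2, 3, 4, 5], 'x': [1, 2, 3, 3, 4], 'y': [1, 3, 2, 3, 4],
     'factor1': [4, 5, 2, 7, 4], 'factor2': [6, 4, 8, 3, 2]}
df = pd.DataFrame(data=d)

factor = ["factor1", "factor2"]
dist = [2, 1.5]

X = df[["x", "y"]].to_numpy()
tree = spatial.cKDTree(X)

# All factor columns as one 2-D array, so a single positional lookup
# selects every factor for a given set of neighbours.
values = df[factor].to_numpy()

def batched(df):
    # Hypothetical name; returns a copy instead of mutating the input.
    out = df.copy()
    for r in dist:
        # One call per radius: an array holding a list of neighbour
        # positions for each point (each point is its own neighbour).
        neighbours = tree.query_ball_point(X, r)
        # Mean over the neighbour rows for all factors at once.
        means = np.array([values[idx].mean(axis=0) for idx in neighbours])
        for k, name in enumerate(factor):
            out[f"{name}_Mean_{r}"] = means[:, k]
    return out

result = batched(df)
```

The neighbour lists for one radius are held only for the duration of that loop iteration and are never stored in the dataframe, which may help with the memory problem described in the question. Whether it is actually faster on the full dataset would need to be measured.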


