Handy pandas tricks (continuously updated)

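A record of the pandas tricks I use most often. After a long stretch of other work I tend to forget them, so I am writing them down here: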

1. Handling labels and null values during data processing

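# Method 1: chained indexing. pandas may emit SettingWithCopyWarning here, and with copy-on-write enabled the assignment never updates df, so prefer Method 3

df['height'][df.height < 180] = 0

df['height'][df.height >= 180] = 1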

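# Method 2: uses .ix, which was deprecated in pandas 0.20 and removed in pandas 1.0, so this only runs on older versions

df['height'].ix[df['height'] < 180] = 0

df['height'].ix[df['height'] >= 180] = 1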

# Method 3 (recommended): a single .loc call with a row mask and a column label

df.loc[df['height'] < 180, 'height'] = 0

df.loc[df['height'] >= 180, 'height'] = 1

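# Method 4. In the first three methods the order of the two statements must not be swapped: assigning 1 first would let the "< 180" test turn those 1s back into 0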

df['height'] = df['height'].apply(lambda x: 1 if x >= 180 else 0)
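Because each row only needs a yes/no answer, the thresholding can also be written as one vectorized expression, which avoids the ordering problem entirely. The following is a minimal sketch, assuming a small made-up DataFrame (the sample values are illustrative only). As in Method 4, missing values compare as False and therefore become 0.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({'height': [165, 180, 192, 178]})

# The comparison yields a boolean Series; astype(int) maps True/False to 1/0.
df['height'] = (df['height'] >= 180).astype(int)

print(df['height'].tolist())  # [0, 1, 1, 0]
```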

# Replace null values with the column median

df.loc[df['age'].isnull(), 'age'] = df['age'].median()
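The same replacement can probably be expressed more directly with fillna. A small sketch under the assumption of a made-up 'age' column containing one missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing age, for illustration only.
df = pd.DataFrame({'age': [20.0, np.nan, 30.0, 40.0]})

# median() skips NaN by default, so the median of 20, 30 and 40 is 30.
df['age'] = df['age'].fillna(df['age'].median())

print(df['age'].tolist())  # [20.0, 30.0, 30.0, 40.0]
```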

2. Operating on two columns in pandas, e.g. string concatenation

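# The operation below comes up often when mining second-order (cross) features for machine learning

import pandas as pd

def str_add(x, y):

    # Join the two values as strings, separated by an underscore

    return str(x) + '_' + str(y)

df = pd.read_csv('./tmp.txt')

df['age_height'] = df.apply(lambda row: str_add(row['age'], row['height']), axis=1)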
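A row-wise apply calls the Python function once per row, which can be slow on large frames. One possible alternative is to convert both columns to strings and concatenate them as whole Series. A minimal sketch, assuming made-up 'age' and 'height' values:

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({'age': [25, 31], 'height': [0, 1]})

# Cast each column to str, then concatenate element-wise with '_' in between.
df['age_height'] = df['age'].astype(str) + '_' + df['height'].astype(str)

print(df['age_height'].tolist())  # ['25_0', '31_1']
```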

3. Feature comparison plots

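import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

df1 = pd.read_csv("./anti-fraud-final_train.csv")

df2 = pd.read_csv("./anti-fraud-final_test.csv")

# Features to compare; the 4 x 5 grid below has room for up to 20 of them

var = ['f1','f2','f3']

plt.figure(figsize=(30, 10))

# Iterate over the listed features only, so the loop cannot index past the end of var

for i, name in enumerate(var):

    plt.subplot(4, 5, i + 1)

    # Overlay the train and test densities of the same feature

    sns.kdeplot(df1[name], label='train')

    sns.kdeplot(df2[name], label='test')

    plt.legend()

plt.show()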
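The 4 x 5 grid is hard-coded for 20 features. If the feature list changes length, the grid could instead be derived from it. The helper below is a hypothetical sketch (grid_shape and its n_cols parameter are names invented here, not part of matplotlib or seaborn):

```python
import math

def grid_shape(n_features, n_cols=5):
    """Return (rows, cols) large enough to hold n_features subplots."""
    # Round up so that a partially filled last row still gets its own row.
    n_rows = math.ceil(n_features / n_cols)
    return n_rows, n_cols

print(grid_shape(3))   # (1, 5)
print(grid_shape(20))  # (4, 5)
print(grid_shape(21))  # (5, 5)
```

The returned pair can be passed to plt.subplot(rows, cols, i + 1) inside the plotting loop in place of the fixed 4 and 5.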
