pandas DataFrame: replace nan values with average of columns

Time: 2023-01-07 11:46:07

I've got a pandas DataFrame filled mostly with real numbers, but there are a few nan values in it as well.

How can I replace the nans with the averages of the columns they are in?

This question is very similar to this one: "numpy array: replace nan values with average of columns". Unfortunately, the solution given there doesn't work for a pandas DataFrame.

7 answers

#1


115  

You can simply use DataFrame.fillna to fill the nans directly:

In [27]: df 
Out[27]: 
          A         B         C
0 -0.166919  0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3       NaN -2.027325  1.533582
4       NaN       NaN  0.461821
5 -0.788073       NaN       NaN
6 -0.916080 -0.612343       NaN
7 -0.887858  1.033826       NaN
8  1.948430  1.025011 -2.982224
9  0.019698 -0.795876 -0.046431

In [28]: df.mean()
Out[28]: 
A   -0.151121
B   -0.231291
C   -0.530307
dtype: float64

In [29]: df.fillna(df.mean())
Out[29]: 
          A         B         C
0 -0.166919  0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 -0.151121 -2.027325  1.533582
4 -0.151121 -0.231291  0.461821
5 -0.788073 -0.231291 -0.530307
6 -0.916080 -0.612343 -0.530307
7 -0.887858  1.033826 -0.530307
8  1.948430  1.025011 -2.982224
9  0.019698 -0.795876 -0.046431

The docstring of fillna says that value should be a scalar or a dict; however, it seems to work with a Series as well. If you want to pass a dict, you could use df.mean().to_dict().
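As a minimal sketch of the two forms mentioned here, the snippet below fills a small frame once with the Series returned by df.mean() and once with the equivalent dict. The frame and its values are made up for illustration and are not the data shown above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names and values are illustrative only.
df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0],
    "B": [np.nan, 4.0, 8.0],
})

# Column means as a Series indexed by column name (NaN is skipped by default).
col_means = df.mean()

# Passing the Series: fillna matches it to the columns by label.
filled_series = df.fillna(col_means)

# Passing a plain dict of {column name: fill value} gives the same result.
filled_dict = df.fillna(col_means.to_dict())

print(filled_series)
```

Both calls return a new DataFrame and leave df untouched, so the result has to be assigned if it should be kept.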

#2


17  

Try:

sub2['income'].fillna((sub2['income'].mean()), inplace=True)

#3


9  

In [16]: df = DataFrame(np.random.randn(10,3))

In [17]: df.iloc[3:5,0] = np.nan

In [18]: df.iloc[4:6,1] = np.nan

In [19]: df.iloc[5:8,2] = np.nan

In [20]: df
Out[20]: 
          0         1         2
0  1.148272  0.227366 -2.368136
1 -0.820823  1.071471 -0.784713
2  0.157913  0.602857  0.665034
3       NaN -0.985188 -0.324136
4       NaN       NaN  0.238512
5  0.769657       NaN       NaN
6  0.141951  0.326064       NaN
7 -1.694475 -0.523440       NaN
8  0.352556 -0.551487 -1.639298
9 -2.067324 -0.492617 -1.675794

In [22]: df.mean()
Out[22]: 
0   -0.251534
1   -0.040622
2   -0.841219
dtype: float64

Apply, per column, the mean of that column and fill:

In [23]: df.apply(lambda x: x.fillna(x.mean()),axis=0)
Out[23]: 
          0         1         2
0  1.148272  0.227366 -2.368136
1 -0.820823  1.071471 -0.784713
2  0.157913  0.602857  0.665034
3 -0.251534 -0.985188 -0.324136
4 -0.251534 -0.040622  0.238512
5  0.769657 -0.040622 -0.841219
6  0.141951  0.326064 -0.841219
7 -1.694475 -0.523440 -0.841219
8  0.352556 -0.551487 -1.639298
9 -2.067324 -0.492617 -1.675794
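The same per-column apply can be limited to numeric columns when the frame also holds text. The sketch below assumes a hypothetical mixed-type frame (the column names x, y and label are invented) and fills only the numeric columns.

```python
import numpy as np
import pandas as pd

# Hypothetical mixed-type frame; the names are illustrative only.
df = pd.DataFrame({
    "x": [1.0, np.nan, 5.0],
    "y": [np.nan, 2.0, 4.0],
    "label": ["a", None, "c"],
})

# Restrict the operation to numeric columns so that mean() is well defined.
num_cols = df.select_dtypes(include="number").columns

# Fill each numeric column with its own mean; "label" is left as it is.
df[num_cols] = df[num_cols].apply(lambda col: col.fillna(col.mean()))

print(df)
```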

#4


4  

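import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Read the data from a CSV file
Dataset = pd.read_csv('Data.csv')

# Split the input into features X and target Y
X = Dataset.iloc[:, :-1].values
Y = Dataset.iloc[:, 3].values

# Impute missing values with the column mean.
# sklearn.preprocessing.Imputer was removed in scikit-learn 0.22;
# SimpleImputer replaces it and always works column by column.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])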
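For readers without a Data.csv at hand, here is a self-contained sketch of the same idea. It assumes a recent scikit-learn, where SimpleImputer in sklearn.impute takes the place of the older Imputer class; the frame and its column names (age, salary) are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data standing in for Data.csv.
df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "salary": [1000.0, 3000.0, np.nan],
})

# strategy="mean" replaces each NaN with the mean of its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# Wrap the imputed values back into a DataFrame with the original labels.
filled = pd.DataFrame(
    imputer.fit_transform(df), columns=df.columns, index=df.index
)

print(filled)
```

The imputer object may be worth the extra dependency when the same column means have to be reused on new data via imputer.transform; for a one-off fill, df.fillna(df.mean()) is shorter.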

#5


2  

If you want to impute missing values with the mean and go column by column, this imputes each missing value with the mean of its own column only. It might also be a little more readable.

sub2['income'] = sub2['income'].fillna((sub2['income'].mean()))
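A runnable sketch of this assignment form follows, using a made-up sub2 frame. Assigning the result back does not rely on inplace=True on a selected column, which in recent pandas versions may not update the parent frame.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; only the column name "income" comes from the answer.
sub2 = pd.DataFrame({
    "income": [100.0, np.nan, 300.0],
    "age": [30, 40, 50],
})

# Compute the mean once (NaN is skipped), then assign the filled column back.
income_mean = sub2["income"].mean()
sub2["income"] = sub2["income"].fillna(income_mean)

print(sub2)
```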

#6


1  

Another option, besides those above, is:

df = df.groupby(df.columns, axis = 1).transform(lambda x: x.fillna(x.mean()))

It's less elegant than the previous answers for the mean, but it could be shorter if you want to replace nulls using some other column function.
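To illustrate the "some other column function" case, the sketch below fills with the column median instead of the mean. It uses apply rather than groupby, because recent pandas releases (2.1 and later, as far as I can tell) deprecate groupby with axis=1. The helper name fill_with is hypothetical, as is the data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; values are illustrative only.
df = pd.DataFrame({
    "A": [1.0, 2.0, np.nan, 10.0],
    "B": [np.nan, 5.0, 7.0, 9.0],
})


def fill_with(frame, func):
    """Fill the NaNs of each column with func(column); hypothetical helper."""
    return frame.apply(lambda col: col.fillna(func(col)))


# Swap in any column statistic, here the median.
filled_median = fill_with(df, lambda col: col.median())

print(filled_median)
```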

#7


0  

Use df.fillna(df.mean()) directly to fill all null values with the mean of their column.

To fill the null values of a single column with that column's mean, fill the column and assign the result back to it. Here Item_Weight is the column name:

df['Item_Weight'] = df['Item_Weight'].fillna((df['Item_Weight'].mean()))

To fill null values with a string instead, use the following, where Outlet_Size is the column name:

df.Outlet_Size = df.Outlet_Size.fillna('Missing')
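Both fills can be tried together on a small frame. This is only a sketch: the column names follow the answer, while the values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical data; only the column names come from the answer.
df = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5],
    "Outlet_Size": ["Medium", None, "High"],
})

# Numeric column: fill with the mean of that column.
df["Item_Weight"] = df["Item_Weight"].fillna(df["Item_Weight"].mean())

# Text column: fill with a placeholder string.
df["Outlet_Size"] = df["Outlet_Size"].fillna("Missing")

print(df)
```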
