Replace blank values (whitespace) with NaN in pandas

Date: 2022-02-21 11:46:28

I want to find all values in a Pandas dataframe that contain whitespace (any arbitrary amount) and replace those values with NaNs.

Any ideas how this can be improved?

Basically I want to turn this:

                   A    B    C
2000-01-01 -0.532681  foo    0
2000-01-02  1.490752  bar    1
2000-01-03 -1.387326  foo    2
2000-01-04  0.814772  baz     
2000-01-05 -0.222552         4
2000-01-06 -1.176781  qux     

Into this:

                   A     B     C
2000-01-01 -0.532681   foo     0
2000-01-02  1.490752   bar     1
2000-01-03 -1.387326   foo     2
2000-01-04  0.814772   baz   NaN
2000-01-05 -0.222552   NaN     4
2000-01-06 -1.176781   qux   NaN

I've managed to do it with the code below, but man is it ugly. It's not Pythonic and I'm sure it's not the most efficient use of pandas either. I loop through each column and do boolean replacement against a column mask generated by applying a function that does a regex search of each value, matching on whitespace.

for i in df.columns:
    mask = df[i].apply(lambda v: bool(re.search(r'^\s*$', str(v))))
    df.loc[mask, i] = None

It could be optimized a bit by only iterating through fields that could contain empty strings:

if df[i].dtype == np.dtype('object'):

But that's not much of an improvement

And finally, this code sets the target strings to None, which works with Pandas' functions like fillna(), but it would be nice for completeness if I could actually insert a NaN directly instead of None.

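
To that last point: NaN can be assigned directly through a boolean mask, with no None intermediate. A minimal sketch with made-up data (`Series.str.fullmatch` requires pandas >= 1.1):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with whitespace-only cells in B and C.
df = pd.DataFrame({"B": ["foo", "   ", "qux"], "C": [0, 4, "  "]})

# Mark cells that are entirely whitespace (including empty strings),
# then blank them out with NaN in one step via mask().
blank = df.apply(lambda col: col.astype(str).str.fullmatch(r"\s*"))
df = df.mask(blank)

print(df.isna().sum().sum())  # -> 2
```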
Help!

8 solutions

#1


89  

I think df.replace() does the job:

import numpy as np
import pandas as pd

df = pd.DataFrame([
    [-0.532681, 'foo', 0],
    [1.490752, 'bar', 1],
    [-1.387326, 'foo', 2],
    [0.814772, 'baz', ' '],     
    [-0.222552, '   ', 4],
    [-1.176781,  'qux', '  '],         
], columns='A B C'.split(), index=pd.date_range('2000-01-01','2000-01-06'))

print(df.replace(r'\s+', np.nan, regex=True))

Produces:

                   A    B   C
2000-01-01 -0.532681  foo   0
2000-01-02  1.490752  bar   1
2000-01-03 -1.387326  foo   2
2000-01-04  0.814772  baz NaN
2000-01-05 -0.222552  NaN   4
2000-01-06 -1.176781  qux NaN

#2


26  

How about:

d = d.applymap(lambda x: np.nan if isinstance(x, str) and x.isspace() else x)

The applymap function applies a function to every cell of the dataframe.

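
A runnable Python 3 sketch of this approach with made-up data (`basestring` is Python 2 only; also note that recent pandas renames `applymap` to `DataFrame.map`):

```python
import numpy as np
import pandas as pd

d = pd.DataFrame({"B": ["foo", "   ", "qux"], "C": [0, 4, "  "]})

# Replace any cell that is a string consisting solely of whitespace.
# Caveat: ''.isspace() is False, so truly empty strings are left alone.
d = d.applymap(lambda x: np.nan if isinstance(x, str) and x.isspace() else x)

print(d.isna().sum().sum())  # -> 2
```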
#3


8  

I would do this:

df = df.apply(lambda x: x.str.strip()).replace('', np.nan)

or

df = df.apply(lambda x: x.str.strip() if x.dtype == 'object' else x).replace('', np.nan)

You can strip all str, then replace empty str with np.nan.

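
A runnable sketch of the strip-then-replace idea, restricted to object-dtype columns so numeric columns (which have no `.str` accessor) pass through untouched; data is made up:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0], "B": ["foo", "   "]})

# Strip only string columns, then turn the resulting empty strings into NaN.
df = df.apply(lambda col: col.str.strip() if col.dtype == "object" else col)
df = df.replace("", np.nan)

print(df["B"].isna().sum())  # -> 1
```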
#4


7  

If you only want to replace empty strings and records containing only spaces, the correct answer is:

df = df.replace(r'^\s*$', np.nan, regex=True)

The accepted answer

df.replace(r'\s+', np.nan, regex=True)

does not replace an empty string; you can verify this yourself with the given example, slightly updated:

df = pd.DataFrame([
    [-0.532681, 'foo', 0],
    [1.490752, 'bar', 1],
    [-1.387326, 'fo o', 2],
    [0.814772, 'baz', ' '],     
    [-0.222552, '   ', 4],
    [-1.176781,  'qux', ''],         
], columns='A B C'.split(), index=pd.date_range('2000-01-01','2000-01-06'))

Note also that 'fo o' is not replaced with NaN, even though it contains a space. Further, note that a simple:

df.replace(r'', np.nan)

Does not work either - try it out.

#5


2  

If you are importing the data from a CSV file, it can be as simple as this:

df = pd.read_csv(file_csv, na_values=' ')

This will create the DataFrame and replace blank values with NaN in one step.

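
A sketch with an inline CSV (the file contents here are made up). Note that `na_values` matches each field as an exact string, so list every whitespace variant you expect, or combine it with a regex `replace` afterwards:

```python
import io

import pandas as pd

csv_text = "A,B,C\n-0.5,foo,0\n0.8,   ,4\n"

# '   ' (three spaces) is declared a missing-value marker at parse time.
df = pd.read_csv(io.StringIO(csv_text), na_values=["   ", " "])

print(df["B"].isna().sum())  # -> 1
```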
#6


0  

For a very fast and simple solution where you check equality against a single value, you can use the mask method.

df.mask(df == ' ')
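
For example, with made-up data (this catches only cells equal to that exact single-space string, not other amounts of whitespace):

```python
import pandas as pd

df = pd.DataFrame({"B": ["foo", " ", "qux"]})

# mask() replaces cells where the condition holds with NaN by default.
out = df.mask(df == " ")

print(out["B"].isna().sum())  # -> 1
```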

#7


0  

You can also use a filter to do it.

df = pd.DataFrame([
    [-0.532681, 'foo', 0],
    [1.490752, 'bar', 1],
    [-1.387326, 'foo', 2],
    [0.814772, 'baz', ' '],
    [-0.222552, '   ', 4],
    [-1.176781, 'qux', '  '],
])
df[df.applymap(lambda x: isinstance(x, str) and x.strip() == '')] = np.nan

#8


0  

Simplest of all solutions:

df = df.replace(r'^\s+$', np.nan, regex=True)
