How do I strip extra whitespace from strings when parsing a CSV file in pandas?

Time: 2022-12-30 07:23:20

I have the following file named 'data.csv':

    1997,Ford,E350
    1997, Ford , E350
    1997,Ford,E350,"Super, luxurious truck"
    1997,Ford,E350,"Super ""luxurious"" truck"
    1997,Ford,E350," Super luxurious truck "
    "1997",Ford,E350
    1997,Ford,E350
    2000,Mercury,Cougar

And I would like to parse it into a pandas DataFrame so that the DataFrame looks as follows:

       Year     Make   Model              Description
    0  1997     Ford    E350                     None
    1  1997     Ford    E350                     None
    2  1997     Ford    E350   Super, luxurious truck
    3  1997     Ford    E350  Super "luxurious" truck
    4  1997     Ford    E350    Super luxurious truck
    5  1997     Ford    E350                     None
    6  1997     Ford    E350                     None
    7  2000  Mercury  Cougar                     None

The best I could do was:

    pd.read_table("data.csv", sep=r',', names=["Year", "Make", "Model", "Description"])

Which gets me:

       Year     Make   Model              Description
    0  1997     Ford    E350                     None
    1  1997    Ford     E350                     None
    2  1997     Ford    E350   Super, luxurious truck
    3  1997     Ford    E350  Super "luxurious" truck
    4  1997     Ford    E350   Super luxurious truck 
    5  1997     Ford    E350                     None
    6  1997     Ford    E350                     None
    7  2000  Mercury  Cougar                     None

How can I get the DataFrame without those whitespaces?

7 Answers

#1


37  

You could use converters:

import pandas as pd

def strip(text):
    try:
        return text.strip()
    except AttributeError:
        return text

def make_int(text):
    return int(text.strip('" '))

table = pd.read_table("data.csv", sep=r',',
                      names=["Year", "Make", "Model", "Description"],
                      converters = {'Description' : strip,
                                    'Model' : strip,
                                    'Make' : strip,
                                    'Year' : make_int})
print(table)

yields

   Year     Make   Model              Description
0  1997     Ford    E350                     None
1  1997     Ford    E350                     None
2  1997     Ford    E350   Super, luxurious truck
3  1997     Ford    E350  Super "luxurious" truck
4  1997     Ford    E350    Super luxurious truck
5  1997     Ford    E350                     None
6  1997     Ford    E350                     None
7  2000  Mercury  Cougar                     None

#2


24  

Well, the whitespace is in your data, so you can't read in the data without reading in the whitespace. However, after you've read it in, you could strip out the whitespace by doing, e.g., df["Make"] = df["Make"].map(str.strip) (where df is your dataframe).
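
As a minimal sketch of that post-processing step, assuming a small hand-built DataFrame in place of the parsed file (the column values are illustrative, and the column has no missing values):

```python
import pandas as pd

# Hypothetical stand-in for the parsed "Make" column, including the padded value.
df = pd.DataFrame({"Make": ["Ford", " Ford ", "Mercury"]})

# str.strip is applied to every element; this assumes every element is a string.
df["Make"] = df["Make"].map(str.strip)

print(df["Make"].tolist())
```

This only works while every value in the column is a string, which is why it suits columns without missing data.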

#3


12  

Adding parameter skipinitialspace=True to read_table worked for me.

So try:

pd.read_table("data.csv", 
              sep=r',', 
              names=["Year", "Make", "Model", "Description"], 
              skipinitialspace=True)

Same thing works in pd.read_csv().
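
One thing worth checking with this option is that it only skips spaces that directly follow the delimiter. The sketch below uses two rows from the question, held in memory with io.StringIO instead of data.csv, and then strips the text columns to catch the trailing and quoted spaces that remain. The variable names are illustrative.

```python
import io
import pandas as pd

# Two rows taken from the question's data, kept in memory for the demo.
raw = '1997, Ford , E350\n1997,Ford,E350," Super luxurious truck "\n'
cols = ["Year", "Make", "Model", "Description"]

df = pd.read_csv(io.StringIO(raw), names=cols, skipinitialspace=True)

# The leading space is gone, but "Ford " still carries its trailing space,
# and the spaces inside the quoted description are untouched.
make_before = df.loc[0, "Make"]
desc_before = df.loc[1, "Description"]

# Strip the text columns afterwards; .str.strip() leaves missing values as NaN.
for col in ["Make", "Model", "Description"]:
    df[col] = df[col].str.strip()

print(repr(make_before), repr(df.loc[0, "Make"]))
```

So skipinitialspace=True alone may be enough when the padding only appears after commas, but not for values like " Super luxurious truck ".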

#4


6  

I don't have enough reputation to leave a comment, but the answer above that suggests map with str.strip won't work if you have NaN values, since str.strip only works on strings and NaN is a float.

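There is a built-in pandas function to do this, which I used: pd.core.strings.str_strip(df['Description']), where df is your dataframe. That function is internal and may be missing from newer pandas releases; the public equivalent is df['Description'].str.strip(). In my case I used it on a dataframe with ~1.2 million rows and it was very fast.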
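
The NaN problem can be reproduced with a small sketch. It assumes a hypothetical column with one missing value and uses the public Series.str.strip() accessor:

```python
import numpy as np
import pandas as pd

# Hypothetical column with a missing value, as for rows without a description.
s = pd.Series([" Super luxurious truck ", np.nan, "plain"], dtype=object)

# Series.str.strip() skips missing values and leaves them as NaN.
stripped = s.str.strip()

# map(str.strip) calls str.strip on every element, so the float NaN raises TypeError.
try:
    s.map(str.strip)
    map_failed = False
except TypeError:
    map_failed = True

print(stripped.tolist(), map_failed)
```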

#5


2  

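Here's a function that iterates over each column and applies .str.strip() (the public equivalent of pd.core.strings.str_strip) to the object-dtype columns, then strips the column name as well:

def df_strip(df):
  df = df.copy()
  for c in df.columns:
    # only object-dtype columns can hold Python strings
    if df[c].dtype == object:
      df[c] = df[c].str.strip()
    # strip whitespace from the column name too (assumes string column names)
    df = df.rename(columns={c: c.strip()})
  return df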

#6


1  

The str.strip() function works really well on Series. Thus, I convert the dataframe column that contains the whitespaces into a Series, strip the whitespace using the str.strip() function and then replace the converted column back into the dataframe. Below is the example code.

import pandas as pd
data = pd.DataFrame({'values': ['   ABC   ', '   DEF', '  GHI  ']})
# data['values'] is already a Series; .str.strip() returns a stripped copy
new = data['values'].str.strip()
data['values'] = new

#7


1  

I don't believe pandas supported this at the time this question was posted, but the most straightforward way to do this is to use a regex in the sep parameter of read_csv. Something like the following should work for unquoted fields. Note that a regex separator forces the Python parsing engine and tends to ignore quoting, so quoted descriptions that contain commas would still need one of the other approaches.

table = pd.read_table("data.csv", sep=r' *, *', engine='python')
