Extract strings from a column based on a trailing pattern

Date: 2022-09-13 07:35:28

Please forgive my pandas newbie question, but I have a column of U.S. towns and states, such as the truncated version shown below. (For some strange reason, the column is named 'Alabama[edit]', which is the state that the towns in rows 0-7 belong to.)

0     Auburn (Auburn University)[1]
1     Florence (University of North Alabama)
2     Jacksonville (Jacksonville State University)[2]
3     Livingston (University of West Alabama)[2]
4     Montevallo (University of Montevallo)[2]
5     Troy (Troy University)[2]
6     Tuscaloosa (University of Alabama, Stillman Co...
7     Tuskegee (Tuskegee University)[5]
8     Alaska[edit]
9     Fairbanks (University of Alaska Fairbanks)[2]
10    Arizona[edit]
11    Flagstaff (Northern Arizona University)[6]
12    Tempe (Arizona State University)
13    Tucson (University of Arizona)
14    Arkansas[edit]
15    Arkadelphia (Henderson State University, Ouach...
16    Conway (Central Baptist College, Hendrix Colle...
17    Fayetteville (University of Arkansas)[7]
18    Jonesboro (Arkansas State University)[8]
19    Magnolia (Southern Arkansas University)[2]
20    Monticello (University of Arkansas at Monticel...
21    Russellville (Arkansas Tech University)[2]
22    Searcy (Harding University)[5]
23    California[edit]

The towns in each state are listed below that state's name, e.g. Fairbanks (row 9) is a town in the state of Alaska.

What I want to do is to split up the town names based on the state names so that I have two columns 'State' and 'RegionName' where each state name is associated with each town name, like so:

      RegionName                                           State
0     Auburn (Auburn University)[1]                        Alabama
1     Florence (University of North Alabama)               Alabama
2     Jacksonville (Jacksonville State University)[2]      Alabama
3     Livingston (University of West Alabama)[2]           Alabama
4     Montevallo (University of Montevallo)[2]             Alabama
5     Troy (Troy University)[2]                            Alabama
6     Tuscaloosa (University of Alabama, Stillman Co...    Alabama
7     Tuskegee (Tuskegee University)[5]                    Alabama
8     Fairbanks (University of Alaska Fairbanks)[2]        Alaska
9     Flagstaff (Northern Arizona University)[6]           Arizona
10    Tempe (Arizona State University)                     Arizona
11    Tucson (University of Arizona)                       Arizona
12    Arkadelphia (Henderson State University, Ouach...    Arkansas

. . .etc.

I know that each state name is followed by a string '[edit]', which I assume I can use to do the split and assignment of the town names. But I don't know how to do this.

Also, I know that there's a lot of other data cleaning I need to do, such as removing the strings within parentheses and within the brackets '[]'. That can be done later; the important part is splitting up the states and towns and assigning each town to its proper U.S. state. Any advice would be most appreciated.

1 Answer

#1


Without much context or access to your data, I'd suggest something along these lines. First, modify the code that reads your data:

import pandas as pd
df = pd.read_csv(..., header=None, names=['RegionName'])  # header=None reads the first row as data instead of as column names
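
To see why the column ended up named 'Alabama[edit]', here is a minimal sketch comparing the default read with the header=None read. The in-memory text in `raw` is a made-up stand-in for the real file (which is not shown in the question), and it assumes one entry per line with no commas inside an entry.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the real file: one entry per line.
raw = (
    "Alabama[edit]\n"
    "Auburn (Auburn University)[1]\n"
    "Alaska[edit]\n"
    "Fairbanks (University of Alaska Fairbanks)[2]\n"
)

# Default behaviour: the first line is consumed as the column name.
default_df = pd.read_csv(io.StringIO(raw))

# header=None keeps the first line as data; names= supplies the column label.
df = pd.read_csv(io.StringIO(raw), header=None, names=["RegionName"])

print(default_df.columns[0])
print(df)
```

With the default read, the first state name is lost from the data, so the towns in that state would have nothing to be forward filled from.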

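Now extract the state name using str.extract. The pattern only matches values that are followed by the substring "[edit]", so town rows come back as NaN. You can then forward fill those NaN values using ffill, which copies each state name down to the towns beneath it. Note that the state header rows themselves remain in 'RegionName', so you may want to drop them afterwards.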

# expand=False returns a Series, so the result can be assigned to a single column
df['State'] = df['RegionName'].str.extract(
    r'(?P<State>.*)(?=\s*\[edit\])', expand=False).ffill()
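
As a rough end-to-end sketch of the approach above, the following builds a small sample frame, tags each town with its state, and then drops the state header rows. The sample rows are taken from the question; the `is_state` mask, the `towns` frame, and the `Town` column are names introduced here for illustration, and the final cleanup step is only one possible way to strip the parenthesised and bracketed text the question mentions.

```python
import pandas as pd

# A few rows from the question, with state header rows mixed in with town rows.
df = pd.DataFrame({"RegionName": [
    "Alabama[edit]",
    "Auburn (Auburn University)[1]",
    "Florence (University of North Alabama)",
    "Alaska[edit]",
    "Fairbanks (University of Alaska Fairbanks)[2]",
    "Arizona[edit]",
    "Tempe (Arizona State University)",
]})

# Rows that contain "[edit]" are state headers, not towns.
is_state = df["RegionName"].str.contains(r"\[edit\]", regex=True)

# Capture the text before "[edit]"; town rows yield NaN, which ffill
# replaces with the most recent state name above them.
df["State"] = (
    df["RegionName"]
    .str.extract(r"^(.*?)\s*\[edit\]", expand=False)
    .ffill()
)

# Keep only the town rows and renumber the index.
towns = df[~is_state].copy().reset_index(drop=True)

# Optional cleanup (an assumption about the desired result): drop everything
# from the first "(" or "[" onward, leaving just the town name.
towns["Town"] = (
    towns["RegionName"].str.replace(r"\s*[\(\[].*$", "", regex=True).str.strip()
)

print(towns)
```

Building the mask before filtering keeps the two steps independent: the forward fill needs the header rows present, while the final table does not.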
