Extracting specific text from a data frame by delimiter count and position

Time: 2022-09-17 23:41:14

I'm learning regular expressions and have hit a bit of a wall. I have the following dataframe:

import pandas

item_data = pandas.DataFrame({'item': ['001', '002', '003'],
                              'description': ['Fishing,Hooks,12-inch', 'Fishing,Lines', 'Fish Eggs']})

For each description, I want to extract everything before the second comma (","). If there is no second comma, the original description is retained.

Results should look like this:

item_data=pandas.DataFrame({'item':['001','002','003'],
'description':['Fishing,Hooks,12-inch','Fishing,Lines','Fish Eggs'],
'new_description':['Fishing,Hooks','Fishing,Lines', 'Fish Eggs']})

Any pointers would be much appreciated.

Thanks.

2 solutions

#1


Using a regexp...

re.sub("^([^,]*,[^,]*),.*$", "\\1", x)

The pattern reads as follows:

  • ^ start of string
  • ( start of capture group
  • [^,] any character except a comma
  • * zero or more times
  • , a comma
  • [^,] any character except a comma
  • * zero or more times
  • ) end of capture group
  • , another comma
  • .* anything (the rest of the string)
  • $ end of string

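Replacing the match with the content of group 1 (\1) drops the second comma and everything after it. If the string contains fewer than two commas, the pattern does not match and re.sub returns it unchanged.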
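
To see that behaviour on the descriptions from the question, here is a minimal sketch. The helper name before_second_comma and the precompiled pattern are my own additions for illustration; the pattern itself is the one above, written as a raw string.

```python
import re

# Same pattern as in the answer; a raw string avoids doubling the backslash.
pattern = re.compile(r"^([^,]*,[^,]*),.*$")


def before_second_comma(text):
    """Hypothetical helper: keep everything before the second comma.

    If the text has fewer than two commas the pattern does not match,
    so sub() returns the text unchanged.
    """
    return pattern.sub(r"\1", text)


descriptions = ['Fishing,Hooks,12-inch', 'Fishing,Lines', 'Fish Eggs']
for text in descriptions:
    print(repr(text), '->', repr(before_second_comma(text)))
```

With a pandas column, the same helper could be applied through item_data['description'].map(before_second_comma).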

#2


new_description = [",".join(i.split(",")[:2]) for i in item_data['description']]
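
To get this into the frame as the new_description column the question asks for, the list can be assigned directly. A minimal sketch, assuming the item_data frame from the question; the variable name vectorized and the .str accessor variant are my additions, not part of the answer above.

```python
import pandas

item_data = pandas.DataFrame({
    'item': ['001', '002', '003'],
    'description': ['Fishing,Hooks,12-inch', 'Fishing,Lines', 'Fish Eggs'],
})

# Keep at most the first two comma-separated fields, then join them again.
# A description without a comma splits into a single field and is kept as-is.
item_data['new_description'] = [
    ",".join(text.split(",")[:2]) for text in item_data['description']
]

# Same idea through the pandas string accessor instead of a Python loop
# (illustrative alternative, not from the original answer).
vectorized = item_data['description'].str.split(",").str[:2].str.join(",")

print(item_data['new_description'].tolist())
print(vectorized.tolist())
```

Both variants rely on slicing being safe when there are fewer than two fields, so no special case is needed for descriptions with zero or one comma.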
