R: Extracting a list of the matching parts of a string with a regular expression

Time: 2022-09-13 13:26:25

Suppose I need to extract different parts of a string as a list; for example, I would like to split the string "aaa12xxx" into three parts.

One possibility is to make three gsub calls:

pattern <- '([[:alpha:]]+)([0-9]+)([[:alpha:]]+)'
parts <- character(3)
parts[1] <- gsub(pattern, '\\1', "aaa12xxx")  # leading letters
parts[2] <- gsub(pattern, '\\2', "aaa12xxx")  # digits
parts[3] <- gsub(pattern, '\\3', "aaa12xxx")  # trailing letters

Of course, this seems quite wasteful (even if it is done inside a for loop). Isn't there a function that simply returns the list of parts, given a regex and a test string?

2 solutions

#1


4  

Just split the input string with strsplit and pick out the parts you want.

> x <- "aaa12xxx"
> strsplit(x,"(?<=[[:alpha:]])(?=\\d)|(?<=\\d)(?=[[:alpha:]])", perl=TRUE)
[[1]]
[1] "aaa" "12"  "xxx"

Get the parts by specifying the index number:

> m <- unlist(strsplit(x,"(?<=[[:alpha:]])(?=\\d)|(?<=\\d)(?=[[:alpha:]])", perl=TRUE))
> m[1]
[1] "aaa"
> m[2]
[1] "12"
> m[3]
[1] "xxx"

  • (?<=[[:alpha:]])(?=\\d) matches every boundary that is preceded by a letter and followed by a digit.

  • | means OR.

  • (?<=\\d)(?=[[:alpha:]]) matches every boundary that is preceded by a digit and followed by a letter.

  • Splitting the input at the matched boundaries gives the desired output.
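
Because strsplit is vectorized, the same boundary pattern can be applied to several strings at once. The following is a minimal sketch of that idea; the helper name split_alnum and the extra inputs "ab3" and "7up" are made up for illustration and are not part of the original answer.

```r
# Hypothetical helper: split each string at every letter/digit boundary,
# using the lookaround pattern from the answer above.
split_alnum <- function(x) {
  boundary <- "(?<=[[:alpha:]])(?=\\d)|(?<=\\d)(?=[[:alpha:]])"
  strsplit(x, boundary, perl = TRUE)
}

# Illustrative inputs (only the first one comes from the question).
inputs <- c("aaa12xxx", "ab3", "7up")
result <- split_alnum(inputs)
print(result)

# The result is a list with one character vector per input string,
# so select the vector first and then the part.
second_part <- result[[1]][2]
print(second_part)
```

strsplit returns a list even for a single input, which is why the answer calls unlist() before indexing with m[1], m[2] and m[3].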

#2


3  

(\\d+)|([a-zA-Z]+)

or

([[:alpha:]]+)|([0-9]+)

You can just grab the captures. Use str_match_all() from library(stringr). See the demo.

https://regex101.com/r/fA6wE2/8
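
As a rough sketch of what grabbing the matches can look like in practice, the code below uses base R's regmatches() with gregexpr() and regexec(), which is a swapped-in base R counterpart and not the stringr call the answer names. The stringr part runs only if the package happens to be installed; the variable names are chosen for illustration.

```r
x <- "aaa12xxx"

# All matches of the alternation pattern, using base R only.
pattern <- "([[:alpha:]]+)|([0-9]+)"
parts <- regmatches(x, gregexpr(pattern, x))[[1]]
print(parts)

# Capture groups of the three-part pattern from the question, in one call.
# regexec() reports the full match first, so drop element 1.
full <- "([[:alpha:]]+)([0-9]+)([[:alpha:]]+)"
groups <- regmatches(x, regexec(full, x))[[1]][-1]
print(groups)

# Optional: the stringr call named in the answer, if the package is available.
# str_match_all() returns a list of character matrices; column 1 holds the
# full match of each hit, the remaining columns hold the capture groups.
if (requireNamespace("stringr", quietly = TRUE)) {
  m <- stringr::str_match_all(x, pattern)[[1]]
  print(m[, 1])
}
```

With the alternation pattern, each hit fills only one of the two capture groups, so the full-match column is the convenient one to read the parts from.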
