Extracting a number pattern from a string in R

Time: 2022-09-13 16:14:44

I am relatively new to regular expressions and I am running into a dead end. I have a data frame with a column that looks like this:

year1
GMM14_2000_NGVA
GMM14_2001_NGVA
GMM14_2002_NGVA
...
GMM14_2014_NGVA

I am trying to extract the year in the middle of the string (2000, 2001, etc.). This is my code so far:

gsub("[^0-9]", "", year1)

This returns the year, but it also returns the 14 that appears earlier in the string:

142000
142001

Any idea on how to exclude the 14 from the pattern or how to extract the year information more efficiently?

Thanks

5 solutions

#1


5  

Use the following gsub:

s  = "GMM14_2002_NGVA"
gsub("^[^_]*_|_[^_]*$", "", s)

The regex breakdown:

Match...

  • ^[^_]*_ - 0 or more characters other than _ from the start of the string, followed by a _
  • | - or...
  • _[^_]*$ - a _ followed by 0 or more characters other than _, up to the end of the string

and remove them.
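
The two alternatives in that pattern can be applied one at a time to see what each removes. This is only a sketch in base R; the three-element vector x stands in for the year1 column from the question.

```r
# Sample values taken from the question; x is a stand-in for the real column.
x <- c("GMM14_2000_NGVA", "GMM14_2001_NGVA", "GMM14_2014_NGVA")

# Step 1: drop the leading run of non-underscore characters and its underscore.
step1 <- sub("^[^_]*_", "", x)

# Step 2: drop the last underscore and everything after it.
step2 <- sub("_[^_]*$", "", step1)

# The single gsub call performs both removals at once.
combined <- gsub("^[^_]*_|_[^_]*$", "", x)

identical(step2, combined)

# Convert to integer if the years are needed as numbers.
years <- as.integer(combined)
```

Note that this only isolates the year when the string has exactly two underscores; with an extra field, as in A_B_2000_C, the result would be B_2000.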

As an alternative,

library(stringr)
str_extract(s,"(?<=_)\\d{4}(?=_)")

Here the Perl-like regex matches a 4-digit substring that is enclosed by underscores; the lookbehind and lookahead check for the underscores without including them in the match.

#2


7  

Using the stringi package, the following is one way. The assumption is that the year has 4 digits. Since the pattern specifies the number of digits, this is pretty straightforward.

library(stringi)

x <- c("GMM14_2000_NGVA", "GMM14_2001_NGVA")

stri_extract_last(x, regex = "\\d{4}")
#[1] "2000" "2001"

or

stri_extract_first(x, regex = "\\d{4}")
#[1] "2000" "2001"
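
Both calls agree here because each string contains only one run of 4 digits. A rough base R sketch of the same first/last idea shows where they would differ; the first input below is made up for illustration and does not come from the question.

```r
# First element is a hypothetical string with two 4-digit runs.
x <- c("GMM2014_2000_NGVA", "GMM14_2001_NGVA")

# gregexpr finds every match; regmatches returns one character vector per input.
all_runs <- regmatches(x, gregexpr("\\d{4}", x))

# Assumes every element has at least one match; otherwise last_run would fail.
first_run <- vapply(all_runs, function(m) m[1], character(1))
last_run  <- vapply(all_runs, function(m) m[length(m)], character(1))

first_run
last_run
```

If the prefix can itself contain 4 digits, taking the last match (or requiring the underscores) is the safer choice.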

#3


2  

Another option in base-R would be strsplit using @jazzurro 's data:

x <- c("GMM14_2000_NGVA", "GMM14_2001_NGVA")

vapply(strsplit(x, '_'), function(x) x[2], character(1))
#[1] "2000" "2001"

strsplit splits each element of the x vector on the underscores and returns a list of the same length as x. Using vapply, we collect the second element of each vector in the list, i.e. the year between the underscores.
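
To make the intermediate list visible, the two steps can be separated. This is a small sketch; fixed = TRUE is an addition here that treats "_" as a literal string rather than a regex, and the conversion to integer is optional.

```r
x <- c("GMM14_2000_NGVA", "GMM14_2001_NGVA")

# One character vector per input string, split on the literal underscore.
parts <- strsplit(x, "_", fixed = TRUE)
parts[[1]]

# Take the second piece of each vector; vapply guarantees a character result.
years <- vapply(parts, function(p) p[2], character(1))

# Optional: turn the year strings into integers.
years_int <- as.integer(years)
```

This relies on the year always being the second field; if a string has no underscore, p[2] is NA.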

#4


2  

You may use sub.

sub(".*_(\\d{4})_.*", "\\1", x)

or

devtools::install_github("Avinash-Raj/dangas")
library(dangas)
extract_a("_", "_", x)

This extracts all the characters between the start and end delimiters. Here both the start and end delimiter are an underscore.

syntax:

extract_a(start, end, string)
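
If installing a package from GitHub is not an option, the same "text between two delimiters" idea can be approximated with sub alone. The helper below, extract_between, is a hypothetical name and not the dangas implementation; it assumes the delimiters are plain characters with no special regex meaning.

```r
# Hypothetical helper (not part of dangas): returns the text between the
# first `start` and the next `end`. Assumes both delimiters are literal
# characters without regex metacharacters.
extract_between <- function(start, end, string) {
  pattern <- paste0("^.*?", start, "(.*?)", end, ".*$")
  # Lazy quantifiers need perl = TRUE to behave as in Perl-style regexes.
  sub(pattern, "\\1", string, perl = TRUE)
}

x <- c("GMM14_2000_NGVA", "GMM14_2001_NGVA")
years <- extract_between("_", "_", x)
```

As with any sub call, a string that does not match the pattern is returned unchanged, so it may be worth checking the result before converting it to a number.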

#5


0  

I have never used R, but I have a lot of experience with regexps.

The idiomatic way would be to use matching.

For R it should be regmatches:

Use regmatches to get the actual substrings matched by the regular expression. As the first argument, pass the same input that you passed to regexpr or gregexpr. As the second argument, pass the vector returned by regexpr or gregexpr. If you pass the result of regexpr, then regmatches returns a character vector with all the strings that were matched. This vector may be shorter than the input vector if no match was found in some of the elements. If you pass the result of gregexpr, then regmatches returns a list with the same number of elements as the input vector. Each element is a character vector with all the matches of the corresponding element in the input vector, or an empty character vector (character(0)) if an element had no matches.

> x <- c("abc", "def", "cba a", "aa")
> m <- regexpr("a+", x, perl=TRUE)
> regmatches(x, m)
[1] "a"  "a"  "aa"
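
The difference between the regexpr and gregexpr results is easier to see side by side. The sketch below reuses the same x; the NA-padding step at the end is an addition for cases where the result must stay aligned with the input, such as a data frame column.

```r
x <- c("abc", "def", "cba a", "aa")

# gregexpr: one list element per input, holding every match in that element.
m_all <- gregexpr("a+", x, perl = TRUE)
res <- regmatches(x, m_all)
lengths(res)   # number of matches per input element

# regexpr: first match only, and non-matching inputs are dropped.
m <- regexpr("a+", x, perl = TRUE)
first_only <- regmatches(x, m)

# Keep the result aligned with x by filling non-matches with NA.
aligned <- rep(NA_character_, length(x))
aligned[m != -1] <- first_only
```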

In your case it should be:

m <- regexpr("\\d{4}", year1, perl=TRUE)
regmatches(year1, m)

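If the same string can contain another run of 4 digits, you can require the surrounding underscores as well. With non-capturing groups that would probably look like this:

"(?:_)\\d{4}(?:_)"

Note that a non-capturing group is still part of the match, so this returns the underscores too (for example _2000_); the lookaround version from answer #1, (?<=_)\\d{4}(?=_), matches only the digits.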

Sorry, have no chance to test all this in R.
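
For anyone who wants to check the two pattern styles from this answer in R, a minimal comparison might look like the following. The year1 vector is a stand-in for the column in the question.

```r
year1 <- c("GMM14_2000_NGVA", "GMM14_2001_NGVA")

# Non-capturing groups still consume the underscores, so they stay in the match.
with_group <- regmatches(year1,
                         regexpr("(?:_)\\d{4}(?:_)", year1, perl = TRUE))

# Lookbehind/lookahead only assert the underscores; the match is digits only.
with_look <- regmatches(year1,
                        regexpr("(?<=_)\\d{4}(?=_)", year1, perl = TRUE))

with_group
with_look
```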
