R equivalent of Python's re.findall

I am trying to get all matches for a RegExp from a string but apparently it's not so easy in R, or I have overlooked something. Truth be told, it's really confusing and I found myself lost among all the options: str_extract, str_match, str_match_all, regexec, grep, gregexpr, and who knows how many others.

In reality, all I'm trying to accomplish is simply (in Python):

>>> import re
>>> re.findall(r'([\w\']+|[.,;:?!])', 'This is starting to get really, really annoying!!')
['This', 'is', 'starting', 'to', 'get', 'really', ',', 'really', 'annoying', '!', '!']

The problem with the functions mentioned above is that they either return a single match or return no match at all.

1 Solution

#1

In general, R has no exact equivalent of Python's re.findall, which returns either a list of match values or, when the pattern contains capturing groups, a list of the captured substrings (tuples if there is more than one group). The closest is str_match_all from the stringr package, but it behaves more like Python's re.finditer: it returns the whole match value in the first item and the capturing group contents in the subsequent items. It is still not an exact equivalent of re.finditer, because only texts are returned, not match data objects. If str_match_all did not return the whole match value, it would be an exact equivalent of re.findall.
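
To make the comparison concrete, here is a small Python sketch of how re.findall changes its return shape with the number of capturing groups, and how re.finditer differs. The shortened input string, the two-group variant of the pattern, and the variable names are illustrative assumptions, not part of the question.

```python
import re

text = "really, really annoying!!"

# No capturing group: a flat list of whole-match strings.
no_group = re.findall(r"[\w']+|[.,;:?!]", text)

# One capturing group: a flat list of that group's contents.
# Here the group wraps the whole pattern, so the result is unchanged.
one_group = re.findall(r"([\w']+|[.,;:?!])", text)

# Two capturing groups: a list of tuples, one slot per group.
# A group that did not take part in a match shows up as ''.
two_groups = re.findall(r"([\w']+)|([.,;:?!])", text)

# re.finditer yields match objects: group(0) is the whole match,
# group(1) and group(2) are the submatches (None if unused).
whole_and_groups = [
    (m.group(0), m.group(1), m.group(2))
    for m in re.finditer(r"([\w']+)|([.,;:?!])", text)
]

print(no_group)
print(two_groups[:2])
print(whole_and_groups[:2])
```

The last list has the "whole match first, then the groups" layout that the answer attributes to str_match_all, which is why that function sits closer to re.finditer than to re.findall.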

Since you are using re.findall only to return matches, not captures, the capturing group in your pattern is redundant and you may remove it. You can therefore safely use regmatches with gregexpr and the PCRE flavor (perl=TRUE is required because [\\w'] won't work with a TRE regex, the default engine):

s <- "This is starting to get really, really annoying!!"
res <- regmatches(s, gregexpr("[\\w']+|[.,;:?!]", s, perl=TRUE))
res
## [[1]]
## [1] "This"     "is"      "starting" "to"       "get"      "really"  
## [7] ","        "really"   "annoying" "!"        "!"

Or, to make \w Unicode-aware so that it works as in Python 3, add the (*UCP) PCRE verb:

res <- regmatches(s, gregexpr("(*UCP)[\\w']+|[.,;:?!]", s, perl=TRUE))
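
For reference, the Python 3 behaviour that (*UCP) is meant to reproduce can be checked with a short sketch. The sample string is an assumption chosen to contain non-ASCII letters, and re.ASCII is used only as a rough Python analogue of a \w that is not Unicode-aware; it is not a claim about what R does in any particular locale.

```python
import re

# "naïve café!" written with escapes so the accented letters are
# guaranteed to be single precomposed code points.
sample = "na\u00efve caf\u00e9!"

# Python 3 str patterns: \w matches Unicode word characters by default.
unicode_tokens = re.findall(r"[\w']+|[.,;:?!]", sample)

# re.ASCII restricts \w to [a-zA-Z0-9_], so the accented letters
# match neither alternative and split the surrounding words.
ascii_tokens = re.findall(r"[\w']+|[.,;:?!]", sample, flags=re.ASCII)

print(unicode_tokens)
print(ascii_tokens)
```

The first result keeps each word as one token, while the second breaks words at every accented letter, which is the kind of difference the (*UCP) verb is there to avoid.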

If you want to use the stringr package (which uses the ICU regex library behind the scenes), you need str_extract_all:

library(stringr)
res <- str_extract_all(s, "[\\w']+|[.,;:?!]")
