How can I use a regular expression to extract the unmatched part of a string?

I have a vector of extremely messy strings. Here is an example:

library(tidyverse)
library(stringr)
strings <- tibble(
  name = c("lorem 11:07:59 86136-1-sed", 
           "ipsum 14:35:57 S VARNAME-ut",
           "dolor 10:37:53 1513 -2-perspiciatis",
           "sit 10:48:25",
           "amet 13:52:1365293-2-unde",
           "consectetur 11:53:1 16018-2-omnis",
           "adipiscing 11:19 17237-2-iste"
           )
)
strings_out <- strings %>% 
  mutate(heads = str_extract(name, "^.*?\\s\\d{1,2}:\\d{1,2}:\\d{1,2}")) %>% 
  mutate(ends = str_replace(name, "^.*?\\s\\d{1,2}:\\d{1,2}:\\d{1,2}", ""))
strings_out[,2:3]
#> # A tibble: 7 x 2
#>                 heads                          ends
#>                 <chr>                         <chr>
#> 1      lorem 11:07:59                   86136-1-sed
#> 2      ipsum 14:35:57                  S VARNAME-ut
#> 3      dolor 10:37:53          1513 -2-perspiciatis
#> 4        sit 10:48:25                              
#> 5       amet 13:52:13                  65293-2-unde
#> 6 consectetur 11:53:1                 16018-2-omnis
#> 7                <NA> adipiscing 11:19 17237-2-iste

Each string contains some text, followed by a time that may or may not have been entered correctly, followed by more text. I want to extract only the part of each string after the time, but that part follows no pattern I can target directly with str_extract. I can easily match the first half of each string, shown in heads. The only way I have found to get the second half is to use str_replace with an empty replacement, as shown in ends.

I tried to include all the common errors I noticed in this list: no consistent pattern in the hyphenation, spacing, or contents after the time; no guaranteed space between the time and the desired end of the string; and times missing digits or even colons.

What I would like to do is to be able to use str_extract to get something close to what I got with str_replace. The key difference is that for the errors where this regex still does not work, str_extract gives me an NA that is easy to filter for and fix manually, but str_replace just copies in the whole string as seen in row 7.

I suspect I could do this with some more hacky methods, like getting all the NA and fixing manually in Excel or something, but I was surprised that I could not figure out how to return the unmatched portion of a string in general despite a bunch of searching and trying different regular expressions that include (^) and [^]. Any ideas?

3 Solutions

#1

In general, you'll probably want to look into lookarounds, but your data might need more structure for them to be useful.

Here's a quick example I wrote before realizing the time doesn't always have a space after it:

library(tidyverse)
library(stringr)
strings <- tibble(
  name = c("lorem 11:07:59 86136-1-sed", 
           "ipsum 14:35:57 S VARNAME-ut",
           "dolor 10:37:53 1513 -2-perspiciatis",
           "sit 10:48:25",
           "amet 13:52:1365293-2-unde",
           "consectetur 11:53:1 16018-2-omnis",
           "adipiscing 11:19 17237-2-iste"
  )
)
strings_out <- strings %>% 
  mutate(heads = str_extract(name, "^.*?\\s\\d{1,2}:\\d{1,2}:\\d{1,2}"),
         ends = str_extract(name, "(?<=:\\d{1,2} )[\\s\\S]+$"))

strings_out[c(1,3)]
#> # A tibble: 7 x 2
#>                                  name                 ends
#>                                 <chr>                <chr>
#> 1          lorem 11:07:59 86136-1-sed          86136-1-sed
#> 2         ipsum 14:35:57 S VARNAME-ut         S VARNAME-ut
#> 3 dolor 10:37:53 1513 -2-perspiciatis 1513 -2-perspiciatis
#> 4                        sit 10:48:25                 <NA>
#> 5           amet 13:52:1365293-2-unde                 <NA>
#> 6   consectetur 11:53:1 16018-2-omnis        16018-2-omnis
#> 7       adipiscing 11:19 17237-2-iste         17237-2-iste

The problem here is rows like row 5. Without more structure, we can't know whether the time is 13:52:13 or 13:52:1, since both formats appear in other strings. Deciding which is correct is not a problem that regular expressions can solve.
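
Since the regex cannot decide, one option is to flag such rows for manual review instead of guessing. The sketch below is an illustration under assumptions that are not part of the original answer: the `name` vector mirrors the question's data, and the rule "three or more digits directly after the second colon" is only a heuristic for spotting a time that runs into the following text.

```r
library(stringr)

# Same strings as in the question
name <- c("lorem 11:07:59 86136-1-sed",
          "ipsum 14:35:57 S VARNAME-ut",
          "dolor 10:37:53 1513 -2-perspiciatis",
          "sit 10:48:25",
          "amet 13:52:1365293-2-unde",
          "consectetur 11:53:1 16018-2-omnis",
          "adipiscing 11:19 17237-2-iste")

# Heuristic (an assumption): the seconds field is ambiguous when the
# second colon is followed by three or more digits with no separator.
ambiguous <- str_detect(name, ":\\d{1,2}:\\d{3,}")

# Rows to inspect by hand
name[ambiguous]
#> [1] "amet 13:52:1365293-2-unde"
```

The flagged rows can then be corrected manually while the rest are handled by the regex.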

#2

You can also try this:

library(tidyverse)
library(stringr)

regex = "^\\w+\\s\\d{2}:\\d{2}:*\\d{0,2}"

strings %>%
  mutate(head = str_extract(name, regex),
         end = str_replace(name, paste0(regex, "\\s?"), ""),
         end = str_replace(end, "^\\s*$", NA_character_))

Result:

# A tibble: 7 x 3
                                 name                head                  end
                                <chr>               <chr>                <chr>
1          lorem 11:07:59 86136-1-sed      lorem 11:07:59          86136-1-sed
2         ipsum 14:35:57 S VARNAME-ut      ipsum 14:35:57         S VARNAME-ut
3 dolor 10:37:53 1513 -2-perspiciatis      dolor 10:37:53 1513 -2-perspiciatis
4                        sit 10:48:25        sit 10:48:25                 <NA>
5           amet 13:52:1365293-2-unde       amet 13:52:13         65293-2-unde
6   consectetur 11:53:1 16018-2-omnis consectetur 11:53:1        16018-2-omnis
7       adipiscing 11:19 17237-2-iste    adipiscing 11:19         17237-2-iste

Note:

My solution works for row 5, but you will have to decide whether you want to extract 13:52:13 or 13:52:1 in this case. Either case can be handled with a simple modification to the regex but, as @Zach noted, there is no automatic way to choose.
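
As a rough sketch of the two readings of row 5: the first pattern is the regex from this answer, while the second, which caps the seconds at one digit, is a hypothetical variant written only for illustration. The variable names `two_digit` and `one_digit` are likewise made up here.

```r
library(stringr)

x <- "amet 13:52:1365293-2-unde"

# Pattern from this answer: up to two digits for the seconds
two_digit <- "^\\w+\\s\\d{2}:\\d{2}:*\\d{0,2}"
# Hypothetical variant: at most one digit for the seconds
one_digit <- "^\\w+\\s\\d{2}:\\d{2}:*\\d{0,1}"

head_two <- str_extract(x, two_digit)      # "amet 13:52:13"
end_two  <- str_replace(x, two_digit, "")  # "65293-2-unde"

head_one <- str_extract(x, one_digit)      # "amet 13:52:1"
end_one  <- str_replace(x, one_digit, "")  # "365293-2-unde"
```

Note that the one-digit variant would cut well-formed times short (for example, it reads 11:07:59 as 11:07:5), so it only makes sense for rows already known to have a single seconds digit.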

#3

You can have it with just one additional line:

strings["rx"] <- str_match(strings$name, "\\d*:\\d*(?::\\d+)?(.*)")[,2]
strings

Which yields

# A tibble: 7 x 2
                                 name                    rx
                                <chr>                 <chr>
1          lorem 11:07:59 86136-1-sed           86136-1-sed
2         ipsum 14:35:57 S VARNAME-ut          S VARNAME-ut
3 dolor 10:37:53 1513 -2-perspiciatis  1513 -2-perspiciatis
4                        sit 10:48:25                      
5           amet 13:52:1365293-2-unde               -2-unde
6   consectetur 11:53:1 16018-2-omnis         16018-2-omnis
7       adipiscing 11:19 17237-2-iste          17237-2-iste
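
The capture-group idea can also be combined with the stricter time pattern from the question, which may be closer to what was asked for: rows where the full time pattern is missing come back as NA rather than as the whole string. This is a minimal sketch, not part of the original answer; the `name` vector mirrors the question's data, and `str_trim` is added here to drop the leading space.

```r
library(stringr)

name <- c("lorem 11:07:59 86136-1-sed",
          "ipsum 14:35:57 S VARNAME-ut",
          "dolor 10:37:53 1513 -2-perspiciatis",
          "sit 10:48:25",
          "amet 13:52:1365293-2-unde",
          "consectetur 11:53:1 16018-2-omnis",
          "adipiscing 11:19 17237-2-iste")

# The question's "heads" pattern, followed by a capture group for the rest.
# str_match() returns NA in every column when the pattern does not match.
pattern <- "^.*?\\s\\d{1,2}:\\d{1,2}:\\d{1,2}(.*)$"
ends <- str_trim(str_match(name, pattern)[, 2])

ends
#> [1] "86136-1-sed"          "S VARNAME-ut"         "1513 -2-perspiciatis"
#> [4] ""                     "65293-2-unde"         "16018-2-omnis"
#> [7] NA
```

Row 4 yields an empty string because the time matched and nothing follows it, while row 7 yields NA because the time has no seconds field. Row 5 is still read as 13:52:13, which is a choice made by the pattern and not something the data confirms.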
