How to use grep on a data frame?

Time: 2023-01-20 17:13:13

I have the following data frame:

> my.data
  A.Seats    B.Seats
1   14,15   14,15,16
2       7        7,8
3   12,13      16,17
4    <NA>      10,11

For each row, I would like to check whether the string in column "A.Seats" is found within column "B.Seats" of the same row. So the output would look something like this:

  A.Seats    B.Seats    Check
1   14,15   14,15,16     TRUE
2       7        7,8     TRUE
3   12,13      16,17    FALSE
4    <NA>      10,11    FALSE

But I don't know how to create this table. As a start, I tried using grep:

grep(my.data$A.Seats,my.data$B.Seats)

But I receive the following output:

[1] 1
Warning message:
In grep(my.data$A.Seats, my.data$B.Seats) :
argument 'pattern' has length > 1 and only the first element will be used

...and I can't get past this warning. Any ideas as to how I can get the intended result?

Many Thanks

2 Solutions

#1


The "stringi" library has some vectorized functions that might be useful for something like this. I would suggest the stri_detect() function. Here's an example with some reproducible sample data. Note the difference in the values in the first and last row, and the difference in the results according to whether a regex or fixed approach was taken:

my.data <- data.frame(
    A.Seats = c("14,15", "7", "12,13", NA, "14,19"),
    B.Seats = c("14,15,16", "7,8", "16,17", "10,11", "14,15,16"))
my.data
#   A.Seats  B.Seats
# 1   14,15 14,15,16
# 2       7      7,8
# 3   12,13    16,17
# 4    <NA>    10,11
# 5   14,19 14,15,16

library(stringi)
stri_detect(my.data$B.Seats, fixed = my.data$A.Seats)
# [1]  TRUE  TRUE FALSE    NA FALSE
stri_detect(my.data$B.Seats, regex = gsub(",", "|", my.data$A.Seats))
# [1]  TRUE  TRUE FALSE    NA  TRUE

The first option above treats the values in my.data$A.Seats as a fixed string pattern. The second option treats it as a regular expression to match any of the values.
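
One caveat with the regex option is worth sketching: an unanchored alternation such as "7" also matches inside "17". The example below is a minimal illustration using base R's grepl rather than stringi, and the vectors a and b are made-up sample data, not the asker's. It wraps the alternation in comma or string boundaries so only whole seat numbers match.

```r
# Hypothetical sample data: the last row has seat "7" in A but only "17,18" in B.
a <- c("14,19",    "7",   "12,13", "7")
b <- c("14,15,16", "7,8", "16,17", "17,18")

# Loose pattern: any seat number may match anywhere in the string.
loose <- gsub(",", "|", a, fixed = TRUE)

# Stricter pattern: a seat must be delimited by a comma or the string start/end.
strict <- paste0("(^|,)(", loose, ")(,|$)")

# grepl() only uses one pattern at a time, so pair patterns and strings row by row.
loose_hit  <- mapply(grepl, loose,  b, USE.NAMES = FALSE)
strict_hit <- mapply(grepl, strict, b, USE.NAMES = FALSE)

loose_hit   # the last row is a false positive ("7" found inside "17")
strict_hit  # the last row is no longer matched
```

The same anchored pattern should also be usable as the regex argument of stri_detect(), though that is not tested here.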

Note that this maintains NA as NA, but that can easily be changed to FALSE if you need to.
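
As a small sketch of that last step, the snippet below turns NA into FALSE and attaches the result as the Check column the question asks for. The vector detected is simply copied from the fixed-pattern output above so that the example runs without stringi; the name is made up for illustration.

```r
my.data <- data.frame(
  A.Seats = c("14,15", "7", "12,13", NA, "14,19"),
  B.Seats = c("14,15,16", "7,8", "16,17", "10,11", "14,15,16"),
  stringsAsFactors = FALSE
)

# Stand-in for the result of stri_detect(my.data$B.Seats, fixed = my.data$A.Seats)
detected <- c(TRUE, TRUE, FALSE, NA, FALSE)

# FALSE & NA is FALSE, so rows with a missing A.Seats value end up as FALSE.
my.data$Check <- !is.na(detected) & detected
my.data
```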


If you don't want to think too much about mapply, you can consider Vectorize to make a vectorized version of grepl. Something like the following should do it:

vGrepl <- Vectorize(grepl)
vGrepl(my.data$A.Seats, my.data$B.Seats)                 # pattern used as-is (no regex metacharacters in it)
# [1]  1  1  0 NA  0
vGrepl(gsub(",", "|", my.data$A.Seats), my.data$B.Seats) # pattern is regex
# 14|15     7 12|13  <NA> 14|19 
#     1     1     0    NA     1 
as.logical(vGrepl(my.data$A.Seats, my.data$B.Seats))     # coerce to logical
# [1]  TRUE  TRUE FALSE    NA FALSE

Because Vectorize is a wrapper around mapply and this calls grepl once for each element in the vector, I don't think it will scale well to large data frames.
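
For comparison, here is roughly what the same idea looks like with mapply written out by hand. This is only a sketch: seat_match is a made-up helper name, and the explicit NA check is an assumption added so that the result is a plain logical vector with FALSE for missing patterns.

```r
a <- c("14,15", "7", "12,13", NA, "14,19")
b <- c("14,15,16", "7,8", "16,17", "10,11", "14,15,16")

# Hypothetical helper: literal (fixed) match of one pattern in one string.
seat_match <- function(pattern, x) {
  if (is.na(pattern)) return(FALSE)   # treat a missing pattern as "not found"
  grepl(pattern, x, fixed = TRUE)
}

# mapply() pairs a[i] with b[i], so grepl() never sees more than one pattern.
check <- mapply(seat_match, a, b, USE.NAMES = FALSE)
check
```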

#2


Here is one approach that gets what you need:

> List <- lapply(my.data, function(x) strsplit(as.character(x), ","))
> transform(my.data, Check=sapply(mapply("%in%", List[[1]], List[[2]]), any))
  A.Seats  B.Seats Check
1   14,15 14,15,16  TRUE
2       7      7,8  TRUE
3   12,13    16,17 FALSE
4    <NA>    10,11 FALSE
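
One thing to be aware of with the two-step sapply(mapply(...)) call above: mapply simplifies its result to a matrix when every row yields the same number of seats, and sapply would then no longer see one element per row. A possible variant, sketched below under the assumption that the columns are character vectors (hence stringsAsFactors = FALSE), does the any() inside the mapply call so there is always exactly one value per row. The names a_seats, b_seats and CheckAll are made up for illustration.

```r
my.data <- data.frame(
  A.Seats = c("14,15", "7", "12,13", NA),
  B.Seats = c("14,15,16", "7,8", "16,17", "10,11"),
  stringsAsFactors = FALSE
)

# Split each cell into its individual seat numbers (one character vector per row).
a_seats <- strsplit(my.data$A.Seats, ",", fixed = TRUE)
b_seats <- strsplit(my.data$B.Seats, ",", fixed = TRUE)

# TRUE if at least one A seat appears among the B seats of the same row.
my.data$Check <- mapply(function(a, b) any(a %in% b), a_seats, b_seats)

# Stricter variant: TRUE only if every A seat appears among the B seats.
my.data$CheckAll <- mapply(function(a, b) all(a %in% b), a_seats, b_seats)
my.data
```

Because the comparison is done on whole seat numbers, "7" is not matched by "17", and an NA in A.Seats comes out as FALSE without extra handling.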

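Here's an alternative using grep. Note that grep only uses the first seat of each A.Seats entry as its pattern, which is why the warnings are suppressed: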

> transform(my.data,
            Check=sapply(suppressWarnings(mapply("grep", List[[1]], List[[2]])), any))
