Filter data frame matching all values of a vector

Time: 2023-01-14 20:12:34

I want to filter data frame x by including IDs that contain rows for Hour that match all values of testVector.

ID <- c('A','A','A','A','A','B','B','B','B','C','C')
Hour <- c('0','2','5','6','9','0','2','5','6','0','2')

x <- data.frame(ID, Hour)
x
   ID Hour
1   A    0
2   A    2
3   A    5
4   A    6
5   A    9
6   B    0
7   B    2
8   B    5
9   B    6
10  C    0
11  C    2

testVector <- c('0','2','5')

The solution should yield the following data frame:

x
  ID Hour
1  A    0
2  A    2
3  A    5
4  A    6
5  A    9
6  B    0
7  B    2
8  B    5
9  B    6

All values of ID C were dropped because it was missing Hour 5. Note that I want to keep all values of Hour for IDs that match testVector.

A dplyr solution would be ideal, but any solution is welcome.

Based on other related questions on SO, I'm guessing I need some combination of %in% and all, but I can't quite figure it out.

3 solutions

#1


2  

Here's another dplyr solution without ever leaving the pipe:

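library(dplyr)

ID <- c('A','A','A','A','A','B','B','B','B','C','C')
Hour <- c('0','2','5','6','9','0','2','5','6','0','2')

x <- data.frame(ID, Hour)

testVector <- c('0','2','5')

x %>%
  group_by(ID) %>%
  # flag rows whose Hour is one of the required values
  mutate(contains = Hour %in% testVector) %>%
  # count flagged rows per ID (assumes no repeated Hour within an ID)
  summarise(all = sum(contains)) %>%
  # keep IDs that matched every element of testVector
  filter(all >= length(testVector)) %>%
  select(-all) %>%
  # join back to recover all rows of the kept IDs
  inner_join(x, by = "ID")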

##       ID   Hour
##   <fctr> <fctr>
## 1      A      0
## 2      A      2
## 3      A      5
## 4      A      6
## 5      A      9
## 6      B      0
## 7      B      2
## 8      B      5
## 9      B      6
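
The same condition can also be written directly as a grouped filter that asks, per ID, whether every element of testVector appears in Hour. This is a minimal sketch, assuming dplyr is installed; the name kept is mine and not from the answer.

```r
library(dplyr)

x <- data.frame(
  ID   = c('A','A','A','A','A','B','B','B','B','C','C'),
  Hour = c('0','2','5','6','9','0','2','5','6','0','2')
)
testVector <- c('0','2','5')

# Keep every row of an ID only if all values of testVector occur in its Hour
kept <- x %>%
  group_by(ID) %>%
  filter(all(testVector %in% Hour)) %>%
  ungroup()
```

Because the test is on membership rather than on a row count, repeated Hour values within an ID should not affect the result.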

#2


4  

Your combination of %in% and all sounds promising. In base R you could use them to your advantage as follows:

to_keep = sapply(lapply(split(x,x$ID),function(x) {unique(x$Hour)}), 
                                              function(x) {all(testVector %in% x)})
x = x[x$ID %in% names(to_keep)[to_keep],]

Or similarly, but skipping an unnecessary lapply, which is more efficient, as suggested by d.b. in the comments:

temp = sapply(split(x, x$ID), function(a) all(testVector %in% a$Hour))
x[temp[match(x$ID, names(temp))],]

Output:

  ID Hour
1  A    0
2  A    2
3  A    5
4  A    6
5  A    9
6  B    0
7  B    2
8  B    5
9  B    6
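
If the same filter is needed for other columns, the base R idea can be wrapped in a small function. The sketch below is only an illustration: keep_ids_with_all and its argument names are hypothetical, not part of base R or any package, and it uses tapply instead of split.

```r
x <- data.frame(
  ID   = c('A','A','A','A','A','B','B','B','B','C','C'),
  Hour = c('0','2','5','6','9','0','2','5','6','0','2')
)
testVector <- c('0','2','5')

# Hypothetical helper: keep all rows of the groups in id_col whose
# value_col contains every element of `required`
keep_ids_with_all <- function(df, id_col, value_col, required) {
  # one logical per group: does the group contain all required values?
  has_all <- tapply(df[[value_col]], df[[id_col]],
                    function(v) all(required %in% v))
  # which() drops NA entries, e.g. from unused factor levels
  keep <- names(has_all)[which(has_all)]
  df[df[[id_col]] %in% keep, , drop = FALSE]
}

res <- keep_ids_with_all(x, "ID", "Hour", testVector)
```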

Hope this helps!

#3


2  

Here is an option using table from base R

i1 <- !rowSums(table(x)[, testVector]==0)
subset(x, ID %in% names(i1)[i1])
#   ID Hour
#1  A    0
#2  A    2
#3  A    5
#4  A    6
#5  A    9
#6  B    0
#7  B    2
#8  B    5
#9  B    6
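
To see why the table approach works, it may help to split the one-liner into steps. This is a sketch on the question's data; the intermediate names tab, is_missing and n_missing are mine.

```r
x <- data.frame(
  ID   = c('A','A','A','A','A','B','B','B','B','C','C'),
  Hour = c('0','2','5','6','9','0','2','5','6','0','2')
)
testVector <- c('0','2','5')

# Contingency table: one row per ID, one column per Hour, cells are row counts
tab <- table(x$ID, x$Hour)

# TRUE where an ID has no row for a required Hour
is_missing <- tab[, testVector] == 0

# Number of required Hours each ID lacks: A = 0, B = 0, C = 1
n_missing <- rowSums(is_missing)

# Keep IDs that lack none of them
i1 <- n_missing == 0
res <- subset(x, ID %in% names(i1)[i1])
```

Note that indexing the table by testVector raises a "subscript out of bounds" error if testVector contains a value that never occurs in Hour, so this variant assumes every required value appears at least once in the data.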

Or this can be done with data.table

library(data.table)
setDT(x)[, .SD[all(testVector %in% Hour)], ID]
#    ID Hour
#1:  A    0
#2:  A    2
#3:  A    5
#4:  A    6
#5:  A    9
#6:  B    0
#7:  B    2
#8:  B    5
#9:  B    6
