How to use the tidyverse to selectively remove duplicates based on the value of another column

library(tidyverse)

Using the example data at the bottom, I'm trying to remove duplicates in the ID column, but only the duplicates where the "Year" column equals 2017.

I tried the code below, which doesn't seem to work.

DF <- DF %>% 
  group_by(ID) %>% 
  mutate(REMOVE = if_else(duplicated(ID) & Year == 2017, 1, 0))

DF <- DF %>% 
  group_by(ID) %>% 
  mutate(REMOVE = if_else(!unique(ID) & Year == 2017, 1, 0))

I'm trying to group by "ID", then use an if_else() statement to flag the 2017 rows within each group of duplicate IDs with a 1. I'll then remove all the 1s with the filter code below.

DF <- DF %>%
  filter(REMOVE == 1)

I'm not sure why this code isn't working. I've also tried changing the column types for ID and Year (character, numeric, etc.), but this didn't help.

Help would be appreciated!

ID<-c(18998878,8888888,57485746,18998878,45454536,64536475,64536475,87966666,58675844,58695847,68574443,87966666)
Program<-c("A111","B488","T687","A111","G888","T444","T444","P867","R444","B323","F888","P867")
Code<-c(1222,4534,543,1222,4678,6544,6544,9898,8888,5656,6666,9898)
Year<-c(2016,2016,2017,2017,2017,2017,2016,2016,2016,2017,2017,2017)
DF <- tibble(ID, Program, Code, Year)  # data_frame() is deprecated in favour of tibble()

3 Solutions

#1

Sort DF by ID and Year, then use distinct() to keep only the first row for each ID. For an ID that appears in both years, that is the Year = 2016 row.

library(dplyr)

ID <- c(18998878,8888888,57485746,18998878,45454536,64536475,64536475,87966666,
        58675844,58695847,68574443,87966666)
Program <- c("A111","B488","T687","A111","G888","T444","T444","P867","R444","B323","F888","P867")
Code <- c(1222,4534,543,1222,4678,6544,6544,9898,8888,5656,6666,9898)
Year <- c(2016,2016,2017,2017,2017,2017,2016,2016,2016,2017,2017,2017)
DF <- tibble(ID, Program, Code, Year)
DF
#> # A tibble: 12 x 4
#>           ID Program  Code  Year
#>        <dbl> <chr>   <dbl> <dbl>
#>  1 18998878. A111    1222. 2016.
#>  2  8888888. B488    4534. 2016.
#>  3 57485746. T687     543. 2017.
#>  4 18998878. A111    1222. 2017.
#>  5 45454536. G888    4678. 2017.
#>  6 64536475. T444    6544. 2017.
#>  7 64536475. T444    6544. 2016.
#>  8 87966666. P867    9898. 2016.
#>  9 58675844. R444    8888. 2016.
#> 10 58695847. B323    5656. 2017.
#> 11 68574443. F888    6666. 2017.
#> 12 87966666. P867    9898. 2017.


DF %>% 
  arrange(ID, Year) %>% 
  distinct(ID, .keep_all = TRUE)
#> # A tibble: 9 x 4
#>          ID Program  Code  Year
#>       <dbl> <chr>   <dbl> <dbl>
#> 1  8888888. B488    4534. 2016.
#> 2 18998878. A111    1222. 2016.
#> 3 45454536. G888    4678. 2017.
#> 4 57485746. T687     543. 2017.
#> 5 58675844. R444    8888. 2016.
#> 6 58695847. B323    5656. 2017.
#> 7 64536475. T444    6544. 2016.
#> 8 68574443. F888    6666. 2017.
#> 9 87966666. P867    9898. 2016.

Created on 2018-03-07 by the reprex package (v0.2.0).
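
As a variation on the same idea, the rule can also be written as a grouped filter, which does not depend on sort order. This is only a minimal sketch, assuming the goal is to drop the 2017 row whenever its ID occurs more than once; the name `result` is illustrative and not part of the original answer.

```r
library(dplyr)

DF <- tibble(
  ID = c(18998878, 8888888, 57485746, 18998878, 45454536, 64536475,
         64536475, 87966666, 58675844, 58695847, 68574443, 87966666),
  Program = c("A111", "B488", "T687", "A111", "G888", "T444",
              "T444", "P867", "R444", "B323", "F888", "P867"),
  Code = c(1222, 4534, 543, 1222, 4678, 6544, 6544, 9898, 8888, 5656, 6666, 9898),
  Year = c(2016, 2016, 2017, 2017, 2017, 2017, 2016, 2016, 2016, 2017, 2017, 2017)
)

# Within each ID, drop rows from 2017 when that ID occurs more than once
# (assumption: "duplicate" means the ID has more than one row in DF)
result <- DF %>%
  group_by(ID) %>%
  filter(!(n() > 1 & Year == 2017)) %>%
  ungroup()
```

On the sample data this keeps the same nine rows as the arrange() plus distinct() approach, but in their original order.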

#2

library(dplyr)

ID <- c(18998878,8888888,57485746,18998878,45454536,64536475,64536475,87966666,58675844,58695847,68574443,87966666)
Program <- c("A111","B488","T687","A111","G888","T444","T444","P867","R444","B323","F888","P867")
Code <- c(1222,4534,543,1222,4678,6544,6544,9898,8888,5656,6666,9898)
Year <- c(2016,2016,2017,2017,2017,2017,2016,2016,2016,2017,2017,2017)
DF <- tibble(ID, Program, Code, Year)

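# Drop rows that repeat an earlier ID and have Year 2017
filter(DF, !(duplicated(ID) & Year == 2017))

This removes the second or later occurrence of any ID, provided that occurrence has Year 2017. In the sample data that drops rows 4 and 12. Note that it depends on row order: for ID 64536475 the 2017 row comes first, so it is not flagged as a duplicate and both of its rows are kept. If that is not what you want, I may have misunderstood your question.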

#3

Split it into two data frames: one where Year equals 2017 and one where it does not.

 DF1 <- DF %>% filter(Year==2017) 
 DF2 <- DF %>% filter(Year!=2017)

Then deduplicate DF1 on its ID column using distinct(). The .keep_all argument keeps the remaining columns.

 DF3 <- DF1 %>% distinct(ID,.keep_all = T)
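
To illustrate what `.keep_all` changes, here is a small sketch on a made-up tibble. The name `toy` and its values are hypothetical and not part of the question's data.

```r
library(dplyr)

# Hypothetical toy data: ID 1 appears twice
toy <- tibble(ID = c(1, 1, 2), Program = c("A", "B", "C"))

# Without .keep_all, only the ID column is returned
ids_only <- toy %>% distinct(ID)

# With .keep_all = TRUE, the first row for each ID is kept with all its columns
first_rows <- toy %>% distinct(ID, .keep_all = TRUE)
```

Here `ids_only` has a single column, while `first_rows` also carries the Program value from the first row of each ID.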

Now you can get your final result by combining DF2 and DF3 with rbind().

 df_all <- rbind(DF2,DF3)
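
Putting the steps together, a minimal end-to-end sketch might look like the following. It swaps in dplyr's `bind_rows()` for `rbind()`, which is a substitution on my part rather than what the answer uses. One observation about the sample data: no ID repeats within 2017, so this approach returns all 12 rows. If the intent is instead to drop a 2017 row whenever the same ID also has a row in another year, `anti_join()` is one possible alternative; `df_alt` is an illustrative name and that reading of the question is an assumption.

```r
library(dplyr)

DF <- tibble(
  ID = c(18998878, 8888888, 57485746, 18998878, 45454536, 64536475,
         64536475, 87966666, 58675844, 58695847, 68574443, 87966666),
  Program = c("A111", "B488", "T687", "A111", "G888", "T444",
              "T444", "P867", "R444", "B323", "F888", "P867"),
  Code = c(1222, 4534, 543, 1222, 4678, 6544, 6544, 9898, 8888, 5656, 6666, 9898),
  Year = c(2016, 2016, 2017, 2017, 2017, 2017, 2016, 2016, 2016, 2017, 2017, 2017)
)

# Split by year
DF1 <- DF %>% filter(Year == 2017)
DF2 <- DF %>% filter(Year != 2017)

# Deduplicate within the 2017 rows only, then stack the two parts
DF3 <- DF1 %>% distinct(ID, .keep_all = TRUE)
df_all <- bind_rows(DF2, DF3)

# Assumed alternative: keep only the 2017 rows whose ID has no row in another year
df_alt <- bind_rows(DF2, DF1 %>% anti_join(DF2, by = "ID"))
```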
