Subset the non-matching rows of a data frame based on values in multiple columns of a second data frame

Time: 2022-10-11 01:38:44

I would like to "extract" rows from a tab-separated data frame (df1) whose entries in columns 1, 2, and 3 (and only those columns) are not found in a second data frame (df2), while keeping all of the column values from df1.

Here is a minimal example:

df1

chr     start       end         pvalue      S1          S2
chr10   100028205   100028508   8.97E-01    3.0373832   3.6170213
chr10   100227439   100227832   5.04E-14    10.6730769  2.7279813
chr10   100992157   100992687   6.66E-03    12.6997477  17.3807599
chr10   100993821   100994188   9.94E-01    2.4369017   2.2819886
chr10   101089011   101090655   1.48E-07    6.6696846   9.3321407
chr10   101190452   101190925   5.37E-01    0.9708738   0.5974608
chr10   101279942   101280382   4.72E-03    7.2614108   11.8119266
chr10   101281182   101282116   1.34E-01    20.0733945  22.3736969
chr10   101282726   101282934   3.02E-01    15.7142857  19.6261682
chr10   101287163   101287920   6.95E-01    24.543379   25.7190265

My actual data set will have a variety of chromosome names in "chr", thousands more rows, and a few more columns of data.

df2

chr     start       end         
chr10   100227439   100227832   
chr10   100992157   100992687   
chr10   101089011   101090655   
chr10   101287163   101287920   

Desired output:

df3

chr     start       end         pvalue      S1          S2
chr10   100028205   100028508   8.97E-01    3.0373832   3.6170213
chr10   100993821   100994188   9.94E-01    2.4369017   2.2819886
chr10   101190452   101190925   5.37E-01    0.9708738   0.5974608
chr10   101279942   101280382   4.72E-03    7.2614108   11.8119266
chr10   101281182   101282116   1.34E-01    20.0733945  22.3736969
chr10   101282726   101282934   3.02E-01    15.7142857  19.6261682

I have tried a variety of commands including:

df3 <- df1[!(df1[,1:3] %in% df2[,1:3])]

which returns all of df1, and

df3 <- df1[!(df1$chr & df1$start & df1$end) %in% df2$chr & df2$start & df2$end]

which throws an error.

1 solution

#1


Assuming both df1 and df2 are data frames, dplyr's anti_join() keeps the rows of df1 that have no match in df2 on the columns the two share:

library(dplyr)
anti_join(df1, df2)
# Joining by: c("chr", "start", "end")
#     chr     start       end  pvalue         S1         S2
# 1 chr10 101282726 101282934 0.30200 15.7142857 19.6261682
# 2 chr10 101281182 101282116 0.13400 20.0733945 22.3736969
# 3 chr10 101279942 101280382 0.00472  7.2614108 11.8119266
# 4 chr10 101190452 101190925 0.53700  0.9708738  0.5974608
# 5 chr10 100993821 100994188 0.99400  2.4369017  2.2819886
# 6 chr10 100028205 100028508 0.89700  3.0373832  3.6170213
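
If you would rather not depend on dplyr, here is a minimal base R sketch of the same idea. It rebuilds the example data from the question (S1 and S2 are left out for brevity) and pastes the three key columns of each row into one string, so that %in% compares rows. The first attempt in the question fails because %in% applied to data frames compares whole columns rather than rows. The names key_cols and make_key are made up for this illustration and are not part of any package.

```r
# Example data from the question; S1 and S2 are omitted to keep this short.
df1 <- data.frame(
  chr    = rep("chr10", 10),
  start  = c(100028205, 100227439, 100992157, 100993821, 101089011,
             101190452, 101279942, 101281182, 101282726, 101287163),
  end    = c(100028508, 100227832, 100992687, 100994188, 101090655,
             101190925, 101280382, 101282116, 101282934, 101287920),
  pvalue = c(8.97e-01, 5.04e-14, 6.66e-03, 9.94e-01, 1.48e-07,
             5.37e-01, 4.72e-03, 1.34e-01, 3.02e-01, 6.95e-01),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  chr   = rep("chr10", 4),
  start = c(100227439, 100992157, 101089011, 101287163),
  end   = c(100227832, 100992687, 101090655, 101287920),
  stringsAsFactors = FALSE
)

# Hypothetical helper names: the columns that identify a row, and a
# function that collapses them into one string per row.
key_cols <- c("chr", "start", "end")
make_key <- function(df) do.call(paste, c(df[key_cols], sep = "\t"))

# Keep the rows of df1 whose key does not occur among the keys of df2.
df3 <- df1[!(make_key(df1) %in% make_key(df2)), ]
df3
```

Unlike the anti_join() output above, this keeps the rows in their original df1 order. It assumes the key columns have the same types in both data frames, so that equal values paste to identical strings.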
