How do I remove rows from a data.frame where two specific columns both have missing values?

Date: 2021-10-03 19:19:19

Say I write the following code to produce a dataframe:

name <- c("Joe","John","Susie","Mack","Mo","Curly","Jim")
age <- c(1,2,3,NaN,4,5,NaN)
DOB <- c(10000, 12000, 16000, NaN, 18000, 20000, 22000)
DOB <- as.Date(DOB, origin = "1960-01-01")
trt <- c(0, 1, 1, 2, 2, 1, 1)
df <- data.frame(name, age, DOB, trt)

that looks like this:

   name age        DOB trt
1   Joe   1 1987-05-19   0
2  John   2 1992-11-08   1
3 Susie   3 2003-10-22   1
4  Mack NaN       <NA>   2
5    Mo   4 2009-04-13   2
6 Curly   5 2014-10-04   1
7   Jim NaN 2020-03-26   1

How can I remove only the rows where both age and DOB are missing? For example, I'd like a new dataframe (df2) to look like this:

   name age        DOB trt
1   Joe   1 1987-05-19   0
2  John   2 1992-11-08   1
3 Susie   3 2003-10-22   1
5    Mo   4 2009-04-13   2
6 Curly   5 2014-10-04   1
7   Jim NaN 2020-03-26   1

I've tried the following code, but it deleted too many rows:

df2 <- df[!(is.na(df$age)) & !(is.na(df$DOB)), ]

In SAS, I would just write WHERE missing(age) ge 1 AND missing(DOB) ge 1 in a DATA step, but obviously R has different syntax.

Thanks in advance!

2 answers

#1


1  

To remove the rows where both columns (age and DOB) are NA, i.e. where the two columns together contain more than one NA (which can only mean exactly two), you can use, for example:

df[!is.na(df$age) | !is.na(df$DOB),]

which keeps every row where at least one of the two columns is not NA, or

df[rowSums(is.na(df[2:3])) < 2L,]

which keeps every row where the number of NAs in columns 2 and 3 is less than 2 (that is, 0 or 1), or, very similarly, selecting the columns by name:

df[rowSums(is.na(df[c("age", "DOB")])) < 2L,]
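
The rowSums idea extends to any set of columns by comparing the NA count against the number of columns checked. Below is a minimal sketch of that; the helper name drop_if_all_missing and its arguments are made up for illustration and do not come from any package.

```r
# Rebuild the example data from the question.
name <- c("Joe", "John", "Susie", "Mack", "Mo", "Curly", "Jim")
age  <- c(1, 2, 3, NaN, 4, 5, NaN)
DOB  <- as.Date(c(10000, 12000, 16000, NaN, 18000, 20000, 22000),
                origin = "1960-01-01")
trt  <- c(0, 1, 1, 2, 2, 1, 1)
df   <- data.frame(name, age, DOB, trt)

# Hypothetical helper (not from any package): drop the rows in which
# every column listed in `cols` is missing.
drop_if_all_missing <- function(data, cols) {
  # Count the missing values per row, looking only at `cols`.
  n_missing <- rowSums(is.na(data[cols]))
  # Keep a row unless all of the checked columns are missing.
  data[n_missing < length(cols), , drop = FALSE]
}

df2 <- drop_if_all_missing(df, c("age", "DOB"))
df2
```

With the example data this should drop only Mack's row, since Jim still has a DOB.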

And of course there are other options, like what @rawr provided in the comments.
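
One further option in base R is subset(), which lets the condition refer to the columns by name. This is only a sketch of one alternative; it is not necessarily what was suggested in the comments, which are not reproduced here.

```r
# Rebuild the example data from the question.
df <- data.frame(
  name = c("Joe", "John", "Susie", "Mack", "Mo", "Curly", "Jim"),
  age  = c(1, 2, 3, NaN, 4, 5, NaN),
  DOB  = as.Date(c(10000, 12000, 16000, NaN, 18000, 20000, 22000),
                 origin = "1960-01-01"),
  trt  = c(0, 1, 1, 2, 2, 1, 1)
)

# subset() evaluates the condition inside the data frame, so `age` and
# `DOB` refer to its columns. A row is dropped only when both are missing.
df2 <- subset(df, !(is.na(age) & is.na(DOB)))
df2
```

subset() is convenient interactively; inside functions, the explicit df[...] indexing shown above is usually the safer choice.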

To better understand the subsetting, look at the intermediate results:

rowSums(is.na(df[2:3]))
#[1] 0 0 0 2 0 0 1

rowSums(is.na(df[2:3])) < 2L
#[1]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE

#2


0  

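You were pretty close. Your version keeps a row only when neither column is missing; to drop only the rows where both are missing, negate the combined test instead: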

df[!(is.na(df$age) & is.na(df$DOB)), ]

or

df[!is.na(df$age) | !is.na(df$DOB), ]
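
To see the difference concretely, here is a small sketch comparing the three filters as plain logical vectors on the example data. The variable names (age_missing, dob_missing, attempt, fix_a, fix_b) are only illustrative.

```r
# Rebuild the example data from the question.
df <- data.frame(
  name = c("Joe", "John", "Susie", "Mack", "Mo", "Curly", "Jim"),
  age  = c(1, 2, 3, NaN, 4, 5, NaN),
  DOB  = as.Date(c(10000, 12000, 16000, NaN, 18000, 20000, 22000),
                 origin = "1960-01-01"),
  trt  = c(0, 1, 1, 2, 2, 1, 1)
)

age_missing <- is.na(df$age)  # is.na() is TRUE for NaN as well as NA
dob_missing <- is.na(df$DOB)

# The attempt from the question: keep a row only if NEITHER value is missing.
attempt <- !age_missing & !dob_missing
attempt
# [1]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE   (Jim is lost too)

# The two fixes: keep a row unless BOTH values are missing.
fix_a <- !(age_missing & dob_missing)
fix_b <- !age_missing | !dob_missing
fix_a
# [1]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE

# By De Morgan's law the two fixes are the same filter.
identical(fix_a, fix_b)
# [1] TRUE
```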
