Subsetting a data.table using != also excludes NA

Time: 2020-12-16 22:47:20

I have a data.table with a column that has NAs. I want to drop rows where that column takes a particular value (which happens to be ""). However, my first attempt led me to lose the rows with NAs as well:

> a = c(1,"",NA)
> x <- data.table(a);x
    a
1:  1
2:   
3: NA
> y <- x[a!=""];y
   a
1: 1

After looking at ?`!=`, I found a one-liner that works, but it's a pain:

> z <- x[!sapply(a,function(x)identical(x,""))]; z
    a
1:  1
2: NA

I'm wondering if there's a better way to do this? Also, I see no good way of extending this to excluding multiple non-NA values. Here's a bad way:

>     drop_these <- function(these,where){
+         argh <- !sapply(where,
+             function(x)unlist(lapply(as.list(these),function(this)identical(x,this)))
+         )
+         if (is.matrix(argh)){argh <- apply(argh,2,all)}
+         return(argh)
+     }
>     x[drop_these("",a)]
    a
1:  1
2: NA
>     x[drop_these(c(1,""),a)]
    a
1: NA

I looked at ?J and tried things out with a data.frame, which seems to work differently, keeping NAs when subsetting:

> w <- data.frame(a,stringsAsFactors=F); w
     a
1    1
2     
3 <NA>
> d <- w[a!="",,drop=F]; d
      a
1     1
NA <NA>

3 Answers

#1


15  

To provide a solution to your question:

You should use %in%. It gives you back a logical vector that never contains NA, because an NA in a simply counts as "no match".

a %in% ""
# [1] FALSE  TRUE FALSE

x[!a %in% ""]
#     a
# 1:  1
# 2: NA
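
The question also asked how to exclude several non-NA values at once. A minimal sketch in base R, assuming the same vector a as in the question; drop_values is a hypothetical name, and the resulting logical vector could be used as the i argument of a data.table in the same way as above:

```r
# Same data as in the question; c(1, "", NA) is coerced to character.
a <- c(1, "", NA)

# Values to drop (hypothetical name). Note that 1 is stored as "1".
drop_values <- c("1", "")

# %in% never returns NA: a missing value just counts as "not in the set".
keep <- !(a %in% drop_values)
print(keep)     # FALSE FALSE  TRUE
print(a[keep])  # NA
```

With a data.table this would presumably be written x[!a %in% drop_values], replacing the drop_these helper from the question.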

To find out why this is happening in data.table:

(as opposed to data.frame)

If you look at the data.table source code, in the file data.table.R under the function "[.data.table", there's a set of if-statements that check the i argument. One of them is:

if (!missing(i)) {
    # Part (1)
    isub = substitute(i)

    # Part (2)
    if (is.call(isub) && isub[[1L]] == as.name("!")) {
        notjoin = TRUE
        if (!missingnomatch) stop("not-join '!' prefix is present on i but nomatch is provided. Please remove nomatch.");
        nomatch = 0L
        isub = isub[[2L]]
    }

    .....
    # "isub" is being evaluated using "eval" to result in a logical vector

    # Part 3
    if (is.logical(i)) {
        # see DT[NA] thread re recycling of NA logical
        if (identical(i,NA)) i = NA_integer_  
        # avoids DT[!is.na(ColA) & !is.na(ColB) & ColA==ColB], just DT[ColA==ColB]
        else i[is.na(i)] = FALSE  
    }
    ....
}
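
The check in part 2 can be tried in isolation. This is only a sketch of the idea, not the package's code; starts_with_not is a hypothetical helper that captures its argument unevaluated, as substitute(i) does in the excerpt, and tests whether the outermost function of the call is `!`:

```r
# Hypothetical helper: TRUE when the unevaluated argument is a call
# whose outermost function is `!` (the not-join prefix in the excerpt).
starts_with_not <- function(i) {
  isub <- substitute(i)   # capture the expression without evaluating it
  is.call(isub) && identical(isub[[1L]], as.name("!"))
}

print(starts_with_not(a != ""))      # FALSE: the outermost call is `!=`
print(starts_with_not(!(a == "")))   # TRUE
print(starts_with_not(!a %in% ""))   # TRUE: parsed as !(a %in% "")
```

Because the argument is never evaluated, a does not even have to exist for this check to run.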

To explain the discrepancy, I've pasted the important piece of code above and marked it in 3 parts.

First, why doesn't x[a != ""] work as the OP expected?

Part 1 produces an object of class call. The second condition of the if statement in part 2 is FALSE, because the outermost function of the call is != rather than !. The call is then evaluated, giving c(TRUE, FALSE, NA). Then part 3 is executed, so the NA is replaced with FALSE (the last line of the is.logical block).
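
The same steps can be reproduced on a plain vector. This base R sketch only mimics the quoted excerpt (it does not call data.table), using the vector a from the question:

```r
a <- c(1, "", NA)       # character vector: "1", "", NA

i <- a != ""            # TRUE FALSE NA: comparing with NA yields NA
i[is.na(i)] <- FALSE    # what part 3 does: NA is treated as FALSE
rows <- which(i)        # only position 1 survives

print(rows)     # 1
print(a[rows])  # "1"
```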

Why does x[!(a == "")] work as the OP expected?

Part 1 returns a call once again. But this time the condition in part 2 is TRUE, so it sets:

1) `notjoin = TRUE`
2) isub <- isub[[2L]] # which is equal to (a == "") without the ! (exclamation)

That is where the magic happens. The negation has been stripped for now, and isub is still an object of class call. It is then evaluated (using eval) to a logical vector, so (a == "") evaluates to c(FALSE, TRUE, NA).

Next, part 3 checks is.logical, and the NA is replaced with FALSE, giving c(FALSE, TRUE, FALSE). At some later point a which(c(F,T,F)) is executed, which returns 2 here. Because notjoin = TRUE (from part 2), seq_len(nrow(x))[-2] = c(1,3) is returned. So x[!(a == "")] effectively returns x[c(1,3)], which is the desired result. Here's the relevant code snippet:

if (notjoin) {
    if (bywithoutby || !is.integer(irows) || is.na(nomatch)) stop("Internal error: notjoin but bywithoutby or !integer or nomatch==NA")
    irows = irows[irows!=0L]
    # WHERE MAGIC HAPPENS (returns c(1,3))
    i = irows = if (length(irows)) seq_len(nrow(x))[-irows] else NULL  # NULL meaning all rows i.e. seq_len(nrow(x))
    # Doing this once here, helps speed later when repeatedly subsetting each column. R's [irows] would do this for each
    # column when irows contains negatives.
}
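
Putting the pieces together, the not-join path can be walked through in base R. This is a sketch that follows the excerpts quoted in this answer, not the package's actual implementation, and other data.table versions may handle a `!` prefix on a logical i differently:

```r
a <- c(1, "", NA)

# Part 1: the unevaluated i expression.
isub <- quote(!(a == ""))

# Part 2: detect the leading `!` and strip it.
notjoin <- is.call(isub) && identical(isub[[1L]], as.name("!"))
if (notjoin) isub <- isub[[2L]]

# Evaluate what is left, then part 3: NA becomes FALSE.
i <- eval(isub)            # FALSE  TRUE    NA
i[is.na(i)] <- FALSE       # FALSE  TRUE FALSE
irows <- which(i)          # 2

# Not-join step: keep every row that was NOT matched.
kept <- if (length(irows)) seq_along(a)[-irows] else seq_along(a)
print(kept)     # 1 3
print(a[kept])  # "1" NA
```

The NA row survives only because the NA was turned into FALSE before the negation was applied.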

Given that, I think there are some inconsistencies in the syntax. If I manage to find time to formulate the problem, I'll write a post soon.

#2


3  

As you have already figured out, this is the reason:

a != ""
#[1]  TRUE FALSE    NA

You can do what you figured out already, i.e. x[is.na(a) | a != ""] or you could setkey on a and do the following:

setkey(x, a)
x[!J("")]

#3


3  

Background answer from Matthew:

The behaviour with != on NA, as highlighted by this question, wasn't intended, thinking about it. The original intention was indeed to differ from [.data.frame with respect to == and NA, and I believe everyone is happy with that. For example, FAQ 2.17 has:

DT[ColA==ColB] is simpler than DF[!is.na(ColA) & !is.na(ColB) & ColA==ColB,]

That convenience is achieved by dint of:

DT[c(TRUE,NA,FALSE)] treats the NA as FALSE, but DF[c(TRUE,NA,FALSE),] returns NA rows for each NA
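
The data.frame half of that statement is easy to check in base R. In the sketch below, the "DT-like" result is only an emulation obtained by replacing NA with FALSE first; it does not call data.table:

```r
DF <- data.frame(a = c(1, "", NA), stringsAsFactors = FALSE)
i  <- c(TRUE, NA, FALSE)

# data.frame: an NA in the row index produces an all-NA row.
df_rows <- DF[i, , drop = FALSE]
print(nrow(df_rows))   # 2
print(df_rows$a)       # "1" NA

# Emulating the data.table convention: NA in i is treated as FALSE.
i_dt <- i
i_dt[is.na(i_dt)] <- FALSE
dt_like <- DF[i_dt, , drop = FALSE]
print(nrow(dt_like))   # 1
```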

The motivation is not just convenience but speed, since each and every !, is.na, & and == is itself a vector scan, with memory allocated for each of its results (explained in the intro vignette). So although x[is.na(a) | a!=""] is a working solution, it's exactly the type of logic I was trying to avoid needing in data.table. x[!a %in% ""] is slightly better; i.e., 2 scans (%in% and !) rather than 3 (is.na, | and !=). But really x[a != ""] should do what Frank expected (include NA) in a single scan.

A new feature request has been filed, which links back to this question:

DT[col!=""] should include NA

Thanks to Frank, Eddi and Arun. If I haven't understood correctly, feel free to correct me; otherwise the change will get made eventually. It will need to be done in a way that considers compound expressions; e.g., DT[colA=="foo" & colB!="bar"] should exclude rows with NA in colA but include rows where colA is non-NA but colB is NA. Similarly, DT[colA!=colB] should include rows where either colA or colB is NA but not both. And perhaps DT[colA==colB] should include rows where both colA and colB are NA (which it doesn't currently, I believe).
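
Until such a change exists, the first compound case described above can be approximated with %in%. This is a workaround sketch in base R, not the proposed feature; the vectors colA and colB are made-up illustration data:

```r
# Made-up illustration data.
colA <- c("foo", "foo", NA,    "baz", "foo")
colB <- c("bar", NA,    "qux", "qux", "qux")

# Plain comparison: any NA on either side makes the row's result NA or FALSE.
plain <- colA == "foo" & colB != "bar"
print(which(plain))   # 5  (which() drops NA, like treating NA as FALSE)

# %in% version: NA in colA excludes the row, NA in colB does not.
keep <- colA %in% "foo" & !(colB %in% "bar")
print(which(keep))    # 2 5
```

Row 2 (colA is "foo", colB is NA) is kept and row 3 (colA is NA) is dropped, which matches the behaviour asked for in the paragraph above.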
