Efficiently counting non-NA elements in a data.table

Time: 2023-01-13 22:47:17

Sometimes I need to count the number of non-NA elements in one or another column in my data.table. What is the best data.table-tailored way to do so?

For concreteness, let's work with this:

library(data.table)

DT <- data.table(id = sample(100, size = 1e6, replace = TRUE),
                 var = sample(c(1, 0, NA), size = 1e6, replace = TRUE), key = "id")

The first thing that comes to my mind works like this:

DT[!is.na(var), N := .N, by = id]

But this has the unfortunate shortcoming that N does not get assigned to any row where var is missing, i.e. DT[is.na(var), N] is NA.
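
For example, a quick check (just a sketch) after running the assignment above:

DT[is.na(var), unique(N)]  # NA: these rows were never assigned a count
DT[is.na(var), .N]         # roughly a third of the rows are affected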

So I work around this by appending:

DT[!is.na(var), N := .N, by = id][ , N := max(N, na.rm = TRUE), by = id] # OPTION 1

However, I'm not sure this is the best approach. Another option I thought of, and one suggested by the analogous question for data.frames, would be:

DT[ , N := length(var[!is.na(var)]), by = id] # OPTION 2

and

DT[ , N := sum(!is.na(var)), by = id] # OPTION 3
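
As a sanity check (a sketch; N2 and N3 are helper column names I'm introducing here, not from the post), all three options should give the same per-id count once OPTION 1 has been run:

DT[ , N2 := length(var[!is.na(var)]), by = id]  # OPTION 2, stored separately
DT[ , N3 := sum(!is.na(var)), by = id]          # OPTION 3, stored separately
DT[ , all(N == N2 & N2 == N3)]                  # should be TRUE (every id here has at least one non-NA var)
DT[ , c("N2", "N3") := NULL]                    # drop the helper columns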

Comparing the computation times of these (averaged over 100 trials, in seconds), the last seems to be the fastest:

OPTION 1 | OPTION 2 | OPTION 3
  0.075  |   0.065  |   0.043

Does anyone know a speedier way for data.table?

1 Answer

#1



Yes, option 3 seems to be the best one. I've added another approach, which is valid only if you're willing to change the key of your data.table from id to var, but option 3 is still the fastest on your data.

library(microbenchmark)
library(data.table)

dt <- data.table(id  = (1:100)[sample(10, size = 1e6, replace = TRUE)],
                 var = c(1, 0, NA)[sample(3, size = 1e6, replace = TRUE)],
                 key = "var")

dt1 <- copy(dt)
dt2 <- copy(dt)
dt3 <- copy(dt)
dt4 <- copy(dt)

microbenchmark(times=10L,
               dt1[!is.na(var),.N,by=id][,max(N,na.rm=T),by=id], # OPTION 1 (aggregation form)
               dt2[,length(var[!is.na(var)]),by=id],             # OPTION 2
               dt3[,sum(!is.na(var)),by=id],                     # OPTION 3
               dt4[.(c(1,0)),.N,id,nomatch=0L])                  # keyed subset on var (the added approach)
# Unit: milliseconds
#                                                         expr      min       lq      mean    median        uq       max neval
#  dt1[!is.na(var), .N, by = id][, max(N, na.rm = T), by = id] 95.14981 95.79291 105.18515 100.16742 112.02088 131.87403    10
#                     dt2[, length(var[!is.na(var)]), by = id] 83.17203 85.91365  88.54663  86.93693  89.56223 100.57788    10
#                             dt3[, sum(!is.na(var)), by = id] 45.99405 47.81774  50.65637  49.60966  51.77160  61.92701    10
#                        dt4[.(c(1, 0)), .N, id, nomatch = 0L] 78.50544 80.95087  89.09415  89.47084  96.22914 100.55434    10
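
If you want the count attached to every row as a column N (as in the question) rather than returned as an aggregated table, the keyed-subset result can be written back with an update join. A sketch (the cnt name is mine, not part of the answer):

cnt <- dt4[.(c(1, 0)), .N, by = id, nomatch = 0L]  # per-id count of non-NA var, using the key on var
dt4[cnt, N := i.N, on = "id"]                      # update join: write each id's count onto all its rows
dt4[is.na(N), N := 0L]                             # any id whose var is entirely NA (none here) would otherwise stay NA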
