What is the most efficient way to split a combined factor column into two factor columns in an R data.table?

Time: 2023-01-01 22:24:03

I have a large data.table (9 million rows) with two columns: fcombined and value. fcombined is a factor, but it is actually the result of interacting two factors. What is the most efficient way to split this one factor column back into two? I already have a solution that works reasonably well, but maybe there is a more straightforward way that I have missed. The working example is:

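library(data.table)  # needed for data.table(), setkey(), setnames()
library(stringr)     # needed for str_split_fixed()

# Two factors with 20 levels each, combined into one 400-level factor
f1=1:20
f2=1:20
g=expand.grid(f1,f2)
combinedfactor=as.factor(paste(g$Var1,g$Var2,sep="_"))
largedata=1:10^6
DT=data.table(fcombined=combinedfactor,value=largedata)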


splitfactorcol=function(res,colname,splitby="_",namesofnewcols){#the nr. of cols retained is length(namesofnewcols)
  helptable=data.table(.factid=seq_along(levels(res[[colname]])) ,str_split_fixed(levels(res[[colname]]),splitby,length(namesofnewcols)))
  setnames(helptable,colnames(helptable),c(".factid",namesofnewcols))
  setkey(helptable,.factid)
  res$.factid=unclass(res[[colname]])
  setkey(res,.factid)
  m=merge(res,helptable)
  m$.factid=NULL
  m
}
splitfactorcol(DT,"fcombined",splitby="_",c("f1","f2"))

1 solution

#1


3  

I think this does the trick and is about 5x faster.

setkey(DT, fcombined)
DT[DT[, data.table(fcombined = levels(fcombined),
                   do.call(rbind, strsplit(levels(fcombined), "_")))]]

I split the levels and then simply merged that result back into the original data.table.
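
The same idea (split the levels once, then map the pieces back to the rows) can also be sketched without a join, by indexing the split levels with the factor's integer codes. This is only a variant under stated assumptions, not the code from the answer above: it uses data.table's tstrsplit(), the column names f1 and f2 are chosen for illustration, and the toy table is built with 1e5 rows rather than the original size.

```r
library(data.table)

# Toy data mirroring the question: 400 combined levels, repeated to 1e5 rows
# (row count and column names f1/f2 are assumptions for illustration)
g  <- expand.grid(1:20, 1:20)
f  <- factor(paste(g$Var1, g$Var2, sep = "_"))
DT <- data.table(fcombined = rep(f, length.out = 1e5), value = 1:1e5)

# Split each distinct level once (400 strings instead of 1e5)
parts <- tstrsplit(levels(DT$fcombined), "_", fixed = TRUE)

# Map the per-level pieces back to the rows via the factor's integer codes
codes <- as.integer(DT$fcombined)
DT[, c("f1", "f2") := lapply(parts, function(p) factor(p[codes]))]
```

Because the new columns are added by reference with :=, DT keeps its original row order and no key has to be set. Whether this is faster than the keyed join would need to be measured on the real data.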

Btw, in my tests strsplit was about 2x faster (for this task) than the stringr function.
