How to form combinations of variables from a data frame in R?

Dear friends, I would appreciate it if someone could help me with a question in R. I have a data frame with 8 variables, say (v1, v2, ..., v8). I would like to produce groups of datasets based on all possible combinations of these variables. That is, with a set of 8 variables I can produce 2^8-1=255 subsets of variables, like {v1}, {v2}, ..., {v8}, {v1,v2}, ..., {v1,v2,v3}, ..., {v1,v2,...,v8}. My goal is to compute a specific statistic for each of these groupings and then compare which subset produces a better statistic. My problem is how to produce these combinations. Thanks in advance.

2 solutions

#1

You need the function combn. It creates all the combinations of the elements of a vector that you provide, taken m at a time. For instance, in your example:

names(yourdataframe) <- c("V1","V2","V3","V4","V5","V6","V7","V8")
varnames <- names(yourdataframe)
combn(x = varnames,m = 3)

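This gives you all combinations (not permutations, since order does not matter) of V1-V8 taken 3 at a time, returned as a matrix with one combination per column, so choose(8, 3) = 56 columns.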
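
Since the question asks for subsets of every size rather than only size 3, one way to extend this is to call combn once for each m from 1 to 8 and collect the results. The following is a minimal sketch in base R, not part of the original answer; the names varnames and subsets are placeholders chosen for illustration.

```r
# Placeholder variable names standing in for the columns of the data frame
varnames <- paste0("V", 1:8)

# For each subset size m, combn(..., simplify = FALSE) returns a list of
# character vectors, one per combination; unlist(recursive = FALSE)
# flattens the list of lists by a single level
subsets <- unlist(
  lapply(seq_along(varnames), function(m) combn(varnames, m, simplify = FALSE)),
  recursive = FALSE
)

length(subsets)  # 255, i.e. 2^8 - 1 non-empty subsets
```

Each element of subsets is a character vector of column names, so yourdataframe[subsets[[i]]] would select the columns of the i-th subset.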

#2

I'll use data.table instead of data.frame, and I'll include an extraneous variable (the id column) for robustness.

This will get you your subsetted data frames:

library(data.table)

nn<-8L

dt<-setnames(as.data.table(cbind(1:100,matrix(rnorm(100*nn),ncol=nn))),
     c("id",paste0("V",1:nn)))

#should be a smarter (read: more easily generalized) way to produce this,
# but it's eluding me for now...
#basically, this generates the indices to include when subsetting
x<-cbind(rep(c(0,1),each=128),
         rep(rep(c(0,1),each=64),2),
         rep(rep(c(0,1),each=32),4),
         rep(rep(c(0,1),each=16),8),
         rep(rep(c(0,1),each=8),16),
         rep(rep(c(0,1),each=4),32),
         rep(rep(c(0,1),each=2),64),
         rep(c(0,1),128)) * 
  t(matrix(rep(1:nn),2^nn,nrow=nn))

#now get the correct column names for each subset
# by subscripting the nonzero elements
incl<-lapply(1:(2^nn),function(y){paste0("V",1:nn)[x[y,][x[y,]!=0]]})

#now subset the data.table for each subset
ans<-lapply(1:(2^nn),function(y){dt[,incl[[y]],with=FALSE]})

You said you wanted some statistics from each subset, in which case it may be more useful to instead specify the last line as:

ans2<-lapply(1:(2^nn),function(y){unlist(dt[,incl[[y]],with=FALSE])})

#exclude the first element, which is NULL (it corresponds to the empty subset)
means<-lapply(2:(2^nn),function(y){mean(ans2[[y]])})
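
The hand-written indicator matrix above is tied to nn = 8 through the hard-coded repeat counts (128, 64, and so on). One possible generalization, sketched here in base R under the assumption that a plain data frame is acceptable, is to let expand.grid enumerate the include/exclude flags. The names grid, df and varnames are placeholders, and the random data only mimics the example above.

```r
set.seed(1)
nn <- 8L
varnames <- paste0("V", seq_len(nn))

# Illustrative data: 100 rows of random values in nn named columns
df <- as.data.frame(matrix(rnorm(100 * nn), ncol = nn,
                           dimnames = list(NULL, varnames)))

# One row per subset, one logical column per variable (2^nn rows in total)
grid <- expand.grid(rep(list(c(FALSE, TRUE)), nn))

# Turn each row of flags into the names of the variables it selects
incl <- lapply(seq_len(nrow(grid)), function(i) varnames[unlist(grid[i, ])])

# Drop the empty subset (the row where every flag is FALSE)
incl <- incl[lengths(incl) > 0]

# Same statistic as above: the mean of all values pooled over the subset's columns
means <- vapply(incl, function(v) mean(unlist(df[v])), numeric(1))
```

Note that expand.grid varies its first column fastest, so the subsets come out in a different order than in the matrix x above, although the set of subsets is the same. Changing nn is enough to handle a different number of variables.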
