Subtracting group means in bigmemory in R

Time: 2023-01-02 22:01:08

I would like to demean variables stored in a big.matrix (panel) structure. I tried different methods, but the one that works in a bigmemory setting is tapply (provided by the bigtabulate package). I have the following code to calculate the means of the variable var1 within the groups given by panel_id:

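library(bigmemory)

# Read the CSV into a file-backed big.matrix, then attach it via its descriptor file
data <- read.big.matrix("data.csv", sep = ",", header = TRUE, type = "double",
                        backingfile = "backing.bin", descriptorfile = "data.desc")
xdesc <- dget("data.desc")
data <- attach.big.matrix(xdesc)

# Mean of var1 within each panel_id group
mean_var1 <- tapply(data[, "var1"], data[, "panel_id"], mean, na.rm = TRUE)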

Since var1 and mean_var1 have different lengths, I cannot simply subtract one from the other to demean the variable. Do you have any ideas how to subtract its group mean from each observation of var1?

1 solution

#1


The simplest approach would probably be to use the bigsplit function and a for loop for in-place modification.

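# Row indices for each group of panel_id
idx <- bigsplit(data, "panel_id")

# Assumes idx and mean_var1 list the groups in the same order
for (i in seq_along(idx)) {
    data[idx[[i]], "var1"] <- data[idx[[i]], "var1"] - mean_var1[i]
}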

It sounds like you want the in-place version above, but if you wanted a subset of a reasonable size returned (i.e. one that does not exceed RAM), you could use lapply or even explore some parallelization with foreach:

# use lapply (returns one demeaned vector of var1 per group)
lapply(seq_along(idx), function(x) data[idx[[x]], "var1"] - mean_var1[[x]])

# use foreach (don't forget to register your backend!)
library(foreach)
foreach(iter = seq_along(idx)) %dopar% {
    data[idx[[iter]], "var1"] - mean_var1[iter]
}
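
If the two columns fit in memory once extracted (which the tapply call in the question already assumes), another option is to expand the group means to one value per observation and subtract in a single vectorized step. The following is only a minimal sketch in base R: the toy vectors panel_id and var1 are hypothetical stand-ins for data[, "panel_id"] and data[, "var1"], and the names group_mean, var1_demeaned and var1_demeaned2 are made up for illustration.

```r
# Hypothetical toy data standing in for data[, "panel_id"] and data[, "var1"]
panel_id <- c(1, 1, 2, 2, 2, 3)
var1     <- c(10, 14, 1, NA, 5, 7)

# Group means, computed as in the question (one value per group)
mean_var1 <- tapply(var1, panel_id, mean, na.rm = TRUE)

# Look up each observation's group mean by group name, giving one value per row
group_mean <- as.numeric(mean_var1[as.character(panel_id)])
var1_demeaned <- var1 - group_mean

# ave() returns the per-observation group mean directly, so this is equivalent
var1_demeaned2 <- var1 - ave(var1, panel_id,
                             FUN = function(v) mean(v, na.rm = TRUE))
```

With a big.matrix, the result could presumably be written back with data[, "var1"] <- var1_demeaned. Because the lookup is by group name, it does not depend on the groups appearing in any particular order.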
