Applying a function across a list of data frames and returning a data frame in R

Time: 2022-11-16 18:35:05

I have a list of data frames of the following form:

str(mylist)
List of 2
 $ df1:'data.frame':    50 obs. of  4 variables:
  ..$ var1: num [1:50] 0.114 0.622 0.609 0.623 0.861 ...
  ..$ var2: num [1:50] -1.221 1.819 0.195 1.232 0.786 ...
  ..$ var3: num [1:50] -0.14 -1.003 -0.352 0.647 0.424 ...
  ..$ Y   : num [1:50] -1.24 1.38 0.3 2.44 2.09 ...
 $ df2:'data.frame':    50 obs. of  4 variables:
  ..$ var1: num [1:50] 0.114 0.622 0.609 0.623 0.861 ...
  ..$ var2: num [1:50] -1.221 1.819 0.195 1.232 0.786 ...
  ..$ var3: num [1:50] -0.14 -1.003 -0.352 0.647 0.424 ...
  ..$ Y   : num [1:50] -1.24 1.38 0.3 2.44 2.09 ...
 - attr(*, "class")= chr [1:2] "mi" "list"

I am trying to compute the element-wise means across the data frames in the list, variable by variable, and return them as a data frame that looks like:

> str(dfnew)
'data.frame':   50 obs. of  4 variables:
 $ var1: num  0.114 0.622 0.609 0.623 0.861 ...
 $ var2: num  -1.221 1.819 0.195 1.232 0.786 ...
 $ var3: num  -0.14 -1.003 -0.352 0.647 0.424 ...
 $ Y   : num  -1.24 1.38 0.3 2.44 2.09 ...

So, something that does...

dfnew[1,1] <- mean(c(mylist[[1]]$var1[1], mylist[[2]]$var1[1]), na.rm=TRUE)
dfnew[2,1] <- mean(c(mylist[[1]]$var1[2], mylist[[2]]$var1[2]), na.rm=TRUE)
...
dfnew[50,1] <- mean(c(mylist[[1]]$var1[50], mylist[[2]]$var1[50]), na.rm=TRUE)
...
dfnew[1,2] <- mean(c(mylist[[1]]$var2[1], mylist[[2]]$var2[1]), na.rm=TRUE)
...
dfnew[50,4] <- mean(c(mylist[[1]]$Y[50], mylist[[2]]$Y[50]), na.rm=TRUE)

I can see how I would do this with a for loop...

...or by creating data frames of each variable,

var1df <- cbind(mylist$df1$var1, mylist$df2$var1)
var2df <- cbind(mylist$df1$var2, mylist$df2$var2) # and if there are up to var1000?...
...
dfnew$var1 <- rowMeans(var1df)
dfnew$var2 <- rowMeans(var2df)
...

But that is more copying than I'd like and seems less than idiomatic R, so I'm trying to do it with one of the apply functions.

Since this is a list, lapply seemed right, except that it works along the wrong margin: it takes the mean within each list element rather than the mean across the list elements.

> lapply(mylist, FUN=mean)
$df1
[1] NA

$df2
[1] NA

Warning messages:
1: In mean.default(X[[1L]], ...) :
  argument is not numeric or logical: returning NA
2: In mean.default(X[[2L]], ...) :
  argument is not numeric or logical: returning NA

lapply has no setting for the other margin, that is, across the list elements rather than within each one.

And regular apply, which does let me set a margin, is upset that this is a list rather than a matrix or data frame.

> apply(mylist, MARGIN = 2, FUN=mean)
Error in apply(mylist, MARGIN = 2, FUN = mean) : 
  dim(X) must have a positive length
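Both failures can be reproduced on a small stand-in list. The sketch below is only an illustration: the list `toy` and its values are made up and are not the Amelia output. It shows that `lapply` hands each whole data frame to the function, so a column-wise function such as `colMeans` works within each element, and that a plain list has no `dim`, which is what `apply` complains about.

```r
# Hypothetical stand-in for mylist: two small data frames with the same shape
toy <- list(
  df1 = data.frame(var1 = c(1, 2), Y = c(10, 20)),
  df2 = data.frame(var1 = c(3, 4), Y = c(30, 40))
)

# lapply visits one data frame at a time, so this gives per-frame column means
within_means <- lapply(toy, colMeans)
print(within_means$df1)

# A plain list has no dim attribute, which is what apply() checks for
print(is.null(dim(toy)))
```

Neither call combines values across `df1` and `df2`, which is the operation the question is after.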

(My actual list has a lot more than 2 data frames, so a lot of the easier loop-based or merge-based solutions get hairy pretty quickly, or at least I'm too clumsy with looping over attributes to know how to do it cleanly for length N.)

Is there something I'm missing in one of the rapply, tapply, eapply, *apply functions that would solve this, or something in general I'm being dumb about?

UPDATE

Thanks everyone for the helpful answers. I ran across this problem when I was testing out the Amelia library for multiple imputation and wanted to look at how far the moments of the simulations spread from the long-term means. (The object it returns is shaped like this: each element corresponds to the original data frame, as described above, and has no missing data.)

Here's a gist I put together fiddling with it.

I like that user20650's answer does not require additional copying (imputer2 in the gist), so when I started expanding to a list of 1000, it became significantly faster than the ones that required merging new data frames.

What was kind of quirky, and what I haven't entirely resolved, is that running imputer1 versus imputer2 produced values that looked identical but for which a == b was FALSE. I assume it is a round-off issue.
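The mismatch is consistent with floating-point round-off, although that is an assumption and not something checked against the gist. As a sketch with made-up values, adding the same numbers in a different order can give results that print identically at default precision yet fail `==`, while `all.equal` compares within a numeric tolerance.

```r
# The same three numbers added in two different orders (illustrative values)
x <- c(0.1, 0.2, 0.3)
a <- x[1] + x[2] + x[3]
b <- x[3] + x[2] + x[1]

print(c(a, b))                  # both print as 0.6 at default precision
print(a == b)                   # exact comparison of the stored doubles
print(isTRUE(all.equal(a, b)))  # comparison within a tolerance
```

When comparing whole data frames produced by two methods, `all.equal(df_a, df_b)` is usually the more appropriate check than `==`.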

I'm also still looking for a way to apply general functions like mean or sd over this construct (without copying) rather than computing them itemwise, but anyway my problem is solved and I'll leave that to another question.
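One possible way to apply an arbitrary summary function cell by cell is sketched below. It stacks the data frames into a three-dimensional array, so it does copy the data and does not meet the "without copying" wish. The helper name `across` and the list `toy` are hypothetical, introduced only for this illustration.

```r
# Hypothetical stand-in list of three equally shaped data frames
toy <- list(
  df1 = data.frame(var1 = c(1, 2), Y = c(10, 20)),
  df2 = data.frame(var1 = c(3, 4), Y = c(30, 40)),
  df3 = data.frame(var1 = c(5, 6), Y = c(50, 60))
)

# Stack the frames into a rows x cols x frames array, then apply `fun`
# over the third dimension for every (row, column) cell.
across <- function(lst, fun, ...) {
  first <- lst[[1]]
  arr <- array(
    unlist(lapply(lst, as.matrix)),
    dim = c(nrow(first), ncol(first), length(lst)),
    dimnames = list(rownames(first), names(first), names(lst))
  )
  as.data.frame(apply(arr, 1:2, fun, ...))
}

cell_means <- across(toy, mean)
cell_sds   <- across(toy, sd)
print(cell_means)
print(cell_sds)
```

Extra arguments such as `na.rm = TRUE` are passed through `...` to the summary function. The sketch assumes every data frame has the same columns in the same order and only numeric columns.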

6 Solutions

#1


2  

# data
l <- list(df1 = mtcars[1:5,1:5] , df2 = mtcars[1:5,1:5], df3 = mtcars[1:5,1:5])

# note you can just add dataframes eg
o1 <- (l[[1]] + l[[2]] + l[[3]])/3

# So if you have many df in list - to get the average by summing and dividing by list length
f <- function(x) Reduce("+", x)
o2 <- f(l)/length(l)

all.equal(o1,o2)
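Adding data frames with `+` propagates any `NA`, whereas the question's pseudo-code asked for `na.rm`. A minimal sketch of one way to handle that, assuming numeric data frames of identical shape: sum the non-missing values and divide by the per-cell count of non-missing values. The helper `mean_na_rm` and the list `toy` are made up for this illustration.

```r
# Hypothetical list with a missing value in each data frame
toy <- list(
  df1 = data.frame(var1 = c(1, NA), Y = c(10, 20)),
  df2 = data.frame(var1 = c(3, 4),  Y = c(30, NA))
)

mean_na_rm <- function(lst) {
  mats <- lapply(lst, as.matrix)
  # Per-cell totals, with NA counted as 0
  totals <- Reduce(`+`, lapply(mats, function(m) replace(m, is.na(m), 0)))
  # Per-cell number of non-missing values
  counts <- Reduce(`+`, lapply(mats, function(m) !is.na(m)))
  as.data.frame(totals / counts)
}

res <- mean_na_rm(toy)
print(res)
```

A cell that is missing in every data frame comes out as `NaN` (0 divided by 0), which matches what `mean(x, na.rm = TRUE)` returns for an all-`NA` vector.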

#2


2  

Yet another option, which converts the list l to an array a (using an approach suggested here) and applies mean over the first two dimensions. This assumes all data frames in l have consistent structure. Here I again use @user20650's example list.

l <- list(df1=mtcars[1:5, 1:5], df2=mtcars[1:5, 1:5], df3=mtcars[1:5, 1:5])
a <- array(unlist(l), dim=c(nrow(l[[1]]), ncol(l[[1]]), length(l)), 
           dimnames=c(dimnames(l[[1]]), list(names(l))))
apply(a, 1:2, mean)

                   mpg cyl disp  hp drat
Mazda RX4         21.0   6  160 110 3.90
Mazda RX4 Wag     21.0   6  160 110 3.90
Datsun 710        22.8   4  108  93 3.85
Hornet 4 Drive    21.4   6  258 110 3.08
Hornet Sportabout 18.7   8  360 175 3.15
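For the specific case of the mean, base R's `rowMeans` accepts a `dims` argument, so the same array can be averaged over its third dimension without `apply`. This is a sketch built on the array construction above, not part of the original answer.

```r
l <- list(df1 = mtcars[1:5, 1:5], df2 = mtcars[1:5, 1:5], df3 = mtcars[1:5, 1:5])
a <- array(unlist(l), dim = c(nrow(l[[1]]), ncol(l[[1]]), length(l)),
           dimnames = c(dimnames(l[[1]]), list(names(l))))

# With dims = 2, rowMeans treats the first two dimensions as the "rows"
# and averages over the remaining (third) dimension.
m <- rowMeans(a, dims = 2)
print(m)
```

`rowMeans` also takes `na.rm = TRUE`, and the result is a matrix that keeps the row and column names of the first two dimensions.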

#3


1  

Try to merge and then calculate your means:

df <- Reduce(rbind, lapply(mylist, function(df) {
  df$id <- seq_len(nrow(df))
  df
}))
df <- aggregate(. ~ id, df, mean)[, -1]

Example

mylist <- lapply(seq_len(3), function(x) iris[, 1:4] + runif(1, 0, 1))
sapply(seq_len(3), function(i) mylist[[i]][1,1])
# [1] 5.368424 6.097071 5.681132
# Apply above code
head(df)
#   Sepal.Length Sepal.Width Petal.Length Petal.Width
# 1     5.715542    4.115542     2.015542   0.8155424
# 2     5.515542    3.615542     2.015542   0.8155424
# 3     5.315542    3.815542     1.915542   0.8155424
# 4     5.215542    3.715542     2.115542   0.8155424
# 5     5.615542    4.215542     2.015542   0.8155424
# 6     6.015542    4.515542     2.315542   1.0155424

Note that mean(c(5.368424, 6.097071, 5.681132)) = 5.715542.

#4


1  

Here is an option with mapply:

as.data.frame(mapply(function(a, b) (a + b) / 2, df.lst[[1]], df.lst[[2]]))

This will work for any number of columns. mapply will cycle through each column from each data frame pairwise.

Here is the data we used:

df.lst <- replicate(2, data.frame(var1=runif(10), var2=sample(1:10)), simplify=FALSE)
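The `mapply` call above names exactly two data frames. One way to extend the same idea to a list of any length, sketched here as an assumption and not as the answerer's method, is to pass the whole list to `Map` through `do.call`. The helper `col_mean` and the seed are made up for the example.

```r
set.seed(1)  # arbitrary seed so the made-up data is reproducible
df.lst <- replicate(3, data.frame(var1 = runif(10), var2 = sample(1:10)),
                    simplify = FALSE)

# Map receives column k of every data frame at once; cbind lines them up
# side by side and rowMeans averages across the data frames.
col_mean <- function(...) rowMeans(cbind(...))
res <- as.data.frame(do.call(Map, c(list(f = col_mean), df.lst)))
print(head(res))
```

The column names of the result come from the first data frame in the list, so this relies on all data frames having the same columns in the same order.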

#5


1  

I think the previous answers will fail (my previous one certainly does) if the data frames do not all contain the same variables, or if the variables are in a different order. The function below is rather horrible, but it seems to work.

l <- list(df1 = mtcars[1:5,1:5] , df2 = mtcars[1:5,1:5], df3 = mtcars[1:5,1:5])

# Allow for different variables
l2 <- list(df1 = mtcars[1:5,1:5] , df2 = mtcars[1:5,2:6], df3 = mtcars[1:5,4:7])

new.f <- function(lst) {
  l <- lst
  un.nm <- unique(unlist(lapply(l, names)))
  # for each variable name, take that column from every data frame (NA if absent)
  o <- lapply(un.nm, function(x) {
    lapply(l, function(z) {
      if (x %in% names(z)) z[x] else NA
    })
  })
  # combine for each variable
  l <- lapply(o, function(x) do.call(cbind, x))
  mn <- lapply(l, rowMeans, na.rm = TRUE)
  names(mn) <- lapply(l, function(i) unique(names(i)[names(i) %in% un.nm]))
  data.frame(do.call(cbind, mn))
}


all.equal(f(l)/length(l) , new.f(l))

f(l2) # fails
# Error in Ops.data.frame(init, x[[i]]) : 
  #+ only defined for equally-sized data frames

new.f(l2)

EDIT

The example in "Join matrices by both colnames and rownames in R" offers a much more concise way to do this when the list elements have different columns.

l <- lapply(l2 , function(i) as.data.frame(as.table(as.matrix(i))))
tmp <- do.call(rbind , l)
tmp <- aggregate(Freq ~ Var1 + Var2, tmp, mean)
xtabs(Freq ~ Var1 + Var2, tmp)

#6


0  

Tested with @user20650's example. The mean of two equal numbers should be the same number.

as.data.frame(setNames(
  lapply(names(mylist[[1]]), function(nm) {
    rowMeans(cbind(mylist[[1]][[nm]], mylist[[2]][[nm]]))
  }),
  names(mylist[[1]])
))
#--------------
   mpg cyl disp  hp drat
1 21.0   6  160 110 3.90
2 21.0   6  160 110 3.90
3 22.8   4  108  93 3.85
4 21.4   6  258 110 3.08
5 18.7   8  360 175 3.15

You read R code from the inside out: For each column name we are using numeric indices to get the dataframes and character indexing to get the columns, which are then 'c-bound' together and passed to rowMeans. This list of rowMean-ed values is then given names with setNames and finally converted to a dataframe.

Note that this does not get all of the dataframes in a list of more than two... only the first two are considered.

> str(mylist)
List of 3
 $ df1:'data.frame':    5 obs. of  5 variables:
  ..$ mpg : num [1:5] 21 21 22.8 21.4 18.7
  ..$ cyl : num [1:5] 6 6 4 6 8
  ..$ disp: num [1:5] 160 160 108 258 360
  ..$ hp  : num [1:5] 110 110 93 110 175
  ..$ drat: num [1:5] 3.9 3.9 3.85 3.08 3.15
 $ df2:'data.frame':    5 obs. of  5 variables:
  ..$ mpg : num [1:5] 21 21 22.8 21.4 18.7
  ..$ cyl : num [1:5] 6 6 4 6 8
  ..$ disp: num [1:5] 160 160 108 258 360
  ..$ hp  : num [1:5] 110 110 93 110 175
  ..$ drat: num [1:5] 3.9 3.9 3.85 3.08 3.15
 $ df3:'data.frame':    5 obs. of  5 variables:
  ..$ mpg : num [1:5] 21 21 22.8 21.4 18.7
  ..$ cyl : num [1:5] 6 6 4 6 8
  ..$ disp: num [1:5] 160 160 108 258 360
  ..$ hp  : num [1:5] 110 110 93 110 175
  ..$ drat: num [1:5] 3.9 3.9 3.85 3.08 3.15
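To use every data frame in the list and not only the first two, the inner `cbind` can be built from the whole list. This is a sketch of one way to do that, assuming all data frames share the same column names; it is not part of the original answer.

```r
mylist <- list(df1 = mtcars[1:5, 1:5], df2 = mtcars[1:5, 1:5], df3 = mtcars[1:5, 1:5])

# For each column name, pull that column out of every data frame,
# bind the columns side by side, and average across them row by row.
all_means <- as.data.frame(setNames(
  lapply(names(mylist[[1]]), function(nm) {
    rowMeans(do.call(cbind, lapply(mylist, `[[`, nm)))
  }),
  names(mylist[[1]])
))
print(all_means)
```

Because columns are looked up by name, this version does not depend on the column order being identical across the data frames.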
