How do I modify a specific field in a list of data frames?

Time: 2023-01-20 08:05:20

Suppose I write the following R code:

first.value <- sample(100, 100, replace=TRUE)
second.value <- sample(10, 100, replace=TRUE)

X <- data.frame(first.value, second.value)
split.X <- split(X, second.value)

This code creates a data frame with two fields and splits it into bins according to the second field. Now suppose I want to normalize each bin, i.e., subtract the bin's mean and divide by its standard deviation. I could accomplish this with:

normalized.first.value <- sapply(split.X, function(X) {(X$first.value - mean(X$first.value)) / sd(X$first.value)})

But this creates a new list with the normalized versions of each bin. What I really want to do is replace the copy of the data in split.X with its normalized version.
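
A side note on the sapply call above: sapply simplifies its result whenever it can, so it only returns a list here because the bins have different sizes. The sketch below uses lapply, which always returns a list. The set.seed call and the helper name normalize are additions for illustration and are not part of the original code.

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible
first.value <- sample(100, 100, replace = TRUE)
second.value <- sample(10, 100, replace = TRUE)
X <- data.frame(first.value, second.value)
split.X <- split(X, second.value)

# Hypothetical helper: centre and scale a numeric vector.
normalize <- function(v) (v - mean(v)) / sd(v)

# lapply always returns a list with one element per bin,
# carrying over the bin names from split.X.
normalized.first.value <- lapply(split.X, function(d) normalize(d$first.value))

is.list(normalized.first.value)
names(normalized.first.value)
```

Either way, the result is a separate object, so split.X itself is still unchanged at this point.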

To illustrate, here's some sample output:

> first.value <- sample(100, 100, replace=TRUE)
> second.value <- sample(10, 100, replace=TRUE)
> X <- data.frame(first.value, second.value)
> split.X <- split(X, second.value)
> normalized.first.value <- sapply(split.X, function(X) {(X$first.value - mean(X$first.value)) / sd(X$first.value)})
> split.X[[1]]
   first.value second.value
4           34            1
8           40            1
24          21            1
31          34            1
37          23            1
40          22            1
> normalized.first.value[[1]]
[1]  0.625  1.375 -1.000  0.625 -0.750 -0.875

What I really want to do is to put the values of normalized.first.value[[1]] into split.X[[1]]$first.value, and the same for the other indices.

This could be achieved with a for loop as follows:

for (i in 1:length(split.X)) {
  split.X[[i]]$first.value <- (split.X[[i]]$first.value - mean(split.X[[i]]$first.value)) / sd(split.X[[i]]$first.value)
}

But for loops are BAD in R, and I'd like to use sapply, lapply, etc. if I can. Unfortunately, when dealing with a list of data frames, sapply and lapply don't seem to iterate in the way I want.

2 Answers

#1


You can use Map, since both lists have the same length. It replaces the 'first.value' column of each data frame in 'split.X' with the corresponding list element of 'normalized.first.value'.

  Map(function(x,y) {x[['first.value']] <- y;x} ,split.X, normalized.first.value)
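
Note that Map returns a new list rather than modifying split.X, so the result has to be assigned back to get the replacement the question asks for. A minimal sketch of the full round trip follows; the set.seed call is an assumption added for reproducibility, and lapply is used in place of sapply so the intermediate result is always a list.

```r
set.seed(42)  # assumption: fixed seed, not in the original post
X <- data.frame(first.value  = sample(100, 100, replace = TRUE),
                second.value = sample(10, 100, replace = TRUE))
split.X <- split(X, X$second.value)

# Normalized values per bin, kept as a list.
normalized.first.value <- lapply(split.X, function(d) {
  (d$first.value - mean(d$first.value)) / sd(d$first.value)
})

# Overwrite split.X with the list of modified data frames that Map returns.
split.X <- Map(function(x, y) { x[["first.value"]] <- y; x },
               split.X, normalized.first.value)

# Map keeps the names of its first list argument, so the bin labels survive.
names(split.X)
```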

Or we can loop over the indices of 'split.X', extract the corresponding elements of 'split.X' and 'normalized.first.value', and then replace the column.

  lapply(seq_along(split.X), function(i) {
             x1 <- split.X[[i]]
             x1[,'first.value'] <- normalized.first.value[[i]]
             x1})
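
One difference from the Map version: iterating over seq_along(split.X) produces an unnamed list, so the bin labels are lost unless they are restored afterwards. The sketch below shows one way to keep them with setNames; the names unnamed and result, and the set.seed call, are my additions for illustration.

```r
set.seed(42)  # assumption: fixed seed, not in the original post
X <- data.frame(first.value  = sample(100, 100, replace = TRUE),
                second.value = sample(10, 100, replace = TRUE))
split.X <- split(X, X$second.value)
normalized.first.value <- lapply(split.X, function(d) {
  (d$first.value - mean(d$first.value)) / sd(d$first.value)
})

# Index-based lapply, as in the answer above.
unnamed <- lapply(seq_along(split.X), function(i) {
  x1 <- split.X[[i]]
  x1[, "first.value"] <- normalized.first.value[[i]]
  x1
})
names(unnamed)  # no names: the input was a plain integer vector

# Restore the original bin labels.
result <- setNames(unnamed, names(split.X))
names(result)
```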

#2


Here's a more arcane way (though I still reckon the for loop is fine in this case):

new.split.X <- mapply(`[<-`, split.X, TRUE, 'first.value', normalized.first.value,
                      SIMPLIFY=FALSE)

How it works: mapply applies `[<-` to each split.X[[i]]. TRUE is the i index to replace (i.e. all rows), 'first.value' is the j index to replace (that column), and normalized.first.value contains the replacement values. TRUE and FALSE are spelled out because T and F are ordinary variables that can be reassigned.
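
To see what the replacement function does on its own, it can help to call it on a single data frame first. In this sketch, df, new.values and out are made-up names for illustration and do not appear in the original post.

```r
# Hypothetical toy data frame standing in for one bin of split.X.
df <- data.frame(first.value = c(34, 40, 21), second.value = c(1, 1, 1))
new.values <- c(0.5, 1.5, -2)

# Calling `[<-` as an ordinary function does the same work as
#   df[TRUE, "first.value"] <- new.values
# but returns the modified data frame instead of assigning it.
out <- `[<-`(df, TRUE, "first.value", new.values)

out$first.value  # the replaced column
df$first.value   # the input data frame keeps its original values
```

This is why the mapply call stores its result in a new object: each element of the returned list is a modified copy of the corresponding data frame.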

The loop may be easier to read in the end though, and probably not slower than tricksy *apply solutions.

library(rbenchmark)
benchmark(
  loop = {
    for (i in 1:length(split.X))
      split.X[[i]]$first.value <- normalized.first.value[[i]]
  },
  mapply = {
    mapply(`[<-`, split.X, TRUE, 'first.value', normalized.first.value,
           SIMPLIFY = FALSE)
  },
  Map = {
    Map(function(x, y) {x[['first.value']] <- y; x}, split.X, normalized.first.value)
  },
  lapply = {
    lapply(seq_along(split.X), function(i) {
      x1 <- split.X[[i]]
      x1[, 'first.value'] <- normalized.first.value[[i]]
      x1
    })
  })
    test replications elapsed relative user.self sys.self user.child sys.child
4 lapply          100   0.034    4.857     0.035        0          0         0
1   loop          100   0.007    1.000     0.007        0          0         0
3    Map          100   0.012    1.714     0.013        0          0         0
2 mapply          100   0.030    4.286     0.032        0          0         0

So the explicit loop is the fastest in this benchmark, and the easiest to read anyway.
