Most efficient way to subset a vector

Time: 2021-09-22 20:13:23

I need to calculate the mean and variance of a subset of a vector. Let x be the vector and y be an indicator for whether the observation is in the subset. Which is more efficient:

sub.mean <- mean(x[y])
sub.var  <-  var(x[y])

or

sub      <- x[y]
sub.mean <- mean(sub)
sub.var  <-  var(sub)
sub      <- NULL

The first approach doesn't create a new object explicitly; but do the calls to mean and var do that implicitly? Or do they work on the original vector as stored?

Is the second faster because it doesn't have to do the subsetting twice?

I'm concerned with speed and with memory management for large data sets.

1 solution

#1


7  

Benchmarking on a vector of length 10M indicates that (on my machine) the latter approach is faster:

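# f1: subset inside each call, so x[y] is evaluated (and copied) twice
f1 <- function(x, y) {
    sub.mean <- mean(x[y])
    sub.var  <- var(x[y])
}

# f2: subset once and reuse the temporary for both statistics
f2 <- function(x, y) {
    sub      <- x[y]
    sub.mean <- mean(sub)
    sub.var  <- var(sub)
    sub      <- NULL
}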

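x <- rnorm(10000000)
# y must be logical: a numeric 0/1 vector would index by position
# (zeros are dropped and every 1 selects x[1]) instead of picking out the subset
y <- rbinom(10000000, 1, .5) == 1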

print(system.time(f1(x, y)))
#   user  system elapsed 
#  0.403   0.037   0.440 
print(system.time(f2(x, y)))
#   user  system elapsed 
#  0.233   0.002   0.235 

This isn't surprising: mean(x[y]) has to create a new object for the mean function to act on, even though that object is never bound to a name in the local environment. Thus, f1 is slower because it does the subsetting twice (as you surmised).
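
If you want the single-subset approach without the manual `sub <- NULL` cleanup, one option is to wrap it in a small function. The sketch below is only an illustration: `subset_stats` and its argument `keep` are hypothetical names, not part of the original answer. It also insists on a logical indicator, because a numeric 0/1 vector would be treated as positions (zeros are dropped and each 1 selects the first element) rather than as a mask.

```r
# Hypothetical helper (not from the original answer): subset once, reuse the result.
subset_stats <- function(x, keep) {
    # Guard against a numeric 0/1 indicator, which would index by position.
    stopifnot(is.logical(keep), length(keep) == length(x))
    # `sub` is local to this call, so it becomes eligible for garbage
    # collection once the function returns; no explicit `sub <- NULL` needed.
    sub <- x[keep]
    c(mean = mean(sub), var = var(sub))
}

set.seed(1)                        # arbitrary seed, only for reproducibility
x <- rnorm(1e6)
y <- rbinom(1e6, 1, 0.5) == 1      # logical indicator

stats <- subset_stats(x, y)
print(stats)

# Positional vs. logical indexing on a tiny example:
x[c(0, 1, 1)]              # two copies of x[1]
x[c(FALSE, TRUE, TRUE)]    # x[2] and x[3]
```

Whether this is measurably faster than f2 on your data is something to check with your own timings; the point is only that the temporary lives no longer than the function call.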
