What are the benefits of the "apply" functions? When are they better to use than "for" loops, and when are they not? [duplicate]

Time: 2022-11-11 18:31:32

Possible Duplicate:
Is R's apply family more than syntactic sugar

Just what the title says. Stupid question, perhaps, but my understanding has been that when using an "apply" function, the iteration is performed in compiled code rather than in the R parser. This would seem to imply that lapply, for instance, is only faster than a "for" loop if there are a great many iterations and each operation is relatively simple. For instance, if a single call to a function wrapped up in lapply takes 10 seconds, and there are only, say, 12 iterations of it, I would imagine that there's virtually no difference at all between using "for" and "lapply".

Now that I think of it, if the function inside the "lapply" has to be parsed anyway, why should there be ANY performance benefit from using "lapply" instead of "for" unless you're doing something that there are compiled functions for (like summing or multiplying, etc)?

Thanks in advance!

Josh

2 Answers

#1


12  

There are several reasons why one might prefer an apply family function over a for loop, or vice-versa.

Firstly, for(), apply() and sapply() will generally be just as quick as each other if used correctly. lapply() does more of its work in compiled code within the R internals than the others, so it can be faster than those functions. It appears the speed advantage is greatest when the act of "looping" over the data is a significant part of the compute time; in many general day-to-day uses you are unlikely to gain much from the inherently quicker lapply(). In the end, these all will be calling R functions, so they need to be interpreted and then run.
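
A rough way to check this on your own machine is to time both forms on a cheap function, where the looping overhead is a large share of the work. This is only a sketch: f, n and the variable names are made up for illustration, and timings vary by machine and R version, so no numbers are shown here.

```r
n <- 1e5
x <- runif(n)
f <- function(v) v * 2          # hypothetical cheap per-element function

# for loop with storage allocated before the loop starts
t_for <- system.time({
  res_for <- vector("list", n)
  for (i in seq_len(n)) {
    res_for[[i]] <- f(x[i])
  }
})

# lapply: the looping itself happens in compiled code
t_lapply <- system.time(res_lapply <- lapply(x, f))

# both approaches should produce the same list
same_result <- identical(res_for, res_lapply)

# elapsed seconds for each approach
c(for_loop = t_for[["elapsed"]], lapply = t_lapply[["elapsed"]])
```

If f were slow (seconds per call), the two timings would be expected to be dominated by f itself rather than by the looping mechanism.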

for() loops can often be easier to implement, especially if you come from a programming background where loops are prevalent. Working in a loop may be more natural than forcing the iterative computation into one of the apply family functions. However, to use for() loops properly, you need to do some extra work to set up storage and manage plugging the output of the loop back together again. The apply functions do this for you automagically. E.g.:

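IN <- runif(10)
# allocate the output storage before the loop starts
OUT <- logical(length = length(IN))
# loop over positions (not values) so OUT[i] lines up with IN[i]
for (i in seq_along(IN)) {
    OUT[i] <- IN[i] > 0.5
}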

That is a silly example, as > is a vectorised operator, but I wanted something to make a point, namely that you have to manage the output. The main thing is that with for() loops you should always allocate sufficient storage to hold the outputs before you start the loop. If you don't know how much storage you will need, allocate a reasonable chunk of storage, check inside the loop whether you have exhausted it, and bolt on another big chunk when you have.
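
The "allocate a chunk, then bolt on another" idea could look something like the following. It is a minimal sketch under made-up assumptions: each iteration yields an unknown number of values (simulated with sample()), and chunk_size, used and new_vals are illustrative names only.

```r
set.seed(42)                     # arbitrary seed, only for reproducibility
chunk_size <- 64L
out  <- numeric(chunk_size)      # initial chunk of storage
used <- 0L                       # number of slots filled so far

for (i in 1:50) {
  # hypothetical step producing between 0 and 5 values
  new_vals <- runif(sample(0:5, 1))
  k <- length(new_vals)

  # storage exhausted? bolt on another chunk before writing
  while (used + k > length(out)) {
    out <- c(out, numeric(chunk_size))
  }

  if (k > 0) {
    out[used + seq_len(k)] <- new_vals
    used <- used + k
  }
}

# trim the unused tail once the loop is finished
out <- out[seq_len(used)]
```

Growing in chunks means the vector is copied only occasionally, rather than on every iteration as it would be with out <- c(out, new_vals).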

The main reason, in my mind, for using one of the apply family of functions is for more elegant, readable code. Rather than managing the output storage and setting up the loop (as shown above) we can let R handle that and succinctly ask R to run a function on subsets of our data. Speed usually does not enter into the decision, for me at least. I use the function that suits the situation best and will result in simple, easy to understand code, because I'm far more likely to waste more time than I save by always choosing the fastest function if I can't remember what the code is doing a day or a week or more later!
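
To make the readability point concrete, here is the same toy computation written three ways. This is an illustrative sketch rather than anything from the original example beyond IN; the names out_for, out_vapply and out_vec are made up.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
IN <- runif(10)

# for loop: we allocate the storage and fill it by index ourselves
out_for <- logical(length(IN))
for (i in seq_along(IN)) {
  out_for[i] <- IN[i] > 0.5
}

# vapply: R allocates and assembles the result;
# FUN.VALUE declares that each call returns one logical
out_vapply <- vapply(IN, function(x) x > 0.5, FUN.VALUE = logical(1))

# vectorised operator: no explicit iteration at all
out_vec <- IN > 0.5

identical(out_for, out_vapply)
identical(out_for, out_vec)
```

The vapply() call states what is being done to each element in one line, with no bookkeeping for the output.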

The apply family lend themselves to scalar or vector operations. A for() loop will often lend itself to doing multiple iterated operations using the same index i. For example, I have written code that uses for() loops to do k-fold or bootstrap cross-validation on objects. I probably would never entertain doing that with one of the apply family as each CV iteration needs multiple operations, access to lots of objects in the current frame, and fills in several output objects that hold the output of the iterations.
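
As an illustration of that kind of loop (not the code referred to above), here is a small k-fold cross-validation sketch on the built-in mtcars data. The model mpg ~ wt, the choice of k and all object names are assumptions made purely for the example.

```r
set.seed(123)                    # arbitrary seed, only for reproducibility
k <- 5
n <- nrow(mtcars)
# randomly assign each row to one of k folds
fold <- sample(rep(seq_len(k), length.out = n))

# several output objects, all allocated before the loop
rmse  <- numeric(k)              # per-fold prediction error
slope <- numeric(k)              # per-fold estimate of the wt coefficient
preds <- numeric(n)              # out-of-fold prediction for every row

for (i in seq_len(k)) {
  test <- fold == i
  fit  <- lm(mpg ~ wt, data = mtcars[!test, ])      # fit on the other folds
  p    <- predict(fit, newdata = mtcars[test, ])    # predict the held-out fold

  preds[test] <- p
  rmse[i]     <- sqrt(mean((mtcars$mpg[test] - p)^2))
  slope[i]    <- coef(fit)[["wt"]]
}

mean(rmse)                       # overall cross-validated error
```

Each iteration does several things and writes into three different objects using the same index i, which is awkward to express as a single function passed to lapply().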

As to the last point, about why lapply() can possibly be faster than for() or apply(): you need to realise that the "loop" can be performed in interpreted R code or in compiled code. Yes, both will still be calling R functions that need to be interpreted, but if the looping and calling are done directly from compiled C code (as in lapply()), that is where the performance gain can come from over, say, apply(), which boils down to a for() loop in actual R code. See the source for apply() to see that it is a wrapper around a for() loop, and then look at the code for lapply(), which is:

> lapply
function (X, FUN, ...) 
{
    FUN <- match.fun(FUN)
    if (!is.vector(X) || is.object(X)) 
        X <- as.list(X)
    .Internal(lapply(X, FUN))
}
<environment: namespace:base>

and you should see why there can be a difference in speed between lapply() and for() and the other apply family functions. .Internal() is one of R's ways of calling compiled C code used by R itself. Apart from a manipulation of X, and a sanity check on FUN, the entire computation is done in C, calling the R function FUN. Compare that with the source for apply().
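
If you would rather check this programmatically than read the source, something like the following sketch should work. It assumes the function bodies look as they do in the R versions I am aware of (they could change in future releases), and loops_in_r is a made-up helper name.

```r
# Does the R-level body of a function contain a for() loop?
loops_in_r <- function(fn) "for" %in% all.names(body(fn))

apply_uses_for  <- loops_in_r(apply)    # apply() loops in R code
lapply_uses_for <- loops_in_r(lapply)   # lapply() has no R-level loop

# lapply() hands the iteration to compiled code via .Internal()
lapply_internal <- ".Internal" %in% all.names(body(lapply))

c(apply_uses_for  = apply_uses_for,
  lapply_uses_for = lapply_uses_for,
  lapply_internal = lapply_internal)
```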

#2


3  

From Burns' R Inferno (pdf), p25:

Use an explicit for loop when each iteration is a non-trivial task. But a simple loop can be more clearly and compactly expressed using an apply function. There is at least one exception to this rule ... if the result will be a list and some of the components can be NULL, then a for loop is trouble (big trouble) and lapply gives the expected answer.
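
The NULL exception is easy to trip over, so here is a small sketch of it. The function maybe_square is made up for illustration; it returns NULL for one input to stand in for any computation that sometimes yields NULL.

```r
# hypothetical function that returns NULL for one of its inputs
maybe_square <- function(i) if (i == 3) NULL else i^2

# for loop: assigning NULL with [[<- deletes the element instead of storing NULL
out_for <- vector("list", 3)
for (i in 1:3) {
  out_for[[i]] <- maybe_square(i)
}
length(out_for)        # 2: the third component has been dropped

# lapply keeps the NULL component in place
out_lapply <- lapply(1:3, maybe_square)
length(out_lapply)     # 3

# a for loop can be made safe by assigning a one-element list with [<-
out_safe <- vector("list", 3)
for (i in 1:3) {
  out_safe[i] <- list(maybe_square(i))
}
length(out_safe)       # 3
```

Note that whether the for loop visibly goes wrong depends on where the NULL falls; here it is the last iteration, so the result ends up one element short.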
