Grouping functions (tapply, by, aggregate) and the *apply family

Time: 2022-01-08 22:41:59

Whenever I want to do something "map"py in R, I usually try to use a function in the apply family.

However, I've never quite understood the differences between them -- how {sapply, lapply, etc.} apply the function to the input/grouped input, what the output will look like, or even what the input can be -- so I often just go through them all until I get what I want.

Can someone explain how to use which one when?

My current (probably incorrect/incomplete) understanding is...

  1. sapply(vec, f): input is a vector. output is a vector/matrix, where element i is f(vec[i]), giving you a matrix if f has a multi-element output

  2. lapply(vec, f): same as sapply, but output is a list?

  3. apply(matrix, 1/2, f): input is a matrix. output is a vector, where element i is f(row/col i of the matrix)
  4. tapply(vector, grouping, f): output is a matrix/array, where an element in the matrix/array is the value of f at a grouping g of the vector, and g gets pushed to the row/col names
  5. by(dataframe, grouping, f): let g be a grouping. apply f to each column of the group/dataframe. pretty print the grouping and the value of f at each column.
  6. aggregate(matrix, grouping, f): similar to by, but instead of pretty printing the output, aggregate sticks everything into a dataframe.

Side question: I still haven't learned plyr or reshape -- would plyr or reshape replace all of these entirely?

9 answers

#1


1172  

R has many *apply functions which are ably described in the help files (e.g. ?apply). There are enough of them, though, that beginning useRs may have difficulty deciding which one is appropriate for their situation or even remembering them all. They may have a general sense that "I should be using an *apply function here", but it can be tough to keep them all straight at first.

Despite the fact (noted in other answers) that much of the functionality of the *apply family is covered by the extremely popular plyr package, the base functions remain useful and worth knowing.

This answer is intended to act as a sort of signpost for new useRs to help direct them to the correct *apply function for their particular problem. Note, this is not intended to simply regurgitate or replace the R documentation! The hope is that this answer helps you to decide which *apply function suits your situation and then it is up to you to research it further. With one exception, performance differences will not be addressed.

  • apply - When you want to apply a function to the rows or columns of a matrix (and higher-dimensional analogues); not generally advisable for data frames as it will coerce to a matrix first.

    # Two dimensional matrix
    M <- matrix(seq(1,16), 4, 4)
    
    # apply min to rows
    apply(M, 1, min)
    [1] 1 2 3 4
    
    # apply max to columns
    apply(M, 2, max)
    [1]  4  8 12 16
    
    # 3 dimensional array
    M <- array( seq(32), dim = c(4,4,2))
    
    # Apply sum across each M[*, , ] - i.e Sum across 2nd and 3rd dimension
    apply(M, 1, sum)
    # Result is one-dimensional
    [1] 120 128 136 144
    
    # Apply sum across each M[*, *, ] - i.e Sum across 3rd dimension
    apply(M, c(1,2), sum)
    # Result is two-dimensional
         [,1] [,2] [,3] [,4]
    [1,]   18   26   34   42
    [2,]   20   28   36   44
    [3,]   22   30   38   46
    [4,]   24   32   40   48
    

    If you want row/column means or sums for a 2D matrix, be sure to investigate the highly optimized, lightning-quick colMeans, rowMeans, colSums, rowSums.
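
As a rough illustration of the two points above (the dedicated helpers give the same numbers as apply, and apply coerces a data frame to a matrix first), here is a small sketch. The data frame df is a made-up example, not something from the original answer.

```r
M <- matrix(seq(1, 16), 4, 4)

# apply() and the dedicated helpers agree on a numeric matrix
col_means_apply <- apply(M, 2, mean)
row_sums_apply  <- apply(M, 1, sum)
print(all.equal(col_means_apply, colMeans(M)))
print(all.equal(row_sums_apply, rowSums(M)))

# Hypothetical mixed-type data frame: apply() converts it with as.matrix(),
# so each row reaches the function as a character vector
df <- data.frame(n = 1:2, s = c("a", "b"))
row_classes <- apply(df, 1, function(row) class(row))
print(row_classes)
```

The second part is why apply is usually a poor fit for data frames with mixed column types.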

  • lapply - When you want to apply a function to each element of a list in turn and get a list back.

    This is the workhorse of many of the other *apply functions. Peel back their code and you will often find lapply underneath.

    x <- list(a = 1, b = 1:3, c = 10:100) 
    lapply(x, FUN = length) 
    $a 
    [1] 1
    $b 
    [1] 3
    $c 
    [1] 91
    lapply(x, FUN = sum) 
    $a 
    [1] 1
    $b 
    [1] 6
    $c 
    [1] 5005
    
  • sapply - When you want to apply a function to each element of a list in turn, but you want a vector back, rather than a list.

    If you find yourself typing unlist(lapply(...)), stop and consider sapply.

    x <- list(a = 1, b = 1:3, c = 10:100)
    # Compare with above; a named vector, not a list 
    sapply(x, FUN = length)  
    a  b  c   
    1  3 91
    
    sapply(x, FUN = sum)   
    a    b    c    
    1    6 5005 
    

    In more advanced uses of sapply it will attempt to coerce the result to a multi-dimensional array, if appropriate. For example, if our function returns vectors of the same length, sapply will use them as columns of a matrix:

    sapply(1:5,function(x) rnorm(3,x))
    

    If our function returns a 2 dimensional matrix, sapply will do essentially the same thing, treating each returned matrix as a single long vector:

    sapply(1:5,function(x) matrix(x,2,2))
    

    Unless we specify simplify = "array", in which case it will use the individual matrices to build a multi-dimensional array:

    sapply(1:5,function(x) matrix(x,2,2), simplify = "array")
    

    Each of these behaviors is of course contingent on our function returning vectors or matrices of the same length or dimension.
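
To make that last condition concrete, here is a minimal sketch (the anonymous functions are only illustrative) of what sapply returns when the pieces do and do not have a common length:

```r
# Equal-length results are simplified into a matrix, one column per input
same_len <- sapply(1:3, function(i) c(i, i^2))
print(dim(same_len))

# Results of different lengths cannot be simplified, so a plain list comes back
ragged <- sapply(1:3, function(i) seq_len(i))
print(is.list(ragged))
```

So the type of the result depends on the data, which is worth keeping in mind when sapply is used inside other functions.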

  • vapply - When you want to use sapply but perhaps need to squeeze some more speed out of your code.

    For vapply, you basically give R an example of what sort of thing your function will return, which can save some time coercing returned values to fit in a single atomic vector.

    x <- list(a = 1, b = 1:3, c = 10:100)
    #Note that since the advantage here is mainly speed, this
    # example is only for illustration. We're telling R that
    # everything returned by length() should be an integer of 
    # length 1. 
    vapply(x, FUN = length, FUN.VALUE = 0L) 
    a  b  c  
    1  3 91
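
Besides speed, the declared FUN.VALUE also acts as a check on the result. The following sketch is not from the original answer; it only illustrates that behaviour on an empty input and on a function whose result has the wrong length.

```r
x <- list(a = 1, b = 1:3, c = 10:100)

# On empty input sapply returns an empty list, vapply returns the declared type
empty_s <- sapply(list(), length)
empty_v <- vapply(list(), length, FUN.VALUE = integer(1))

# range() returns two values, which does not match FUN.VALUE = numeric(1),
# so vapply stops with an error instead of silently changing shape
res <- tryCatch(vapply(x, range, FUN.VALUE = numeric(1)),
                error = function(e) "error")
print(res)
```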
    
  • mapply - For when you have several data structures (e.g. vectors, lists) and you want to apply a function to the 1st elements of each, and then the 2nd elements of each, etc., coercing the result to a vector/array as in sapply.

    This is multivariate in the sense that your function must accept multiple arguments.

    #Sums the 1st elements, the 2nd elements, etc. 
    mapply(sum, 1:5, 1:5, 1:5) 
    [1]  3  6  9 12 15
    #To do rep(1,4), rep(2,3), etc.
    mapply(rep, 1:4, 4:1)   
    [[1]]
    [1] 1 1 1 1
    
    [[2]]
    [1] 2 2 2
    
    [[3]]
    [1] 3 3
    
    [[4]]
    [1] 4
    
  • Map - A wrapper to mapply with SIMPLIFY = FALSE, so it is guaranteed to return a list.

    Map(sum, 1:5, 1:5, 1:5)
    [[1]]
    [1] 3
    
    [[2]]
    [1] 6
    
    [[3]]
    [1] 9
    
    [[4]]
    [1] 12
    
    [[5]]
    [1] 15
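
A small side-by-side sketch of Map and mapply on the same inputs may help; the vector widths and the adding function are made up for illustration.

```r
widths <- c(a = 1, b = 2)

# Map always returns a list; names are taken from the first vector
as_list <- Map(function(x, y) x + y, widths, c(10, 20))

# mapply simplifies the same result to a named numeric vector
as_vec <- mapply(function(x, y) x + y, widths, c(10, 20))

# Arguments that should not be iterated over go into MoreArgs
reps <- mapply(rep, times = 1:3, MoreArgs = list(x = "z"))
print(reps)
```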
    
  • rapply - For when you want to apply a function to each element of a nested list structure, recursively.

    To give you some idea of how uncommon rapply is, I forgot about it when first posting this answer! Obviously, I'm sure many people use it, but YMMV. rapply is best illustrated with a user-defined function to apply:

    # Append ! to string, otherwise increment
    myFun <- function(x){
        if(is.character(x)){
          return(paste(x,"!",sep=""))
        }
        else{
          return(x + 1)
        }
    }
    
    #A nested list structure
    l <- list(a = list(a1 = "Boo", b1 = 2, c1 = "Eeek"), 
              b = 3, c = "Yikes", 
              d = list(a2 = 1, b2 = list(a3 = "Hey", b3 = 5)))
    
    
    # Result is named vector, coerced to character          
    rapply(l, myFun)
    
    # Result is a nested list like l, with values altered
    rapply(l, myFun, how="replace")
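
As an alternative to branching on the type inside the function, rapply can restrict itself to leaves of a given class through its classes argument. This is a sketch on a made-up list called nested, not part of the original answer.

```r
nested <- list(a = 1, b = "text", c = list(d = 2, e = "more"))

# Only numeric leaves are passed to the function; with how = "replace"
# the other leaves are kept unchanged and the nesting is preserved
scaled <- rapply(nested, function(x) x * 10, classes = "numeric", how = "replace")

# With the default how = "unlist", non-matching leaves are dropped
# and the result is flattened into a named vector
flat <- rapply(nested, function(x) x * 10, classes = "numeric")
print(flat)
```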
    
  • tapply - For when you want to apply a function to subsets of a vector and the subsets are defined by some other vector, usually a factor.

    The black sheep of the *apply family, of sorts. The help file's use of the phrase "ragged array" can be a bit confusing, but it is actually quite simple.

    A vector:

    x <- 1:20
    

    A factor (of the same length!) defining groups:

    y <- factor(rep(letters[1:5], each = 4))
    

    Add up the values in x within each subgroup defined by y:

    tapply(x, y, sum)  
     a  b  c  d  e  
    10 26 42 58 74 
    

    More complex examples can be handled where the subgroups are defined by the unique combinations of a list of several factors. tapply is similar in spirit to the split-apply-combine functions that are common in R (aggregate, by, ave, ddply, etc.) Hence its black sheep status.
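
A sketch of the several-factors case mentioned above: the second factor g2 is invented here purely for illustration, and the result is a matrix with one cell per combination of levels.

```r
x  <- 1:20
g1 <- factor(rep(letters[1:5], each = 4))
g2 <- factor(rep(c("odd", "even"), times = 10))

# One cell per combination of the two factors
sums <- tapply(x, list(g1, g2), sum)
print(sums)
```

Summing each row of sums gives back the single-factor result shown earlier.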

#2


167  

On the side note, here is how the various plyr functions correspond to the base *apply functions (from the intro to plyr document from the plyr webpage http://had.co.nz/plyr/)

Base function   Input   Output   plyr function 
---------------------------------------
aggregate        d       d       ddply + colwise 
apply            a       a/l     aaply / alply 
by               d       l       dlply 
lapply           l       l       llply  
mapply           a       a/l     maply / mlply 
replicate        r       a/l     raply / rlply 
sapply           l       a       laply 

One of the goals of plyr is to provide consistent naming conventions for each of the functions, encoding the input and output data types in the function name. It also provides consistency in output, in that output from dlply() is easily passable to ldply() to produce useful output, etc.

Conceptually, learning plyr is no more difficult than understanding the base *apply functions.

plyr and reshape functions have replaced almost all of these functions in my every day use. But, also from the Intro to Plyr document:

Related functions tapply and sweep have no corresponding function in plyr, and remain useful. merge is useful for combining summaries with the original data.

#3


116  

From slide 21 of http://www.slideshare.net/hadley/plyr-one-data-analytic-strategy:

[Image: slide 21 of the linked presentation, not reproduced here]

(Hopefully it's clear that apply corresponds to @Hadley's aaply and aggregate corresponds to @Hadley's ddply etc. Slide 20 of the same slideshare will clarify if you don't get it from this image.)

(on the left is input, on the top is output)

#4


84  

First start with Joran's excellent answer -- doubtful anything can better that.

Then the following mnemonics may help to remember the distinctions between each. Whilst some are obvious, others may be less so --- for these you'll find justification in Joran's discussions.

Mnemonics

  • lapply is a list apply which acts on a list or vector and returns a list.
  • sapply is a simple lapply (function defaults to returning a vector or matrix when possible)
  • vapply is a verified apply (allows the return object type to be prespecified)
  • rapply is a recursive apply for nested lists, i.e. lists within lists
  • tapply is a tagged apply where the tags identify the subsets
  • apply is generic: applies a function to a matrix's rows or columns (or, more generally, to dimensions of an array)

Building the Right Background

If using the apply family still feels a bit alien to you, then it might be that you're missing a key point of view.

These two articles can help. They provide the necessary background to motivate the functional programming techniques that are being provided by the apply family of functions.

Users of Lisp will recognise the paradigm immediately. If you're not familiar with Lisp, once you get your head around FP, you'll have gained a powerful point of view for use in R -- and apply will make a lot more sense.

#5


34  

I noticed that the (very excellent) answers in this post lack explanations of by and aggregate, so here is my contribution.

BY

The by function, as stated in the documentation, can be thought of as a "wrapper" for tapply. The power of by arises when we want to compute a task that tapply can't handle. One example is this code:

ct <- tapply(iris$Sepal.Width , iris$Species , summary )
cb <- by(iris$Sepal.Width , iris$Species , summary )

 cb
iris$Species: setosa
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.300   3.200   3.400   3.428   3.675   4.400 
-------------------------------------------------------------- 
iris$Species: versicolor
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.000   2.525   2.800   2.770   3.000   3.400 
-------------------------------------------------------------- 
iris$Species: virginica
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.200   2.800   3.000   2.974   3.175   3.800 


ct
$setosa
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.300   3.200   3.400   3.428   3.675   4.400 

$versicolor
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.000   2.525   2.800   2.770   3.000   3.400 

$virginica
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.200   2.800   3.000   2.974   3.175   3.800 

If we print these two objects, ct and cb, we "essentially" have the same results; the only differences are in how they are shown and in their class attributes: by for cb and array for ct.
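
A quick way to check this claim is to compare the classes and the stored numbers directly. This is only a sketch built on the two calls above:

```r
ct <- tapply(iris$Sepal.Width, iris$Species, summary)
cb <- by(iris$Sepal.Width, iris$Species, summary)

# Different classes, hence different print methods
print(class(ct))
print(class(cb))

# The per-group numbers themselves are the same (first group: setosa)
same_numbers <- all.equal(as.numeric(ct[[1]]), as.numeric(cb[[1]]))
print(same_numbers)
```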

As I've said, the power of by arises when we can't use tapply; the following code is one example:

 tapply(iris, iris$Species, summary )
Error in tapply(iris, iris$Species, summary) : 
  arguments must have same length

R says that the arguments must have the same length. What we want is "the summary of every variable in iris along the factor Species", but tapply just can't do that because it does not know how to handle a data frame as its first argument.

With the by function, R dispatches a specific method for the data frame class and lets the summary function work even though the length of the first argument (and its type too) is different.

bywork <- by(iris, iris$Species, summary )

bywork
iris$Species: setosa
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
 Min.   :4.300   Min.   :2.300   Min.   :1.000   Min.   :0.100   setosa    :50  
 1st Qu.:4.800   1st Qu.:3.200   1st Qu.:1.400   1st Qu.:0.200   versicolor: 0  
 Median :5.000   Median :3.400   Median :1.500   Median :0.200   virginica : 0  
 Mean   :5.006   Mean   :3.428   Mean   :1.462   Mean   :0.246                  
 3rd Qu.:5.200   3rd Qu.:3.675   3rd Qu.:1.575   3rd Qu.:0.300                  
 Max.   :5.800   Max.   :4.400   Max.   :1.900   Max.   :0.600                  
-------------------------------------------------------------- 
iris$Species: versicolor
  Sepal.Length    Sepal.Width     Petal.Length   Petal.Width          Species  
 Min.   :4.900   Min.   :2.000   Min.   :3.00   Min.   :1.000   setosa    : 0  
 1st Qu.:5.600   1st Qu.:2.525   1st Qu.:4.00   1st Qu.:1.200   versicolor:50  
 Median :5.900   Median :2.800   Median :4.35   Median :1.300   virginica : 0  
 Mean   :5.936   Mean   :2.770   Mean   :4.26   Mean   :1.326                  
 3rd Qu.:6.300   3rd Qu.:3.000   3rd Qu.:4.60   3rd Qu.:1.500                  
 Max.   :7.000   Max.   :3.400   Max.   :5.10   Max.   :1.800                  
-------------------------------------------------------------- 
iris$Species: virginica
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
 Min.   :4.900   Min.   :2.200   Min.   :4.500   Min.   :1.400   setosa    : 0  
 1st Qu.:6.225   1st Qu.:2.800   1st Qu.:5.100   1st Qu.:1.800   versicolor: 0  
 Median :6.500   Median :3.000   Median :5.550   Median :2.000   virginica :50  
 Mean   :6.588   Mean   :2.974   Mean   :5.552   Mean   :2.026                  
 3rd Qu.:6.900   3rd Qu.:3.175   3rd Qu.:5.875   3rd Qu.:2.300                  
 Max.   :7.900   Max.   :3.800   Max.   :6.900   Max.   :2.500     

It works indeed, and the result is very surprising. It is an object of class by that, along Species (that is, for each of its levels), computes the summary of each variable.

Note that if the first argument is a data frame, the dispatched function must have a method for that class of objects. For example, if we use this code with the mean function, we get a result that makes no sense at all:

 by(iris, iris$Species, mean)
iris$Species: setosa
[1] NA
------------------------------------------- 
iris$Species: versicolor
[1] NA
------------------------------------------- 
iris$Species: virginica
[1] NA
Warning messages:
1: In mean.default(data[x, , drop = FALSE], ...) :
  argument is not numeric or logical: returning NA
2: In mean.default(data[x, , drop = FALSE], ...) :
  argument is not numeric or logical: returning NA
3: In mean.default(data[x, , drop = FALSE], ...) :
  argument is not numeric or logical: returning NA
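
One possible workaround, offered here as a sketch rather than as the author's approach, is to pass a function that does accept a data frame, such as colMeans on the numeric columns:

```r
# colMeans works on a data frame of numeric columns, so by can use it per group
col_means <- by(iris[, 1:4], iris$Species, colMeans)

# First group (setosa): a named vector with one mean per column
print(col_means[[1]])
```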

AGGREGATE

aggregate can be seen as a different way of using tapply if we use it in the following way.

at <- tapply(iris$Sepal.Length , iris$Species , mean)
ag <- aggregate(iris$Sepal.Length , list(iris$Species), mean)

 at
    setosa versicolor  virginica 
     5.006      5.936      6.588 
 ag
     Group.1     x
1     setosa 5.006
2 versicolor 5.936
3  virginica 6.588

The two immediate differences are that the second argument of aggregate must be a list, while for tapply it can be (but need not be) a list, and that the output of aggregate is a data frame while that of tapply is an array.

The power of aggregate is that it can easily handle subsets of the data with the subset argument, and that it has methods for ts objects and formulas as well.

These elements make aggregate easier to work with than tapply in some situations. Here are some examples (available in the documentation):

ag <- aggregate(len ~ ., data = ToothGrowth, mean)

 ag
  supp dose   len
1   OJ  0.5 13.23
2   VC  0.5  7.98
3   OJ  1.0 22.70
4   VC  1.0 16.77
5   OJ  2.0 26.06
6   VC  2.0 26.14

We can achieve the same with tapply but the syntax is slightly harder and the output (in some circumstances) less readable:

att <- tapply(ToothGrowth$len, list(ToothGrowth$dose, ToothGrowth$supp), mean)

 att
       OJ    VC
0.5 13.23  7.98
1   22.70 16.77
2   26.06 26.14

There are other times when we can't use by or tapply and we have to use aggregate.

 ag1 <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, mean)

 ag1
  Month    Ozone     Temp
1     5 23.61538 66.73077
2     6 29.44444 78.22222
3     7 59.11538 83.88462
4     8 59.96154 83.96154
5     9 31.44828 76.89655

We cannot obtain the previous result with tapply in one call; we have to calculate the mean along Month for each variable and then combine them. Note that we have to pass na.rm = TRUE, because the formula method of aggregate has na.action = na.omit by default. The Temp means still differ from the aggregate output, because na.omit drops every row with an NA in either variable, whereas na.rm = TRUE only drops the missing values of the variable being averaged:

ta1 <- tapply(airquality$Ozone, airquality$Month, mean, na.rm = TRUE)
ta2 <- tapply(airquality$Temp, airquality$Month, mean, na.rm = TRUE)

 cbind(ta1, ta2)
       ta1      ta2
5 23.61538 65.54839
6 29.44444 79.10000
7 59.11538 83.90323
8 59.96154 83.96774
9 31.44828 76.90000

With by we just can't achieve that; in fact, the following function call fails (but most likely that is related to the supplied function, mean):

by(airquality[c("Ozone", "Temp")], airquality$Month, mean, na.rm = TRUE)
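
Returning to the aggregate/tapply comparison: the two sets of Temp means can be reconciled in either direction. The sketch below assumes the difference comes only from the NA handling (rows dropped by na.omit versus values dropped by na.rm); the object names are illustrative.

```r
# Direction 1: restrict tapply to the rows that aggregate's na.omit keeps
keep <- complete.cases(airquality[c("Ozone", "Temp")])
aq   <- airquality[keep, ]
ta   <- cbind(Ozone = tapply(aq$Ozone, aq$Month, mean),
              Temp  = tapply(aq$Temp,  aq$Month, mean))
ag1  <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, mean)

# Direction 2: keep all rows in aggregate and drop NAs per column instead
ag2  <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, mean,
                  na.rm = TRUE, na.action = na.pass)
print(ag2)
```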

Other times the results are the same and the differences are just in the class of the object (and therefore in how it is shown/printed and, for example, how to subset it):

byagg <- by(airquality[c("Ozone", "Temp")], airquality$Month, summary)
aggagg <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, summary)

The previous two calls achieve the same goal and results; at some point, which tool to use is just a matter of personal taste and needs. The two resulting objects have very different needs in terms of subsetting.
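
To give a feel for the difference in subsetting, here is a sketch of pulling the May figures out of each object. The variable names may_by and may_ag are made up for this illustration.

```r
byagg  <- by(airquality[c("Ozone", "Temp")], airquality$Month, summary)
aggagg <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, summary)

# by: one element per month, each holding the summary table of that group
may_by <- byagg[[1]]

# aggregate: one row per month; Ozone and Temp are matrix columns,
# indexed by row (month) and by the name of the statistic
may_ag <- aggagg$Ozone[aggagg$Month == 5, "Mean"]
print(may_ag)
```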

#6


27  

There are lots of great answers which discuss the differences in use cases for each function. None of the answers discuss the differences in performance. That is reasonable, because the various functions expect different input and produce different output, yet most of them share the general objective of evaluating by series/groups. My answer is going to focus on performance. Because of the above, the creation of the input from the vectors is included in the timing, and the apply function is not measured.

I have tested two different functions, sum and length, at once. Volume tested is 50M on input (n = 5e7) and 500K on output (k = 5e5). I have also included two currently popular packages which were not widely used at the time the question was asked, data.table and dplyr. Both are definitely worth a look if you are aiming for good performance.

library(dplyr)
library(data.table)
set.seed(123)
n = 5e7
k = 5e5
x = runif(n)
grp = sample(k, n, TRUE)

timing = list()

# sapply
timing[["sapply"]] = system.time({
    lt = split(x, grp)
    r.sapply = sapply(lt, function(x) list(sum(x), length(x)), simplify = FALSE)
})

# lapply
timing[["lapply"]] = system.time({
    lt = split(x, grp)
    r.lapply = lapply(lt, function(x) list(sum(x), length(x)))
})

# tapply
timing[["tapply"]] = system.time(
    r.tapply <- tapply(x, list(grp), function(x) list(sum(x), length(x)))
)

# by
timing[["by"]] = system.time(
    r.by <- by(x, list(grp), function(x) list(sum(x), length(x)), simplify = FALSE)
)

# aggregate
timing[["aggregate"]] = system.time(
    r.aggregate <- aggregate(x, list(grp), function(x) list(sum(x), length(x)), simplify = FALSE)
)

# dplyr
timing[["dplyr"]] = system.time({
    df = data_frame(x, grp)
    r.dplyr = summarise(group_by(df, grp), sum(x), n())
})

# data.table
timing[["data.table"]] = system.time({
    dt = setnames(setDT(list(x, grp)), c("x","grp"))
    r.data.table = dt[, .(sum(x), .N), grp]
})

# all output size match to group count
sapply(list(sapply=r.sapply, lapply=r.lapply, tapply=r.tapply, by=r.by, aggregate=r.aggregate, dplyr=r.dplyr, data.table=r.data.table), 
       function(x) (if(is.data.frame(x)) nrow else length)(x)==k)
#    sapply     lapply     tapply         by  aggregate      dplyr data.table 
#      TRUE       TRUE       TRUE       TRUE       TRUE       TRUE       TRUE 

# print timings
as.data.table(sapply(timing, `[[`, "elapsed"), keep.rownames = TRUE
              )[,.(fun = V1, elapsed = V2)
                ][order(-elapsed)]
#          fun elapsed
#1:  aggregate 109.139
#2:         by  25.738
#3:      dplyr  18.978
#4:     tapply  17.006
#5:     lapply  11.524
#6:     sapply  11.326
#7: data.table   2.686

#7


19  

It is maybe worth mentioning ave. ave is tapply's friendly cousin. It returns results in a form that you can plug straight back into your data frame.

dfr <- data.frame(a=1:20, f=rep(LETTERS[1:5], each=4))
means <- tapply(dfr$a, dfr$f, mean)
##  A    B    C    D    E 
## 2.5  6.5 10.5 14.5 18.5 

## great, but putting it back in the data frame is another line:

dfr$m <- means[dfr$f]

dfr$m2 <- ave(dfr$a, dfr$f, FUN=mean) # NB argument name FUN is needed!
dfr
##   a f    m   m2
##   1 A  2.5  2.5
##   2 A  2.5  2.5
##   3 A  2.5  2.5
##   4 A  2.5  2.5
##   5 B  6.5  6.5
##   6 B  6.5  6.5
##   7 B  6.5  6.5
##   ...

There is nothing in the base package that works like ave for whole data frames (as by is like tapply for data frames). But you can fudge it:

dfr$foo <- ave(1:nrow(dfr), dfr$f, FUN=function(x) {
    x <- dfr[x,]
    sum(x$m*x$m2)
})
dfr
##     a f    m   m2    foo
## 1   1 A  2.5  2.5    25
## 2   2 A  2.5  2.5    25
## 3   3 A  2.5  2.5    25
## ...
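
ave is not limited to summaries: any function that returns one value, or one value per element, can be used within groups. A minimal sketch with made-up vectors:

```r
values <- c(1, 2, 3, 4)
groups <- c("a", "a", "b", "b")

# Running total within each group; the result has the same length as values
running <- ave(values, groups, FUN = cumsum)

# FUN defaults to mean, so this centers each value on its group mean
centered <- values - ave(values, groups)
print(cbind(values, running, centered))
```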

#8


19  

Despite all the great answers here, there are 2 more base functions that deserve to be mentioned, the useful outer function and the obscure eapply function

outer

outer is a very useful function hidden as a more mundane one. If you read the help for outer its description says:

The outer product of the arrays X and Y is the array A with dimension  
c(dim(X), dim(Y)) where element A[c(arrayindex.x, arrayindex.y)] =   
FUN(X[arrayindex.x], Y[arrayindex.y], ...).

which makes it seem like this is only useful for linear algebra type things. However, it can be used much like mapply to apply a function to two vectors of inputs. The difference is that mapply will apply the function to the first two elements and then the second two etc, whereas outer will apply the function to every combination of one element from the first vector and one from the second. For example:

 A<-c(1,3,5,7,9)
 B<-c(0,3,6,9,12)

mapply(FUN=pmax, A, B)
[1]  1  3  6  9 12

outer(A,B, pmax)
      [,1] [,2] [,3] [,4] [,5]
 [1,]    1    3    6    9   12
 [2,]    3    3    6    9   12
 [3,]    5    5    6    9   12
 [4,]    7    7    7    9   12
 [5,]    9    9    9    9   12

I have personally used this when I have a vector of values and a vector of conditions and wish to see which values meet which conditions.
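
That values-against-conditions use might look like the following sketch; the vectors values and thresholds are invented for illustration. Note that outer calls the function once on the fully expanded vectors, so the function has to be vectorized (as ">" is).

```r
values     <- c(2, 5, 11)
thresholds <- c(small = 3, large = 10)

# One row per value, one column per threshold
exceeds <- outer(values, thresholds, ">")
print(exceeds)
```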

eapply

eapply is like lapply except that rather than applying a function to every element in a list, it applies a function to every element in an environment. For example if you want to find a list of user defined functions in the global environment:

A<-c(1,3,5,7,9)
B<-c(0,3,6,9,12)
C<-list(x=1, y=2)
D<-function(x){x+1}

> eapply(.GlobalEnv, is.function)
$A
[1] FALSE

$B
[1] FALSE

$C
[1] FALSE

$D
[1] TRUE 

Frankly I don't use this very much but if you are building a lot of packages or create a lot of environments it may come in handy.
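
The same idea works on any environment, not just the global one. This sketch uses a fresh environment e (a made-up example) so that it does not depend on what is in your workspace; the order in which eapply visits the objects is not guaranteed, hence the sort.

```r
e <- new.env()
assign("A", c(1, 3, 5), envir = e)
assign("D", function(x) x + 1, envir = e)

# One TRUE/FALSE per object in the environment
is_fun <- unlist(eapply(e, is.function))

# Names of the objects that are functions
fun_names <- sort(names(is_fun)[is_fun])
print(fun_names)
```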

#9


4  

I recently discovered the rather useful sweep function and add it here for the sake of completeness:

sweep

The basic idea is to sweep through an array row- or column-wise and return a modified array. An example will make this clear (source: datacamp):

Let's say you have a matrix and want to standardize it column-wise:

dataPoints <- matrix(4:15, nrow = 4)

# Find means per column with `apply()`
dataPoints_means <- apply(dataPoints, 2, mean)

# Find standard deviation with `apply()`
dataPoints_sdev <- apply(dataPoints, 2, sd)

# Center the points 
dataPoints_Trans1 <- sweep(dataPoints, 2, dataPoints_means,"-")
print(dataPoints_Trans1)
##      [,1] [,2] [,3]
## [1,] -1.5 -1.5 -1.5
## [2,] -0.5 -0.5 -0.5
## [3,]  0.5  0.5  0.5
## [4,]  1.5  1.5  1.5
# Return the result
dataPoints_Trans1
##      [,1] [,2] [,3]
## [1,] -1.5 -1.5 -1.5
## [2,] -0.5 -0.5 -0.5
## [3,]  0.5  0.5  0.5
## [4,]  1.5  1.5  1.5
# Normalize
dataPoints_Trans2 <- sweep(dataPoints_Trans1, 2, dataPoints_sdev, "/")

# Return the result
dataPoints_Trans2
##            [,1]       [,2]       [,3]
## [1,] -1.1618950 -1.1618950 -1.1618950
## [2,] -0.3872983 -0.3872983 -0.3872983
## [3,]  0.3872983  0.3872983  0.3872983
## [4,]  1.1618950  1.1618950  1.1618950

NB: for this simple example the same result can of course be achieved more easily by
apply(dataPoints, 2, scale)
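
As a sanity check on that remark, the two sweep steps can be compared against scale. This is a sketch that ignores the extra attributes scale attaches to its result.

```r
dataPoints <- matrix(4:15, nrow = 4)

# Center and then normalize column-wise with two sweep() calls
centered <- sweep(dataPoints, 2, colMeans(dataPoints), "-")
by_sweep <- sweep(centered, 2, apply(dataPoints, 2, sd), "/")

# scale() does both steps; it also stores the centers and scales as attributes
by_scale <- scale(dataPoints)
print(all.equal(by_sweep, by_scale, check.attributes = FALSE))
```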

NB:对于这个简单的例子来说,同样的结果当然可以通过应用(dataPoints, 2, scale)更容易实现。

#1


1172  

R has many *apply functions which are ably described in the help files (e.g. ?apply). There are enough of them, though, that beginning useRs may have difficulty deciding which one is appropriate for their situation or even remembering them all. They may have a general sense that "I should be using an *apply function here", but it can be tough to keep them all straight at first.

R有许多*应用功能,这些功能在帮助文件中可以很好地描述(例如,应用)。但是,有足够多的人,开始的用户可能很难决定哪一个适合他们的情况,甚至是记住他们。他们可能有一种普遍的感觉,即“我应该在这里使用*apply函数”,但要在一开始就把它们都弄清楚是很困难的。

Despite the fact (noted in other answers) that much of the functionality of the *apply family is covered by the extremely popular plyr package, the base functions remain useful and worth knowing.

This answer is intended to act as a sort of signpost for new useRs to help direct them to the correct *apply function for their particular problem. Note, this is not intended to simply regurgitate or replace the R documentation! The hope is that this answer helps you to decide which *apply function suits your situation and then it is up to you to research it further. With one exception, performance differences will not be addressed.

  • apply - When you want to apply a function to the rows or columns of a matrix (and higher-dimensional analogues); not generally advisable for data frames as it will coerce to a matrix first.

    # Two dimensional matrix
    M <- matrix(seq(1,16), 4, 4)
    
    # apply min to rows
    apply(M, 1, min)
    [1] 1 2 3 4
    
    # apply max to columns
    apply(M, 2, max)
    [1]  4  8 12 16
    
    # 3 dimensional array
    M <- array( seq(32), dim = c(4,4,2))
    
    # Apply sum across each M[*, , ] - i.e Sum across 2nd and 3rd dimension
    apply(M, 1, sum)
    # Result is one-dimensional
    [1] 120 128 136 144
    
    # Apply sum across each M[*, *, ] - i.e Sum across 3rd dimension
    apply(M, c(1,2), sum)
    # Result is two-dimensional
         [,1] [,2] [,3] [,4]
    [1,]   18   26   34   42
    [2,]   20   28   36   44
    [3,]   22   30   38   46
    [4,]   24   32   40   48
    

    If you want row/column means or sums for a 2D matrix, be sure to investigate the highly optimized, lightning-quick colMeans, rowMeans, colSums, rowSums.

  • lapply - When you want to apply a function to each element of a list in turn and get a list back.

    This is the workhorse of many of the other *apply functions. Peel back their code and you will often find lapply underneath.

    x <- list(a = 1, b = 1:3, c = 10:100) 
    lapply(x, FUN = length) 
    $a 
    [1] 1
    $b 
    [1] 3
    $c 
    [1] 91
    lapply(x, FUN = sum) 
    $a 
    [1] 1
    $b 
    [1] 6
    $c 
    [1] 5005
    
  • sapply - When you want to apply a function to each element of a list in turn, but you want a vector back, rather than a list.

    If you find yourself typing unlist(lapply(...)), stop and consider sapply.

    x <- list(a = 1, b = 1:3, c = 10:100)
    # Compare with above; a named vector, not a list 
    sapply(x, FUN = length)  
    a  b  c   
    1  3 91
    
    sapply(x, FUN = sum)   
    a    b    c    
    1    6 5005 
    

    In more advanced uses of sapply it will attempt to coerce the result to a multi-dimensional array, if appropriate. For example, if our function returns vectors of the same length, sapply will use them as columns of a matrix:

    sapply(1:5,function(x) rnorm(3,x))
    

    If our function returns a 2 dimensional matrix, sapply will do essentially the same thing, treating each returned matrix as a single long vector:

    sapply(1:5,function(x) matrix(x,2,2))
    

    Unless we specify simplify = "array", in which case it will use the individual matrices to build a multi-dimensional array:

    sapply(1:5,function(x) matrix(x,2,2), simplify = "array")
    

    Each of these behaviors is of course contingent on our function returning vectors or matrices of the same length or dimension.

  • vapply - When you want to use sapply but perhaps need to squeeze some more speed out of your code, or want R to check that each result has the type and length you expect.

    For vapply, you basically give R an example of what sort of thing your function will return, which can save some time coercing returned values to fit in a single atomic vector.

    x <- list(a = 1, b = 1:3, c = 10:100)
    #Note that since the advantage here is mainly speed, this
    # example is only for illustration. We're telling R that
    # everything returned by length() should be an integer of 
    # length 1. 
    vapply(x, FUN = length, FUN.VALUE = 0L) 
    a  b  c  
    1  3 91
    
  • mapply - For when you have several data structures (e.g. vectors, lists) and you want to apply a function to the 1st elements of each, and then the 2nd elements of each, etc., coercing the result to a vector/array as in sapply.

    This is multivariate in the sense that your function must accept multiple arguments.

    #Sums the 1st elements, the 2nd elements, etc. 
    mapply(sum, 1:5, 1:5, 1:5) 
    [1]  3  6  9 12 15
    #To do rep(1,4), rep(2,3), etc.
    mapply(rep, 1:4, 4:1)   
    [[1]]
    [1] 1 1 1 1
    
    [[2]]
    [1] 2 2 2
    
    [[3]]
    [1] 3 3
    
    [[4]]
    [1] 4
    
  • Map - A wrapper to mapply with SIMPLIFY = FALSE, so it is guaranteed to return a list.

    Map(sum, 1:5, 1:5, 1:5)
    [[1]]
    [1] 3
    
    [[2]]
    [1] 6
    
    [[3]]
    [1] 9
    
    [[4]]
    [1] 12
    
    [[5]]
    [1] 15
    
  • rapply - For when you want to apply a function to each element of a nested list structure, recursively.

    To give you some idea of how uncommon rapply is, I forgot about it when first posting this answer! Obviously, I'm sure many people use it, but YMMV. rapply is best illustrated with a user-defined function to apply:

    # Append ! to string, otherwise increment
    myFun <- function(x){
        if(is.character(x)){
          return(paste(x,"!",sep=""))
        }
        else{
          return(x + 1)
        }
    }
    
    #A nested list structure
    l <- list(a = list(a1 = "Boo", b1 = 2, c1 = "Eeek"), 
              b = 3, c = "Yikes", 
              d = list(a2 = 1, b2 = list(a3 = "Hey", b3 = 5)))
    
    
    # Result is named vector, coerced to character          
    rapply(l, myFun)
    
    # Result is a nested list like l, with values altered
    rapply(l, myFun, how="replace")
    
  • tapply - For when you want to apply a function to subsets of a vector and the subsets are defined by some other vector, usually a factor.

    The black sheep of the *apply family, of sorts. The help file's use of the phrase "ragged array" can be a bit confusing, but it is actually quite simple.

    A vector:

    x <- 1:20
    

    A factor (of the same length!) defining groups:

    y <- factor(rep(letters[1:5], each = 4))
    

    Add up the values in x within each subgroup defined by y:

    tapply(x, y, sum)  
     a  b  c  d  e  
    10 26 42 58 74 
    

    More complex examples can be handled where the subgroups are defined by the unique combinations of a list of several factors. tapply is similar in spirit to the split-apply-combine functions that are common in R (aggregate, by, ave, ddply, etc.), hence its black sheep status.
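
As a minimal sketch of the several-factors case just mentioned: passing a list of two factors to tapply gives one cell per combination of their levels. The vectors x, g1 and g2 below are made up for illustration and are not from the original answer.

```r
# Hypothetical data: 12 values and two grouping factors of the same length
x  <- 1:12
g1 <- factor(rep(c("a", "b", "c"), each = 4))
g2 <- factor(rep(c("u", "v"), times = 6))

# One cell per (g1, g2) combination; the result is a 3 x 2 matrix
res <- tapply(x, list(g1, g2), sum)
res
```

Rows follow the levels of the first factor and columns the levels of the second, so res["a", "u"] is the sum of the x values that fall in both group a and group u.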

#2


167  

As a side note, here is how the various plyr functions correspond to the base *apply functions (from the intro to plyr document on the plyr webpage, http://had.co.nz/plyr/):

Base function   Input   Output   plyr function
-------------------------------------------------
aggregate       d       d        ddply + colwise
apply           a       a/l      aaply / alply
by              d       l        dlply
lapply          l       l        llply
mapply          a       a/l      maply / mlply
replicate       r       a/l      raply / rlply
sapply          l       a        laply

(Input/Output codes: a = array, d = data frame, l = list, r = number of replications.)

One of the goals of plyr is to provide consistent naming conventions for each of the functions, encoding the input and output data types in the function name. It also provides consistency in output, in that output from dlply() is easily passable to ldply() to produce useful output, etc.

Conceptually, learning plyr is no more difficult than understanding the base *apply functions.

plyr and reshape functions have replaced almost all of these functions in my everyday use. But, also from the Intro to Plyr document:

Related functions tapply and sweep have no corresponding function in plyr, and remain useful. merge is useful for combining summaries with the original data.

#3


116  

From slide 21 of http://www.slideshare.net/hadley/plyr-one-data-analytic-strategy:

[Image: slide 21 of the presentation linked above]

(Hopefully it's clear that apply corresponds to @Hadley's aaply and aggregate corresponds to @Hadley's ddply etc. Slide 20 of the same slideshare will clarify if you don't get it from this image.)

(on the left is input, on the top is output)

#4


84  

First start with Joran's excellent answer -- doubtful anything can better that.

Then the following mnemonics may help to remember the distinctions between each. Whilst some are obvious, others may be less so --- for these you'll find justification in Joran's discussions.

Mnemonics

  • lapply is a list apply which acts on a list or vector and returns a list.
  • sapply is a simple lapply (the function defaults to returning a vector or matrix when possible).
  • vapply is a verified apply (allows the return object type to be prespecified).
  • rapply is a recursive apply for nested lists, i.e. lists within lists.
  • tapply is a tagged apply where the tags identify the subsets.
  • apply is generic: applies a function to a matrix's rows or columns (or, more generally, to dimensions of an array).
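
To make the first three mnemonics concrete, here is a small sketch that runs the same function through lapply, sapply and vapply. The input list x is made up for illustration.

```r
# Hypothetical input: a list of two integer vectors
x <- list(a = 1:3, b = 4:6)

l_res <- lapply(x, range)                           # list of two length-2 vectors
s_res <- sapply(x, range)                           # simplified to a 2 x 2 matrix
v_res <- vapply(x, range, FUN.VALUE = integer(2))   # same shape, checked against the template

# vapply stops with an error when a result does not match FUN.VALUE;
# here the template says length 1 but range() returns length 2
bad <- tryCatch(vapply(x, range, FUN.VALUE = integer(1)),
                error = function(e) "error")
```

The "verified" part is the last call: sapply would quietly return whatever shape it can build, while vapply refuses results that do not fit the declared template.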

Building the Right Background

If using the apply family still feels a bit alien to you, then it might be that you're missing a key point of view.

These two articles can help. They provide the necessary background to motivate the functional programming techniques that are being provided by the apply family of functions.

Users of Lisp will recognise the paradigm immediately. If you're not familiar with Lisp, once you get your head around FP, you'll have gained a powerful point of view for use in R -- and apply will make a lot more sense.

#5


34  

I realized that the (very excellent) answers in this post lack explanations of by and aggregate, so here is my contribution.

BY

The by function, as stated in the documentation, can be thought of as a "wrapper" for tapply. The power of by arises when we want to compute a task that tapply can't handle. First, compare the two on a task that both can handle:

ct <- tapply(iris$Sepal.Width , iris$Species , summary )
cb <- by(iris$Sepal.Width , iris$Species , summary )

 cb
iris$Species: setosa
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.300   3.200   3.400   3.428   3.675   4.400 
-------------------------------------------------------------- 
iris$Species: versicolor
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.000   2.525   2.800   2.770   3.000   3.400 
-------------------------------------------------------------- 
iris$Species: virginica
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.200   2.800   3.000   2.974   3.175   3.800 


ct
$setosa
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.300   3.200   3.400   3.428   3.675   4.400 

$versicolor
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.000   2.525   2.800   2.770   3.000   3.400 

$virginica
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.200   2.800   3.000   2.974   3.175   3.800 

If we print these two objects, ct and cb, we "essentially" have the same results; the only differences are in how they are shown and in their class attributes: by for cb and array for ct.
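
A short sketch of that comparison, using the same two calls as above: the classes differ, but the per-group contents can be pulled out of either object by group name.

```r
ct <- tapply(iris$Sepal.Width, iris$Species, summary)
cb <- by(iris$Sepal.Width, iris$Species, summary)

class(ct)   # "array"
class(cb)   # "by"

# Either object can be indexed by group label
ct_setosa <- ct[["setosa"]]
cb_setosa <- cb[["setosa"]]
all.equal(as.numeric(ct_setosa), as.numeric(cb_setosa))
```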

As I've said, the power of by arises when we can't use tapply; the following code is one example:

 tapply(iris, iris$Species, summary )
Error in tapply(iris, iris$Species, summary) : 
  arguments must have same length

R says that the arguments must have the same length. What we want is "the summary of every variable in iris along the factor Species", but R just can't do that here because it does not know how to handle it.

With the by function, R dispatches a specific method for the data frame class and then lets the summary function work even though the length of the first argument (and its type too) is different.

bywork <- by(iris, iris$Species, summary )

bywork
iris$Species: setosa
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
 Min.   :4.300   Min.   :2.300   Min.   :1.000   Min.   :0.100   setosa    :50  
 1st Qu.:4.800   1st Qu.:3.200   1st Qu.:1.400   1st Qu.:0.200   versicolor: 0  
 Median :5.000   Median :3.400   Median :1.500   Median :0.200   virginica : 0  
 Mean   :5.006   Mean   :3.428   Mean   :1.462   Mean   :0.246                  
 3rd Qu.:5.200   3rd Qu.:3.675   3rd Qu.:1.575   3rd Qu.:0.300                  
 Max.   :5.800   Max.   :4.400   Max.   :1.900   Max.   :0.600                  
-------------------------------------------------------------- 
iris$Species: versicolor
  Sepal.Length    Sepal.Width     Petal.Length   Petal.Width          Species  
 Min.   :4.900   Min.   :2.000   Min.   :3.00   Min.   :1.000   setosa    : 0  
 1st Qu.:5.600   1st Qu.:2.525   1st Qu.:4.00   1st Qu.:1.200   versicolor:50  
 Median :5.900   Median :2.800   Median :4.35   Median :1.300   virginica : 0  
 Mean   :5.936   Mean   :2.770   Mean   :4.26   Mean   :1.326                  
 3rd Qu.:6.300   3rd Qu.:3.000   3rd Qu.:4.60   3rd Qu.:1.500                  
 Max.   :7.000   Max.   :3.400   Max.   :5.10   Max.   :1.800                  
-------------------------------------------------------------- 
iris$Species: virginica
  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
 Min.   :4.900   Min.   :2.200   Min.   :4.500   Min.   :1.400   setosa    : 0  
 1st Qu.:6.225   1st Qu.:2.800   1st Qu.:5.100   1st Qu.:1.800   versicolor: 0  
 Median :6.500   Median :3.000   Median :5.550   Median :2.000   virginica :50  
 Mean   :6.588   Mean   :2.974   Mean   :5.552   Mean   :2.026                  
 3rd Qu.:6.900   3rd Qu.:3.175   3rd Qu.:5.875   3rd Qu.:2.300                  
 Max.   :7.900   Max.   :3.800   Max.   :6.900   Max.   :2.500     

It works indeed, and the result is very surprising. It is an object of class by that holds, for each Species, the summary of each variable.

Note that if the first argument is a data frame, the dispatched function must have a method for that class of objects. For example, if we use this code with the mean function, we get a result that makes no sense at all:

 by(iris, iris$Species, mean)
iris$Species: setosa
[1] NA
------------------------------------------- 
iris$Species: versicolor
[1] NA
------------------------------------------- 
iris$Species: virginica
[1] NA
Warning messages:
1: In mean.default(data[x, , drop = FALSE], ...) :
  argument is not numeric or logical: returning NA
2: In mean.default(data[x, , drop = FALSE], ...) :
  argument is not numeric or logical: returning NA
3: In mean.default(data[x, , drop = FALSE], ...) :
  argument is not numeric or logical: returning NA
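
One possible way around this, sketched here as an assumption rather than as part of the original answer, is to hand by a function that does accept a data frame, such as colMeans, after dropping the non-numeric Species column:

```r
# Keep only the four numeric columns, then use colMeans,
# which accepts a data frame whose columns are all numeric
num_means <- by(iris[, 1:4], iris$Species, colMeans)

# Per-group result is a named numeric vector, one entry per column
num_means[["setosa"]]
```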

AGGREGATE

aggregate can be seen as another way to use tapply if we use it like this:

at <- tapply(iris$Sepal.Length , iris$Species , mean)
ag <- aggregate(iris$Sepal.Length , list(iris$Species), mean)

 at
    setosa versicolor  virginica 
     5.006      5.936      6.588 
 ag
     Group.1     x
1     setosa 5.006
2 versicolor 5.936
3  virginica 6.588

The two immediate differences are that the second argument of aggregate must be a list, while for tapply it can (but need not) be a list, and that the output of aggregate is a data frame while that of tapply is an array.
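
A small sketch of both points. Naming the element of the grouping list is an addition of mine (the answer uses an unnamed list); it only changes the name of the grouping column.

```r
# A named list element replaces the default "Group.1" column name
ag_named <- aggregate(iris$Sepal.Length, list(Species = iris$Species), mean)
names(ag_named)   # "Species" "x"

# tapply takes the factor directly, with or without list()
at1 <- tapply(iris$Sepal.Length, iris$Species, mean)
at2 <- tapply(iris$Sepal.Length, list(iris$Species), mean)

is.data.frame(ag_named)   # aggregate gives a data frame
is.array(at1)             # tapply gives an array
```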

The power of aggregate is that it can easily handle subsets of the data with the subset argument, and that it also has methods for ts objects and formulas.

These elements make aggregate easier to work with than tapply in some situations. Here are some examples (available in the documentation):

ag <- aggregate(len ~ ., data = ToothGrowth, mean)

 ag
  supp dose   len
1   OJ  0.5 13.23
2   VC  0.5  7.98
3   OJ  1.0 22.70
4   VC  1.0 16.77
5   OJ  2.0 26.06
6   VC  2.0 26.14

We can achieve the same with tapply but the syntax is slightly harder and the output (in some circumstances) less readable:

att <- tapply(ToothGrowth$len, list(ToothGrowth$dose, ToothGrowth$supp), mean)

 att
       OJ    VC
0.5 13.23  7.98
1   22.70 16.77
2   26.06 26.14

There are other times when we can't use by or tapply and we have to use aggregate.

 ag1 <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, mean)

 ag1
  Month    Ozone     Temp
1     5 23.61538 66.73077
2     6 29.44444 78.22222
3     7 59.11538 83.88462
4     8 59.96154 83.96154
5     9 31.44828 76.89655

We cannot obtain the previous result with tapply in one call; we have to calculate the mean along Month for each variable and then combine them. Also note that we have to pass na.rm = TRUE, because the formula method of aggregate has na.action = na.omit by default. The Temp column still differs slightly from the aggregate result, because na.omit drops every row in which Ozone is missing before Temp is averaged, whereas tapply averages all Temp values:

ta1 <- tapply(airquality$Ozone, airquality$Month, mean, na.rm = TRUE)
ta2 <- tapply(airquality$Temp, airquality$Month, mean, na.rm = TRUE)

 cbind(ta1, ta2)
       ta1      ta2
5 23.61538 65.54839
6 29.44444 79.10000
7 59.11538 83.90323
8 59.96154 83.96774
9 31.44828 76.90000
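
To reproduce the aggregate numbers exactly with tapply, one option (a sketch, not from the original answer) is to drop the incomplete rows first, which is what na.omit does inside the formula method. The names cc and ta_cc are made up for illustration.

```r
# Keep only rows where both Ozone and Temp are present, as na.omit would
cc <- airquality[complete.cases(airquality[c("Ozone", "Temp")]), ]

# Per-month means on the complete rows only
ta_cc <- cbind(Ozone = tapply(cc$Ozone, cc$Month, mean),
               Temp  = tapply(cc$Temp,  cc$Month, mean))

# Reference result from the formula method of aggregate
ag1 <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, mean)

all.equal(unname(ta_cc[, "Temp"]), ag1$Temp)
```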

With by, on the other hand, we cannot achieve that at all; the following call fails (though most likely that is due to the supplied function, mean, rather than to by itself):

by(airquality[c("Ozone", "Temp")], airquality$Month, mean, na.rm = TRUE)

Other times the results are the same and the differences are just in the class of the object (and therefore in how it is shown/printed and, for example, how to subset it):

byagg <- by(airquality[c("Ozone", "Temp")], airquality$Month, summary)
aggagg <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, summary)

The previous code achieves the same goal and results; at some point, which tool to use is just a matter of personal taste and needs. The previous two objects have very different needs in terms of subsetting.
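
A sketch of what "different needs in terms of subsetting" looks like in practice. Note one caveat that goes beyond the original text: by passes rows with a missing Ozone value on to summary, while the formula method of aggregate drops them first, so the numbers need not match exactly.

```r
byagg  <- by(airquality[c("Ozone", "Temp")], airquality$Month, summary)
aggagg <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, summary)

# by object: index by group label to get that month's summary table
may_by <- byagg[["5"]]

# aggregate result: a data frame whose Ozone and Temp columns are matrices,
# so a single statistic is addressed by row (month) and column (statistic)
may_mean_ozone <- aggagg$Ozone[aggagg$Month == 5, "Mean"]
```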

#6


27  

There are lots of great answers which discuss differences in the use cases for each function. None of the answers discuss the differences in performance. That is reasonable, because the various functions expect various inputs and produce various outputs, yet most of them share the general objective of evaluating by series/groups. My answer is going to focus on performance. Because of the above, creating the input from the vectors is included in the timing; also, the apply function is not measured.

I have tested two different functions, sum and length, at once. The volume tested is 50M on input and 500K on output (n = 5e7 and k = 5e5 in the code below). I have also included two currently popular packages which were not widely used at the time the question was asked, data.table and dplyr. Both are definitely worth a look if you are aiming for good performance.

library(dplyr)
library(data.table)
set.seed(123)
n = 5e7
k = 5e5
x = runif(n)
grp = sample(k, n, TRUE)

timing = list()

# sapply
timing[["sapply"]] = system.time({
    lt = split(x, grp)
    r.sapply = sapply(lt, function(x) list(sum(x), length(x)), simplify = FALSE)
})

# lapply
timing[["lapply"]] = system.time({
    lt = split(x, grp)
    r.lapply = lapply(lt, function(x) list(sum(x), length(x)))
})

# tapply
timing[["tapply"]] = system.time(
    r.tapply <- tapply(x, list(grp), function(x) list(sum(x), length(x)))
)

# by
timing[["by"]] = system.time(
    r.by <- by(x, list(grp), function(x) list(sum(x), length(x)), simplify = FALSE)
)

# aggregate
timing[["aggregate"]] = system.time(
    r.aggregate <- aggregate(x, list(grp), function(x) list(sum(x), length(x)), simplify = FALSE)
)

# dplyr
timing[["dplyr"]] = system.time({
    df = tibble(x, grp)  # tibble() supersedes the deprecated data_frame()
    r.dplyr = summarise(group_by(df, grp), sum(x), n())
})

# data.table
timing[["data.table"]] = system.time({
    dt = setnames(setDT(list(x, grp)), c("x","grp"))
    r.data.table = dt[, .(sum(x), .N), grp]
})

# all output size match to group count
sapply(list(sapply=r.sapply, lapply=r.lapply, tapply=r.tapply, by=r.by, aggregate=r.aggregate, dplyr=r.dplyr, data.table=r.data.table), 
       function(x) (if(is.data.frame(x)) nrow else length)(x)==k)
#    sapply     lapply     tapply         by  aggregate      dplyr data.table 
#      TRUE       TRUE       TRUE       TRUE       TRUE       TRUE       TRUE 

# print timings
as.data.table(sapply(timing, `[[`, "elapsed"), keep.rownames = TRUE
              )[,.(fun = V1, elapsed = V2)
                ][order(-elapsed)]
#          fun elapsed
#1:  aggregate 109.139
#2:         by  25.738
#3:      dplyr  18.978
#4:     tapply  17.006
#5:     lapply  11.524
#6:     sapply  11.326
#7: data.table   2.686
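
Before trusting any timing comparison, it can help to confirm that the approaches compute the same thing. The following is a scaled-down base-R sketch of such a check; the sizes are my own choice, far smaller than in the benchmark, and no timings are claimed for it.

```r
set.seed(123)
n   <- 1e5              # much smaller than the benchmark's 5e7
k   <- 1e2
x   <- runif(n)
grp <- sample(k, n, TRUE)

# Group sums computed three ways
s_lapply <- unlist(lapply(split(x, grp), sum))
s_tapply <- tapply(x, grp, sum)
s_agg    <- aggregate(x, list(grp = grp), sum)

# The values agree; only the container (vector, array, data frame) differs
all.equal(unname(s_lapply), as.vector(s_tapply))
all.equal(as.vector(s_tapply), s_agg$x)
```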

#7


19  

It is maybe worth mentioning ave. ave is tapply's friendly cousin. It returns results in a form that you can plug straight back into your data frame.

dfr <- data.frame(a=1:20, f=rep(LETTERS[1:5], each=4))
means <- tapply(dfr$a, dfr$f, mean)
##  A    B    C    D    E 
## 2.5  6.5 10.5 14.5 18.5 

## great, but putting it back in the data frame is another line:

dfr$m <- means[dfr$f]

dfr$m2 <- ave(dfr$a, dfr$f, FUN=mean) # NB argument name FUN is needed!
dfr
##   a f    m   m2
##   1 A  2.5  2.5
##   2 A  2.5  2.5
##   3 A  2.5  2.5
##   4 A  2.5  2.5
##   5 B  6.5  6.5
##   6 B  6.5  6.5
##   7 B  6.5  6.5
##   ...

There is nothing in the base package that works like ave for whole data frames (as by is like tapply for data frames). But you can fudge it:

dfr$foo <- ave(1:nrow(dfr), dfr$f, FUN=function(x) {
    x <- dfr[x,]
    sum(x$m*x$m2)
})
dfr
##     a f    m   m2    foo
## 1   1 A  2.5  2.5    25
## 2   2 A  2.5  2.5    25
## 3   3 A  2.5  2.5    25
## ...
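
On the "argument name FUN is needed" remark above: ave takes its grouping variables through `...`, so an unnamed function would be read as one more grouping variable. A brief sketch, reusing the same dfr construction (m_default and m_max are names made up here):

```r
dfr <- data.frame(a = 1:20, f = rep(LETTERS[1:5], each = 4))

# FUN defaults to mean, so this matches ave(dfr$a, dfr$f, FUN = mean)
m_default <- ave(dfr$a, dfr$f)

# Any other function has to be passed by name
m_max <- ave(dfr$a, dfr$f, FUN = max)

head(data.frame(dfr, m_default, m_max), 5)
```

Both results have one value per row of dfr, which is what makes ave convenient for adding a group-level column.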

#8


19  

Despite all the great answers here, there are 2 more base functions that deserve to be mentioned: the useful outer function and the obscure eapply function.

outer

outer is a very useful function disguised as a more mundane one. If you read the help for outer, its description says:

The outer product of the arrays X and Y is the array A with dimension  
c(dim(X), dim(Y)) where element A[c(arrayindex.x, arrayindex.y)] =   
FUN(X[arrayindex.x], Y[arrayindex.y], ...).

which makes it seem like this is only useful for linear algebra type things. However, it can be used much like mapply to apply a function to two vectors of inputs. The difference is that mapply will apply the function to the first element of each vector, then the second element of each, etc., whereas outer will apply the function to every combination of one element from the first vector and one from the second. For example:

 A<-c(1,3,5,7,9)
 B<-c(0,3,6,9,12)

mapply(FUN=pmax, A, B)
[1]  1  3  6  9 12

outer(A,B, pmax)
      [,1] [,2] [,3] [,4] [,5]
 [1,]    1    3    6    9   12
 [2,]    3    3    6    9   12
 [3,]    5    5    6    9   12
 [4,]    7    7    7    9   12
 [5,]    9    9    9    9   12

I have personally used this when I have a vector of values and a vector of conditions and wish to see which values meet which conditions.
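
A sketch of that values-versus-conditions use. The thresholds vector and the scalar_fun helper are hypothetical, chosen only for illustration.

```r
values     <- c(1, 3, 5, 7, 9)
thresholds <- c(2, 6)          # hypothetical cut-offs

# Rows are values, columns are thresholds: TRUE where value > threshold
meets <- outer(values, thresholds, ">")
dimnames(meets) <- list(as.character(values), paste0(">", thresholds))
meets

# outer expects a vectorized function; a scalar one can be wrapped in Vectorize()
scalar_fun <- function(v, t) if (v > t) "yes" else "no"
yn <- outer(values, thresholds, Vectorize(scalar_fun))
```

The Vectorize step matters because outer calls the function once with two long vectors, not once per pair.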

eapply

eapply is like lapply except that rather than applying a function to every element in a list, it applies a function to every element in an environment. For example if you want to find a list of user defined functions in the global environment:

A<-c(1,3,5,7,9)
B<-c(0,3,6,9,12)
C<-list(x=1, y=2)
D<-function(x){x+1}

> eapply(.GlobalEnv, is.function)
$A
[1] FALSE

$B
[1] FALSE

$C
[1] FALSE

$D
[1] TRUE 

Frankly, I don't use this very much, but if you are building a lot of packages or creating a lot of environments it may come in handy.
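
To go from the TRUE/FALSE list above to the actual names of the functions, the result of eapply can be flattened and used as a filter. This sketch uses a fresh, hypothetical environment e instead of the global one so that it does not depend on what else is defined in the session.

```r
# A hypothetical environment holding a mix of objects
e <- new.env()
assign("A", c(1, 3, 5), envir = e)
assign("C", list(x = 1, y = 2), envir = e)
assign("D", function(x) x + 1, envir = e)

# eapply returns a named list (element order is not guaranteed),
# so flatten it and keep the names where the test is TRUE
is_fun    <- unlist(eapply(e, is.function))
fun_names <- names(is_fun)[is_fun]
fun_names
```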

#9


4  

I recently discovered the rather useful sweep function and add it here for the sake of completeness:

sweep

The basic idea is to sweep through an array row- or column-wise and return a modified array. An example will make this clear (source: datacamp):

Let's say you have a matrix and want to standardize it column-wise:

dataPoints <- matrix(4:15, nrow = 4)

# Find means per column with `apply()`
dataPoints_means <- apply(dataPoints, 2, mean)

# Find standard deviation with `apply()`
dataPoints_sdev <- apply(dataPoints, 2, sd)

# Center the points 
dataPoints_Trans1 <- sweep(dataPoints, 2, dataPoints_means,"-")
print(dataPoints_Trans1)
##      [,1] [,2] [,3]
## [1,] -1.5 -1.5 -1.5
## [2,] -0.5 -0.5 -0.5
## [3,]  0.5  0.5  0.5
## [4,]  1.5  1.5  1.5
# Return the result
dataPoints_Trans1
##      [,1] [,2] [,3]
## [1,] -1.5 -1.5 -1.5
## [2,] -0.5 -0.5 -0.5
## [3,]  0.5  0.5  0.5
## [4,]  1.5  1.5  1.5
# Normalize
dataPoints_Trans2 <- sweep(dataPoints_Trans1, 2, dataPoints_sdev, "/")

# Return the result
dataPoints_Trans2
##            [,1]       [,2]       [,3]
## [1,] -1.1618950 -1.1618950 -1.1618950
## [2,] -0.3872983 -0.3872983 -0.3872983
## [3,]  0.3872983  0.3872983  0.3872983
## [4,]  1.1618950  1.1618950  1.1618950

NB: for this simple example the same result can of course be achieved more easily by
apply(dataPoints, 2, scale)
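
As a quick check of that remark, the two-step sweep can be compared against scale() called directly on the matrix. This is a sketch; the variable names centered, manual and auto are mine.

```r
dataPoints <- matrix(4:15, nrow = 4)

# Two sweeps: subtract the column means, then divide by the column sds
centered <- sweep(dataPoints, 2, colMeans(dataPoints), "-")
manual   <- sweep(centered,   2, apply(dataPoints, 2, sd), "/")

# scale() gives the same numbers, plus "scaled:center" and "scaled:scale" attributes
auto <- scale(dataPoints)
all.equal(manual, auto, check.attributes = FALSE)
```

sweep stays useful when the statistic to remove is not a mean or standard deviation, since the summary vector and the operator are both arguments.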
