Why is the parallel package slower than just using apply?

Date: 2021-08-16 13:50:11

I am trying to determine when to use the parallel package to speed up the time necessary to run some analysis. One of the things I need to do is create matrices comparing variables in two data frames with differing numbers of rows. I asked a question about an efficient way of doing this on * and wrote about tests on my blog. Since I am now comfortable with the best approach, I wanted to speed up the process by running it in parallel. The results below are based on a 2 GHz i7 Mac with 8 GB of RAM. I am surprised that the parallel package, the parSapply function in particular, is worse than just using the apply function. The code to replicate this is below. Note that I am currently only using one of the two columns I create, but eventually want to use both.

Execution Time http://jason.bryer.org/images/ParalleVsApplyTiming.png

require(parallel)
require(ggplot2)
require(reshape2)
set.seed(2112)
results <- list()
sizes <- seq(1000, 30000, by=5000)
pb <- txtProgressBar(min=0, max=length(sizes), style=3)
for(cnt in 1:length(sizes)) {
    i <- sizes[cnt]
    df1 <- data.frame(row.names=1:i, 
                      var1=sample(c(TRUE,FALSE), i, replace=TRUE), 
                      var2=sample(1:10, i, replace=TRUE) )
    df2 <- data.frame(row.names=(i + 1):(i + i), 
                      var1=sample(c(TRUE,FALSE), i, replace=TRUE),
                      var2=sample(1:10, i, replace=TRUE))
    tm1 <- system.time({
        df6 <- sapply(df2$var1, FUN=function(x) { x == df1$var1 })
        dimnames(df6) <- list(row.names(df1), row.names(df2))
    })
    rm(df6)
    tm2 <- system.time({
        cl <- makeCluster(getOption('cl.cores', detectCores()))
        tm3 <- system.time({
            df7 <- parSapply(cl, df1$var1, FUN=function(x, df2) { x == df2$var1 }, df2=df2)
            dimnames(df7) <- list(row.names(df1), row.names(df2))
        })
        stopCluster(cl)
    })
    rm(df7)
    results[[cnt]] <- c(apply=tm1, parallel.total=tm2, parallel.exec=tm3)
    setTxtProgressBar(pb, cnt)
}

# bind the timing vectors into rows before selecting the named columns
toplot <- as.data.frame(do.call(rbind, results))[,c('apply.user.self',
                          'parallel.total.user.self','parallel.exec.user.self')]
toplot$size <- sizes
toplot <- melt(toplot, id='size')

ggplot(toplot, aes(x=size, y=value, colour=variable)) + geom_line() + 
    xlab('Vector Size') + ylab('Time (seconds)')
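For reference, the per-column sapply comparison above can also be written as a single vectorized outer() call. The following is a sketch, not part of the original question; the reduced df1/df2 setup is illustrative rather than the question's full loop:

```r
# Sketch (not from the question): one vectorized outer() call builds the
# whole comparison matrix at once, with no per-element function dispatch.
set.seed(2112)
i <- 1000
df1 <- data.frame(row.names = 1:i,
                  var1 = sample(c(TRUE, FALSE), i, replace = TRUE))
df2 <- data.frame(row.names = (i + 1):(i + i),
                  var1 = sample(c(TRUE, FALSE), i, replace = TRUE))
df6 <- outer(df1$var1, df2$var1, "==")   # [r, c] is df1$var1[r] == df2$var1[c]
dimnames(df6) <- list(row.names(df1), row.names(df2))
```

If this vectorized form is fast enough on its own, parallelizing the comparison may be unnecessary.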

3 Answers

#1


18  

Running jobs in parallel incurs overhead. Only if the jobs you send to the worker nodes take a significant amount of time does parallelization improve overall performance. When the individual jobs take only milliseconds, the overhead of constantly dispatching jobs will degrade overall performance. The trick is to divide the work over the nodes in such a way that the jobs are sufficiently long, say at least a few seconds. I used this to great effect running six Fortran models simultaneously: those individual model runs took hours, which made the overhead all but negligible.
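The chunking advice can be sketched as follows, assuming a local two-worker PSOCK cluster: each worker receives one large block of columns rather than one element at a time.

```r
# Sketch of the chunking idea: dispatch one big vectorized job per worker
# instead of one tiny job per element.
library(parallel)

set.seed(1)
x <- sample(c(TRUE, FALSE), 200, replace = TRUE)
y <- sample(c(TRUE, FALSE), 200, replace = TRUE)

cl <- makeCluster(2)
# one ordered chunk of column indices per worker
chunks <- split(seq_along(y), cut(seq_along(y), 2, labels = FALSE))
blocks <- parLapply(cl, chunks,
                    function(idx, xs, ys) outer(xs, ys[idx], "=="),
                    xs = x, ys = y)
stopCluster(cl)
mat <- do.call(cbind, unname(blocks))   # reassemble the full 200 x 200 matrix
```

With only two jobs in flight, the dispatch overhead is paid twice instead of once per element.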

Note that I haven't run your example, but the situation I describe above is often the issue when parallelization takes longer than running sequentially.

#2


20  

These differences can be attributed to 1) communication overhead (especially if you run across nodes) and 2) performance overhead (for example, if your job is not that intensive compared to the cost of initiating the parallelisation). Usually, if the task you are parallelising is not that time-consuming, then you will mostly find that parallelisation does NOT have much of an effect (the gains are most visible on huge datasets).
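The dispatch cost itself is easy to observe. The sketch below (assuming a local two-worker cluster) times a trivial per-element job both ways; no expected ratio is claimed, since timings vary by machine.

```r
# Sketch: for trivial per-element work, the serialized round trips to the
# workers typically dominate, so the parallel timing is usually no better.
library(parallel)

f <- function(x) x + 1
cl <- makeCluster(2)
t_par <- system.time(r_par <- parSapply(cl, 1:1000, f))["elapsed"]
stopCluster(cl)
t_seq <- system.time(r_seq <- sapply(1:1000, f))["elapsed"]
# The two results are identical; only the cost of obtaining them differs.
```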

Even though this may not directly answer your benchmark, I hope it is straightforward enough to relate to. As an example, I construct a data.frame with 1e6 rows, with 1e4 unique entries in column group and some values in column val. Then I run ddply from plyr, both in parallel using doMC and without parallelisation.

df <- data.frame(group = as.factor(sample(1:1e4, 1e6, replace = T)), 
                 val = sample(1:10, 1e6, replace = T))
> head(df)
#   group val
# 1  8498   8
# 2  5253   6
# 3  1495   1
# 4  7362   9
# 5  2344   6
# 6  5602   9

> dim(df)
# [1] 1000000       2

require(plyr)
require(doMC)
registerDoMC(20) # 20 processors

# parallelisation using doMC + plyr 
P.PLYR <- function() {
    o1 <- ddply(df, .(group), function(x) sum(x$val), .parallel = TRUE)
}

# no parallelisation
PLYR <- function() {
    o2 <- ddply(df, .(group), function(x) sum(x$val), .parallel = FALSE)
}

require(rbenchmark)
benchmark(P.PLYR(), PLYR(), replications = 2, order = "elapsed")

      test replications elapsed relative user.self sys.self user.child sys.child
2   PLYR()            2   8.925    1.000     8.865    0.068      0.000     0.000
1 P.PLYR()            2  30.637    3.433    15.841   13.945      8.944    38.858

As you can see, the parallel version of plyr runs about 3.4 times slower.

Now, let me use the same data.frame, but instead of computing sum, let me construct a bit more demanding function, say, median(.) * median(rnorm(1e4)) (meaningless, yes):

You'll see that the tides are beginning to shift:

# parallelisation using doMC + plyr 
P.PLYR <- function() {
    o1 <- ddply(df, .(group), function(x) 
      median(x$val) * median(rnorm(1e4)), .parallel = TRUE)
}

# no parallelisation
PLYR <- function() {
    o2 <- ddply(df, .(group), function(x) 
         median(x$val) * median(rnorm(1e4)), .parallel = FALSE)
}

> benchmark(P.PLYR(), PLYR(), replications = 2, order = "elapsed")
      test replications elapsed relative user.self sys.self user.child sys.child
1 P.PLYR()            2  41.911    1.000    15.265   15.369    141.585    34.254
2   PLYR()            2  73.417    1.752    73.372    0.052      0.000     0.000

Here, the parallel version is 1.752 times faster than the non-parallel version.

Edit: Following @Paul's comment, I just implemented a small delay using Sys.sleep(). Of course the results are obvious. But just for the sake of completeness, here's the result on a 20*2 data.frame:

df <- data.frame(group=sample(letters[1:5], 20, replace=T), val=sample(20))

# parallelisation using doMC + plyr 
P.PLYR <- function() {
    o1 <- ddply(df, .(group), function(x) {
    Sys.sleep(2)
    median(x$val)
    }, .parallel = TRUE)
}

# no parallelisation
PLYR <- function() {
    o2 <- ddply(df, .(group), function(x) {
        Sys.sleep(2)
        median(x$val)
    }, .parallel = FALSE)
}

> benchmark(P.PLYR(), PLYR(), replications = 2, order = "elapsed")

#       test replications elapsed relative user.self sys.self user.child sys.child
# 1 P.PLYR()            2   4.116    1.000     0.056    0.056      0.024      0.04
# 2   PLYR()            2  20.050    4.871     0.028    0.000      0.000      0.00

The difference here is not surprising.

#3


7  

I completely agree with @Arun's and @PaulHiemestra's arguments concerning the "Why...?" part of your question.

However, it seems you can get some benefit from the parallel package in your situation (at least if you are not stuck on Windows). A possible solution is to use mclapply instead of parSapply; it relies on fast forking and shared memory.
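One caveat (an editorial note, not part of this answer): mclapply with mc.cores > 1 is not supported on Windows, where forking is unavailable, so a portable sketch would guard on the OS and fall back to a single core:

```r
# Portability sketch: force serial execution on Windows, where fork()
# (and hence mc.cores > 1) is unavailable.
library(parallel)

n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
squares <- unlist(mclapply(1:4, function(k) k^2, mc.cores = n_cores))
```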

tm2 <- system.time({
    tm3 <- system.time({
        df7 <- matrix(unlist(mclapply(df2$var1, FUN=function(x) { x == df1$var1 },
                                      mc.cores=8)), nrow=i)
        dimnames(df7) <- list(row.names(df1), row.names(df2))
    })
})

Of course, the nested system.time is not needed here. With my 2 cores I got:
