Strange results when benchmarking numpy with ATLAS and OpenBLAS

Time: 2021-09-11 23:35:10

I am trying to evaluate the performance of numpy linked against ATLAS compared to numpy linked against OpenBLAS. I get some strange results for ATLAS, which I describe below.

The Python code for benchmarking matrix-matrix multiplication (aka sgemm) looks like this:

import sys
sys.path.insert(0, "numpy-1.8.1")

import numpy
import timeit

for i in range(100, 501, 100):
    setup = "import numpy; m1 = numpy.random.rand(%d, %d).astype(numpy.float32)" % (i, i)
    timer = timeit.Timer("numpy.dot(m1, m1)", setup)
    times = timer.repeat(100, 1)
    print("%3d %7.4f %7.4f %7.4f"
          % (i, numpy.mean(times), numpy.min(times), numpy.max(times)))

If I run this script with numpy linked against ATLAS, I get large variations in the measured times. The first column shows the matrix size, followed by the mean, min, and max of the execution times obtained by running the matrix-matrix multiplication 100 times:

100  0.0003  0.0003  0.0004
200  0.0023  0.0010  0.0073
300  0.0052  0.0026  0.0178
400  0.0148  0.0066  0.0283
500  0.0295  0.0169  0.0531

If I repeat this procedure with numpy linked against OpenBLAS using one thread, the running times are much more stable:

100  0.0002  0.0002  0.0003
200  0.0014  0.0014  0.0015
300  0.0044  0.0044  0.0047
400  0.0102  0.0101  0.0105
500  0.0169  0.0168  0.0177
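For reference, the single-thread OpenBLAS run can be forced through an environment variable that OpenBLAS reads at startup; it must be set before numpy is imported, or the thread pool is already sized. A minimal sketch:

```python
import os

# Must be set before numpy is imported, otherwise OpenBLAS has
# already spawned its thread pool at its default size.
os.environ["OPENBLAS_NUM_THREADS"] = "1"

import numpy

m = numpy.random.rand(200, 200).astype(numpy.float32)
numpy.dot(m, m)  # now runs on a single OpenBLAS thread
```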

Can anybody explain this observation?

Edit: Additional information:

The observed min and max values for ATLAS are not outliers; the times are distributed over the given range.

I uploaded the ATLAS times for i=500 at https://gist.github.com/uweschmitt/768bd165477d7c14095e

The given times come from a different run, so the avg, min, and max values differ slightly.

Edit: Additional finding:

Could CPU throttling (http://www.scipy.org/scipylib/building/linux.html#step-1-disable-cpu-throttling) be the cause? I do not know enough about CPU throttling to judge its impact on my measurements. Regrettably, I cannot set/unset it on my target machine.
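One way to check is to read the current CPU frequency governor from sysfs; a throttling-prone machine typically reports something like `ondemand` or `powersave` rather than `performance`. A hedged sketch using the standard Linux cpufreq path (on other systems the function simply returns None):

```python
def cpu_governor(cpu=0):
    """Return the frequency-scaling governor for the given CPU,
    or None if the cpufreq sysfs entry is not available."""
    path = "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_governor" % cpu
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None

print(cpu_governor())  # e.g. 'performance', 'ondemand', or None
```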

1 solution

#1



I cannot reproduce this, but I think I know the reason. I am using Numpy 1.8.1 on a 64-bit Linux box.

First, my results with ATLAS (I have added the standard deviation in the last column):

100  0.0003  0.0002  0.0025  0.0003
200  0.0012  0.0010  0.0067  0.0006
300  0.0028  0.0026  0.0047  0.0004
400  0.0070  0.0059  0.0089  0.0004
500  0.0122  0.0109  0.0149  0.0009
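The extra column can be produced by extending the loop from the question with `numpy.std` (my sketch; the answer does not show the exact code used):

```python
import timeit

import numpy

# Same benchmark as in the question, with a standard-deviation column.
for i in range(100, 501, 100):
    setup = ("import numpy; "
             "m1 = numpy.random.rand(%d, %d).astype(numpy.float32)" % (i, i))
    times = timeit.Timer("numpy.dot(m1, m1)", setup).repeat(100, 1)
    print("%3d %7.4f %7.4f %7.4f %7.4f"
          % (i, numpy.mean(times), numpy.min(times),
             numpy.max(times), numpy.std(times)))
```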

And now, the results with MKL provided by Anaconda:

100  0.0003  0.0001  0.0155  0.0015
200  0.0005  0.0005  0.0006  0.0000
300  0.0018  0.0017  0.0021  0.0001
400  0.0039  0.0038  0.0042  0.0001
500  0.0079  0.0077  0.0084  0.0002

MKL is faster, but the spread is consistent.

ATLAS is tuned at compile time: it tries different configurations and algorithms and keeps the fastest one for your particular hardware. If you install a precompiled version, you are using the configuration that was optimal for the build machine, not for yours. This misconfiguration is the probable cause of the spread. In my case, I compiled ATLAS myself.

By contrast, OpenBLAS is hand-tuned for each specific architecture, so any binary install will be equivalent. MKL decides dynamically at run time.

This is what happens if I run the script with a Numpy installed from the repositories and linked against a pre-compiled ATLAS (SSE3 not activated):

100  0.0007  0.0003  0.0064  0.0007
200  0.0021  0.0015  0.0090  0.0009
300  0.0050  0.0040  0.0114  0.0010
400  0.0113  0.0101  0.0186  0.0011
500  0.0217  0.0192  0.0329  0.0020

These numbers are more similar to your data.
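Whether the CPU itself supports SSE3 can be checked from /proc/cpuinfo; note that Linux historically reports SSE3 under the flag name `pni` (Prescott New Instructions) rather than `sse3`. A hedged sketch (returns False on systems without /proc/cpuinfo):

```python
def has_cpu_flag(flag, path="/proc/cpuinfo"):
    """Return True if the flag appears in the CPU flags list.

    On Linux x86, SSE3 is reported as 'pni', not 'sse3'."""
    try:
        with open(path) as f:
            return any(flag in line.split()
                       for line in f if line.startswith("flags"))
    except OSError:
        return False

print(has_cpu_flag("pni"))
```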

For completeness, I asked a friend to run the snippet on her machine, which has numpy installed from the Ubuntu repositories and no ATLAS, so Numpy falls back to its crappy default:

100  0.0007  0.0007  0.0008  0.0000
200  0.0058  0.0053  0.0107  0.0014
300  0.0178  0.0175  0.0188  0.0003
400  0.0418  0.0401  0.0528  0.0014
500  0.0803  0.0797  0.0818  0.0004

So, what may be happening?

You have a non-optimal installation of ATLAS, and that is why you get such scatter. My numbers were obtained on an Intel i5 CPU @ 1.7 GHz in a laptop. I don't know which machine you have, but I doubt it is almost three times slower than mine. This suggests your ATLAS is not fully optimised.

How can I be sure?

Running numpy.show_config() will tell you which libraries numpy is linked to and where they are. The output looks something like this:

atlas_threads_info:
    libraries = ['lapack', 'ptf77blas', 'ptcblas', 'atlas']
    library_dirs = ['/usr/lib64/atlas-sse3']
    define_macros = [('ATLAS_INFO', '"\\"3.8.4\\""')]
    language = f77
    include_dirs = ['/usr/include']
blas_opt_info:

If this is the case, how do I fix it?

You may have a stale precompiled ATLAS binary (it is a dependency of some packages), or the flags used to compile it were wrong. The smoothest solution is to build the RPMs from source. Here are instructions for CentOS.

Note that OpenBLAS is not (yet) compatible with multiprocessing, so be aware of that limitation. If you rely heavily on linear algebra, MKL is the best option, but it is expensive. Academics can get it for free via Continuum's Anaconda Python distribution, and many universities have a campus-wide licence.
