Why is Cython slower than vectorized NumPy?

Date: 2021-09-07 00:34:20

Consider the following Cython code:

cimport cython
cimport numpy as np
import numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def test_memoryview(double[:] a, double[:] b):
    cdef int i
    for i in range(a.shape[0]):
        a[i] += b[i]

@cython.boundscheck(False)
@cython.wraparound(False)
def test_numpy(np.ndarray[double, ndim=1] a, np.ndarray[double, ndim=1] b):
    cdef int i
    for i in range(a.shape[0]):
        a[i] += b[i]

def test_numpyvec(a, b):
    a += b

def gendata(nb=40000000):
    a = np.random.random(nb)
    b = np.random.random(nb)
    return a, b

Running it in the interpreter yields (after a few runs to warm up the cache):

In [14]: %timeit -n 100 test_memoryview(a, b)
100 loops, best of 3: 148 ms per loop

In [15]: %timeit -n 100 test_numpy(a, b)
100 loops, best of 3: 159 ms per loop

In [16]: %timeit -n 100 test_numpyvec(a, b)
100 loops, best of 3: 124 ms per loop

# See answer #1 below:
In [17]: %timeit -n 100 test_raw_pointers(a, b)
100 loops, best of 3: 129 ms per loop

I tried different dataset sizes and consistently found the vectorized NumPy function faster than the compiled Cython code, whereas I expected Cython to be on par with vectorized NumPy in terms of performance.

Did I forget an optimization in my Cython code? Does NumPy use something (BLAS?) to make such simple operations run faster? Can I improve the performance of this code?

Update: The raw pointer version seems to be on par with NumPy, so apparently there is some overhead in using memoryview or NumPy indexing.

3 Answers

#1 (9 votes)

Another option is to use raw pointers (and file-level directives to avoid repeating the @cython decorators):

#cython: wraparound=False
#cython: boundscheck=False
#cython: nonecheck=False

#...

cdef void ctest_raw_pointers(int n, double *a, double *b):
    cdef int i
    for i in range(n):
        a[i] += b[i]

def test_raw_pointers(np.ndarray[double, ndim=1] a, np.ndarray[double, ndim=1] b):
    ctest_raw_pointers(a.shape[0], &a[0], &b[0])
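
One caveat, not spelled out in the answer: &a[0] takes the address of the first element and the loop then assumes unit stride, so the arrays must be non-empty and C-contiguous. If a caller may hold a strided view, it can be normalized first; a quick sketch of the check from the Python side:

```python
import numpy as np

# A strided view shares the buffer but skips elements, so its memory
# layout is not safe to hand to the unit-stride pointer loop as-is:
a = np.random.random(20)[::2]      # every other element: a strided view
assert not a.flags['C_CONTIGUOUS']

# np.ascontiguousarray copies the view into a fresh contiguous buffer,
# which can then be passed to test_raw_pointers safely.
a_c = np.ascontiguousarray(a)
assert a_c.flags['C_CONTIGUOUS']
```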

#2 (3 votes)

On my machine the difference isn't as large, but I can nearly eliminate it by changing the NumPy and memoryview functions like this:

@cython.boundscheck(False)
@cython.wraparound(False)
def test_memoryview(double[:] a, double[:] b):
    cdef int i, n=a.shape[0]
    for i in range(n):
        a[i] += b[i]

@cython.boundscheck(False)
@cython.wraparound(False)
def test_numpy(np.ndarray[double] a, np.ndarray[double] b):
    cdef int i, n=a.shape[0]
    for i in range(n):
        a[i] += b[i]

and then, when I compile the C output from Cython, I use the flags -O3 and -march=native. This seems to indicate that the difference in timings comes from the use of different compiler optimizations.

I use the 64-bit version of MinGW and NumPy 1.8.1. Your results will probably vary depending on your package versions, hardware, platform, and compiler.

If you are using the IPython notebook's Cython magic, you can force a rebuild with the additional compiler flags by replacing %%cython with %%cython -f -c=-O3 -c=-march=native.

If you are using a standard setup.py for your Cython module, you can specify the extra_compile_args argument when creating the Extension object that you pass to distutils.setup.
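
As a sketch of what that looks like (the module and file names here are placeholders, not from the answer):

```python
# setup.py -- hypothetical build script passing extra optimization flags
from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize
import numpy as np

extensions = [
    Extension(
        "cytest",                             # placeholder module name
        ["cytest.pyx"],                       # placeholder source file
        include_dirs=[np.get_include()],      # needed for "cimport numpy"
        extra_compile_args=["-O3", "-march=native"],
    )
]

setup(ext_modules=cythonize(extensions))
```

Built with the usual python setup.py build_ext --inplace.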

Note: I removed the ndim=1 flag when specifying the types for the NumPy arrays because it isn't necessary. That value defaults to 1 anyway.

#3 (1 vote)

A change that slightly increases the speed is to specify the stride:

def test_memoryview_inorder(double[::1] a, double[::1] b):
    cdef int i
    for i in range(a.shape[0]):
        a[i] += b[i]
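
The [::1] declaration promises C-contiguous input, which lets the generated loop use unit-stride indexing; a strided view passed to such a function raises an error at call time. A quick way to check from Python whether an array qualifies (my sketch, not part of the answer):

```python
import numpy as np

a = np.random.random(10)
assert a.flags['C_CONTIGUOUS']      # fine to pass as double[::1]

s = a[::2]                          # strided view of the same buffer
assert not s.flags['C_CONTIGUOUS']  # double[::1] would reject this
```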
