如何提高Haskell数值计算的性能?

I'm in the middle of porting David Blei's original C implementation of Latent Dirichlet Allocation to Haskell, and I'm trying to decide whether to leave some of the low-level stuff in C. The following function is one example—it's an approximation of the second derivative of lgamma:

我正在将David Blei的潜在Dirichlet分配的原始C实现移植到Haskell中，我正在尝试决定是否把一些底层的东西留在C中。

double trigamma(double x)
{
    double p;
    int i;

    x=x+6;
    p=1/(x*x);
    p=(((((0.075757575757576*p-0.033333333333333)*p+0.0238095238095238)
         *p-0.033333333333333)*p+0.166666666666667)*p+1)/x+0.5*p;
    for (i=0; i<6 ;i++)
    {
        x=x-1;
        p=1/(x*x)+p;
    }
    return(p);
}

I've translated this into more or less idiomatic Haskell as follows:

我把它或多或少地翻译成如下的习惯Haskell:

trigamma :: Double -> Double
trigamma x = snd $ last $ take 7 $ iterate next (x' - 1, p')
  where
    x' = x + 6
    p  = 1 / x' ^ 2
    p' = p / 2 + c / x'
    c  = foldr1 (\a b -> (a + b * p)) [1, 1/6, -1/30, 1/42, -1/30, 5/66]
    next (x, p) = (x - 1, 1 / x ^ 2 + p)

The problem is that when I run both through Criterion, my Haskell version is six or seven times slower (I'm compiling with -O2 on GHC 6.12.1). Some similar functions are even worse.

问题是，当我同时运行这两个条件时，Haskell版本的速度要慢六到七倍(我在GHC 6.12.1上使用-O2编译)。有些类似的函数甚至更糟。

I know practically nothing about Haskell performance, and I'm not terribly interested in digging through Core or anything like that, since I can always just call the handful of math-intensive C functions through FFI.

实际上，我对Haskell的性能一无所知，而且我对深入研究Core或类似的东西也没什么兴趣，因为我总是可以通过FFI调用大量的数学密集型C函数。

But I'm curious about whether there's low-hanging fruit that I'm missing—some kind of extension or library or annotation that I could use to speed up this numeric stuff without making it too ugly.

但我很好奇，我是否错过了一些容易挂起的果实——某种扩展、库或注释，我可以用它们来加速这个数字，而不会让它变得太难看。

UPDATE: Here are two better solutions, thanks to Don Stewart and Yitz. I've modified Yitz's answer slightly to use Data.Vector.

更新:这里有两个更好的解决方案，感谢Don Stewart和Yitz。我稍微修改了Yitz的答案，使用Data.Vector。

invSq x = 1 / (x * x)
computeP x = (((((5/66*p-1/30)*p+1/42)*p-1/30)*p+1/6)*p+1)/x+0.5*p
  where p = invSq x

trigamma_d :: Double -> Double
trigamma_d x = go 0 (x + 5) $ computeP $ x + 6
  where
    go :: Int -> Double -> Double -> Double
    go !i !x !p
        | i >= 6    = p
        | otherwise = go (i+1) (x-1) (1 / (x*x) + p)

trigamma_y :: Double -> Double
trigamma_y x = V.foldl' (+) (computeP $ x + 6) $ V.map invSq $ V.enumFromN x 6

The performance of the two seems to be almost exactly the same, with one or the other winning by a percentage point or two depending on the compiler flags.

两者的性能似乎几乎完全相同，其中一个或另一个根据编译器标志获得一个或两个百分点的胜利。

As camccann said over at Reddit, the moral of the story is "For best results, use Don Stewart as your GHC backend code generator." Barring that solution, the safest bet seems to be just to translate the C control structures directly into Haskell, although loop fusion can give similar performance in a more idiomatic style.

正如camccann在Reddit上所说，这个故事的寓意是“为了获得最好的结果，请使用Don Stewart作为您的GHC后端代码生成器。”除了这个解决方案，最安全的方法似乎是直接将C控制结构转换成Haskell，尽管循环融合可以使类似的性能以更惯用的方式进行。

I'll probably end up using the Data.Vector approach in my code.

最后我可能会用到这些数据。向量方法在我的代码。

2 个解决方案

#1

Use the same control and data structures, yielding:

使用相同的控制和数据结构，产生:

{-# LANGUAGE BangPatterns #-}
{-# OPTIONS_GHC -fvia-C -optc-O3 -fexcess-precision -optc-march=native #-}

{-# INLINE trigamma #-}
trigamma :: Double -> Double
trigamma x = go 0 (x' - 1) p'
    where
        x' = x + 6
        p  = 1 / (x' * x')

        p' =(((((0.075757575757576*p-0.033333333333333)*p+0.0238095238095238)
                  *p-0.033333333333333)*p+0.166666666666667)*p+1)/x'+0.5*p

        go :: Int -> Double -> Double -> Double
        go !i !x !p
            | i >= 6    = p
            | otherwise = go (i+1) (x-1) (1 / (x*x) + p)

I don't have your testsuite, but this yields the following asm:

我没有您的testsuite，但是这会产生以下asm:

A_zdwgo_info:
        cmpq    $5, %r14
        jg      .L3
        movsd   .LC0(%rip), %xmm7
        movapd  %xmm5, %xmm8
        movapd  %xmm7, %xmm9
        mulsd   %xmm5, %xmm8
        leaq    1(%r14), %r14
        divsd   %xmm8, %xmm9
        subsd   %xmm7, %xmm5
        addsd   %xmm9, %xmm6
        jmp     A_zdwgo_info

Which looks ok. This is the kind of code the -fllvm backend does a good job.

看起来好。这是一种-fllvm后端的代码。

GCC unrolls the loop though, and the only way to do that is either via Template Haskell or manual unrolling. You might consider that (a TH macro) if doing a lot of this.

不过，GCC打开循环，惟一的方法是通过模板Haskell或手动展开。如果你做了很多这样的事情，你可能会认为这是(第一个宏)。

Actually, the GHC LLVM backend does unroll the loop :-)

实际上，GHC LLVM后端会展开循环:-)

Finally, if you really like the original Haskell version, write it using stream fusion combinators, and GHC will convert it back into loops. (Exercise for the reader).

最后，如果您真的喜欢最初的Haskell版本，那么使用流融合组合器编写它，GHC将把它转换回循环。读者(练习)。

#2

Before the optimization work, I wouldn't say that your original translation is the most idiomatic way to express in Haskell what the C code is doing.

在进行优化工作之前，我不会说您的原始翻译是在Haskell中表示C代码正在做什么的最惯用的方式。

How would the optimization process have proceeded if we started with the following instead:

如果我们以下列方式开始，优化过程将如何进行:

trigamma :: Double -> Double
trigamma x = foldl' (+) p' . map invSq . take 6 . iterate (+ 1) $ x
where
  invSq y = 1 / (y * y)
  x' = x + 6
  p  = invSq x'
  p' =(((((0.075757575757576*p-0.033333333333333)*p+0.0238095238095238)
              *p-0.033333333333333)*p+0.166666666666667)*p+1)/x'+0.5*p

#1