如何提高Haskell程序的性能?

时间:2022-06-05 06:22:31

I'm working through the problems in Project Euler as a way of learning Haskell, and I find that my programs are a lot slower than a comparable C version, even when compiled. What can I do to speed up my Haskell programs?

作为学习Haskell的一种方式,我正在解决Project Euler中的问题,我发现我的程序比可比的C版本慢得多,即使编译后也是如此。我能做什么来加速我的Haskell程序?

For example, my brute-force solution to Problem 14 is:

例如,我对问题14的蛮力解是:

import Data.Int
import Data.Ord
import Data.List

searchTo = 1000000

nextNumber :: Int64 -> Int64
nextNumber n
    | even n    = n `div` 2
    | otherwise = 3 * n + 1

sequenceLength :: Int64 -> Int
sequenceLength 1 = 1
sequenceLength n = 1 + (sequenceLength next)
    where next = nextNumber n

longestSequence = maximumBy (comparing sequenceLength) [1..searchTo]

main = putStrLn $ show $ longestSequence

Which takes around 220 seconds, while an "equivalent" brute-force C version only takes 1.2 seconds.

这大约需要220秒,而“等效”蛮力C版本只需要1.2秒。

#include <stdio.h>

int main(int argc, char **argv)
{
    int longest = 0;
    int terms = 0;
    int i;
    unsigned long j;

    for (i = 1; i <= 1000000; i++)
    {
        j = i;
        int this_terms = 1;

        while (j != 1)
        {
            this_terms++;

            if (this_terms > terms)
            {
                terms = this_terms;
                longest = i;
            }

            if (j % 2 == 0)
                j = j / 2;
            else
                j = 3 * j + 1;
        }
    }

    printf("%d\n", longest);
    return 0;
}

What am I doing wrong? Or am I naive to think that Haskell could even approach C's speed?

我做错了什么?或者我是否天真地认为Haskell甚至可以接近C的速度?

(I'm compiling the C version with gcc -O2, and the Haskell version with ghc --make -O).

(我用gcc -O2编译C版本,用ghc编译Haskell版本——make -O)。

5 个解决方案

#1


24  

For testing purpose I have just set searchTo = 100000. The time taken is 7.34s. A few modification leads to some big improvement:

为了测试目的,我设置了searchTo = 100000。所花的时间是7.34秒。一些修改导致了一些大的改进:

  1. Use an Integer instead of Int64. This improves the time to 1.75s.

    使用整数代替Int64。这将时间提高到1.75秒。

  2. Use an accumulator (you don't need sequenceLength to be lazy right?) 1.54s.

    使用一个累加器(你不需要sequenceLength来表示懒惰,对吧?)

    seqLen2 :: Int -> Integer -> Int
    seqLen2 a 1 = a
    seqLen2 a n = seqLen2 (a+1) (nextNumber n)
    
    sequenceLength :: Integer -> Int
    sequenceLength = seqLen2 1
    
  3. Rewrite the nextNumber using quotRem, thus avoiding computing the division twice (once in even and once in div). 1.27s.

    使用quotRem重写下一个数字,从而避免计算两次除法(一次为偶数,一次为div)。1.27 s。

    nextNumber :: Integer -> Integer
    nextNumber n 
        | r == 0    = q
        | otherwise = 6*q + 4
        where (q,r) = quotRem n 2 
    
  4. Use Schwartzian transform instead of maximumBy. The problem of maximumBy . comparing is that the sequenceLength function is called more than once for each value. 0.32s.

    用Schwartzian变换代替maximumBy。极大值问题。比较的是,对每个值调用sequenceLength函数的次数不止一次。0.32 s。

    longestSequence = snd $ maximum [(sequenceLength a, a) | a <- [1..searchTo]]
    

Note:

注意:

  • I check the time by compiling with ghc -O and run with +RTS -s)
  • 我使用ghc -O编译并使用+RTS -s运行来检查时间)
  • My machine is running on Mac OS X 10.6. The GHC version is 6.12.2. The compiled file is in i386 architecture.)
  • 我的机器在Mac OS X 10.6上运行。GHC版本是6.12.2。编译后的文件在i386架构中。
  • The C problem runs at 0.078s with the corresponding parameter. It is compiled with gcc -O3 -m32.
  • C问题的参数为0.078s。它是用gcc -O3 -m32编译的。

#2


11  

Although this is already rather old, let me chime in, there's one crucial point that hasn't been addressed before.

虽然这已经是相当古老的了,让我插嘴说一下,有一个关键的点以前没有被讨论过。

First, the timings of the different programmes on my box. Since I'm on a 64-bit linux system, they show somewhat different characteristics: using Integer instead of Int64 does not improve performance as it would with a 32-bit GHC, where each Int64 operation would incur the cost of a C-call while the computations with Integers fitting in signed 32-bit integers don't need a foreign call (since only few operations exceed that range here, Integer is the better choice on a 32-bit GHC).

首先,我的盒子上的不同节目的时间。因为我在64位的linux系统上,他们表现出不同的特点:使用整数代替Int64不提高性能,因为它将与一个32位的GHC每个Int64操作会招致的成本c调用而签署的与整数计算拟合32位整数不需要外国电话(因为只有几个操作超过这个范围,在一个32位的整数是更好的选择GHC)。

  • C: 0.3 seconds
  • C:0.3秒
  • Original Haskell: 14.24 seconds, using Integer instead of Int64: 33.96 seconds
  • 原始Haskell: 14.24秒,使用整数代替Int64: 33.96秒
  • KennyTM's improved version: 5.55 seconds, using Int: 1.85 seconds
  • KennyTM的改进版本:5.55秒,使用Int: 1.85秒
  • Chris Kuklewicz's version: 5.73 seconds, using Int: 1.90 seconds
  • Chris Kuklewicz的版本:5.73秒,使用Int: 1.90秒
  • FUZxxl's version: 3.56 seconds, using quotRem instead of divMod: 1.79 seconds
  • FUZxxl版本:3.56秒,用quotRem代替divMod: 1.79秒

So what have we?

所以我们什么?

  1. Calculate the length with an accumulator so the compiler can transform it (basically) into a loop
  2. 使用累加器计算长度,这样编译器就可以(基本上)将其转换为循环
  3. Don't recalculate the sequence lengths for the comparisons
  4. 不要为比较重新计算序列长度
  5. Don't use div resp. divMod when it's not necessary, quot resp. quotRem are much faster
  6. 不要使用div职责。在不必要的时候离开," resp "。quotRem要快得多

What is still missing?

失踪是什么?

if (j % 2 == 0)
    j = j / 2;
else
    j = 3 * j + 1;

Any C compiler I have used transforms the test j % 2 == 0 into a bit-masking and doesn't use a division instruction. GHC does not (yet) do that. So testing even n or computing n `quotRem` 2 is quite an expensive operation. Replacing nextNumber in KennyTM's Integer version with

我使用的任何C编译器都将测试j % 2 == 0转换为位屏蔽,并且不使用除法指令。GHC还没有这么做。所以即使是测试n或计算n ' quotRem ' 2都是非常昂贵的操作。在KennyTM的整数版本中替换nextNumber。

nextNumber :: Integer -> Integer
nextNumber n
    | fromInteger n .&. 1 == (0 :: Int) = n `quot` 2
    | otherwise = 3*n+1

reduces its running time to 3.25 seconds (Note: for Integer, n `quot` 2 is faster than n `shiftR` 1, that takes 12.69 seconds!).

将其运行时间减少到3.25秒(注意:对于整数,n ' ' ' 2比n ' shiftR ' 1快,这需要12.69秒!)

Doing the same in the Int version reduces its running time to 0.41 seconds. For Ints, the bit-shift for division by 2 is a bit faster than the quot operation, reducing its running time to 0.39 seconds.

在Int版本中执行相同操作将运行时间减少到0.41秒。对于Ints,除2的位移比“操作”快一点,将其运行时间减少到0.39秒。

Eliminating the construction of the list (that doesn't appear in the C version either),

消除列表的构造(在C版本中也没有出现),

module Main (main) where

import Data.Bits

result :: Int
result = findMax 0 0 1

findMax :: Int -> Int -> Int -> Int
findMax start len can
    | can > 1000000 = start
    | canlen > len = findMax can canlen (can+1)
    | otherwise = findMax start len (can+1)
      where
        canlen = findLen 1 can

findLen :: Int -> Int -> Int
findLen l 1 = l
findLen l n
    | n .&. 1 == 0  = findLen (l+1) (n `shiftR` 1)
    | otherwise     = findLen (l+1) (3*n+1)

main :: IO ()
main = print result

yields a further small speedup, resulting in a running time of 0.37 seconds.

产生进一步的小加速,导致运行时间为0.37秒。

So the Haskell version that's in close correspondence to the C version doesn't take that much longer, it's a factor of ~1.3.

所以Haskell版本和C版本很接近,不会花太长时间,它的因数是1。3。

Well, let's be fair, there's an inefficiency in the C version that's not present in the Haskell versions,

让我们公平地说,C版本有一个低效率在Haskell版本中没有,

if (this_terms > terms)
{
    terms = this_terms;
    longest = i;
}

appearing in the inner loop. Lifting that out of the inner loop in the C version reduces its running time to 0.27 seconds, making the factor ~1.4.

出现在内部循环中。在C版本中从内部循环中提升这个参数,可以将运行时间减少到0.27秒,使系数变为1.4。

#3


5  

The comparing may be recomputing sequenceLength too much. This is my best version:

比较可能是重新计算序列太多了。这是我最好的版本:

type I = Integer
data P = P {-# UNPACK #-} !Int {-# UNPACK #-} !I deriving (Eq,Ord,Show)

searchTo = 1000000

nextNumber :: I -> I
nextNumber n = case quotRem n 2 of
                  (n2,0) -> n2
                  _ -> 3*n+1

sequenceLength :: I -> Int
sequenceLength x = count x 1 where
  count 1 acc = acc
  count n acc = count (nextNumber n) (succ acc)

longestSequence = maximum . map (\i -> P (sequenceLength i) i) $ [1..searchTo]

main = putStrLn $ show $ longestSequence

The answer and timing are slower than C, but it does use arbitrary precision integers (through the Integer type):

答案和时间都比C慢,但它确实使用了任意精度整数(通过整数类型):

ghc -O2 --make euler14-fgij.hs
time ./euler14-fgij
P 525 837799

real 0m3.235s
user 0m3.184s
sys  0m0.015s

#4


4  

Haskell's lists are heap-based, whereas your C code is exceedingly tight and makes no heap use at all. You need to refactor to remove the dependency on lists.

Haskell的列表是基于堆的,而您的C代码非常紧凑,根本不使用堆。您需要重构以删除对列表的依赖关系。

#5


4  

Even if I'm a bit late, here is mine, I removed the dependency on lists and this solution uses no heap at all too.

即使我有点晚了,这是我的,我删除了对列表的依赖,这个解决方案也不使用堆。

{-# LANGUAGE BangPatterns #-}
-- Compiled with ghc -O2 -fvia-C -optc-O3 -Wall euler.hs
module Main (main) where

searchTo :: Int
searchTo = 1000000

nextNumber :: Int -> Int
nextNumber n = case n `divMod` 2 of
   (k,0) -> k
   _     -> 3*n + 1

sequenceLength :: Int -> Int
sequenceLength n = sl 1 n where
  sl k 1 = k
  sl k x = sl (k + 1) (nextNumber x)

longestSequence :: Int
longestSequence = testValues 1 0 0 where
  testValues number !longest !longestNum
    | number > searchTo     = longestNum
    | otherwise            = testValues (number + 1) longest' longestNum' where
    nlength  = sequenceLength number
    (longest',longestNum') = if nlength > longest
      then (nlength,number)
      else (longest,longestNum)

main :: IO ()
main = print longestSequence

I compiled this piece with ghc -O2 -fvia-C -optc-O3 -Wall euler.hs and it runs in 5 secs, compared to 80 of the beginning implementation. It doesn't uses Integer, but because I'm on a 64-bit machine, the results may be cheated.

我用ghc -O2 - fva - c -optc-O3 -Wall euler编写了这篇文章。hs和它在5秒内运行,相比之下,最初的实现是80秒。它不使用整数,但是因为我在64位机器上,结果可能会被欺骗。

The compiler can unbox all Ints in this case, resulting in really fast code. It runs faster than all other solutions I've seen so far, but C is still faster.

在这种情况下,编译器可以打开所有的Ints,从而产生非常快的代码。它比我到目前为止看到的所有解运行得都快,但是C还是更快。

#1


24  

For testing purpose I have just set searchTo = 100000. The time taken is 7.34s. A few modification leads to some big improvement:

为了测试目的,我设置了searchTo = 100000。所花的时间是7.34秒。一些修改导致了一些大的改进:

  1. Use an Integer instead of Int64. This improves the time to 1.75s.

    使用整数代替Int64。这将时间提高到1.75秒。

  2. Use an accumulator (you don't need sequenceLength to be lazy right?) 1.54s.

    使用一个累加器(你不需要sequenceLength来表示懒惰,对吧?)

    seqLen2 :: Int -> Integer -> Int
    seqLen2 a 1 = a
    seqLen2 a n = seqLen2 (a+1) (nextNumber n)
    
    sequenceLength :: Integer -> Int
    sequenceLength = seqLen2 1
    
  3. Rewrite the nextNumber using quotRem, thus avoiding computing the division twice (once in even and once in div). 1.27s.

    使用quotRem重写下一个数字,从而避免计算两次除法(一次为偶数,一次为div)。1.27 s。

    nextNumber :: Integer -> Integer
    nextNumber n 
        | r == 0    = q
        | otherwise = 6*q + 4
        where (q,r) = quotRem n 2 
    
  4. Use Schwartzian transform instead of maximumBy. The problem of maximumBy . comparing is that the sequenceLength function is called more than once for each value. 0.32s.

    用Schwartzian变换代替maximumBy。极大值问题。比较的是,对每个值调用sequenceLength函数的次数不止一次。0.32 s。

    longestSequence = snd $ maximum [(sequenceLength a, a) | a <- [1..searchTo]]
    

Note:

注意:

  • I check the time by compiling with ghc -O and run with +RTS -s)
  • 我使用ghc -O编译并使用+RTS -s运行来检查时间)
  • My machine is running on Mac OS X 10.6. The GHC version is 6.12.2. The compiled file is in i386 architecture.)
  • 我的机器在Mac OS X 10.6上运行。GHC版本是6.12.2。编译后的文件在i386架构中。
  • The C problem runs at 0.078s with the corresponding parameter. It is compiled with gcc -O3 -m32.
  • C问题的参数为0.078s。它是用gcc -O3 -m32编译的。

#2


11  

Although this is already rather old, let me chime in, there's one crucial point that hasn't been addressed before.

虽然这已经是相当古老的了,让我插嘴说一下,有一个关键的点以前没有被讨论过。

First, the timings of the different programmes on my box. Since I'm on a 64-bit linux system, they show somewhat different characteristics: using Integer instead of Int64 does not improve performance as it would with a 32-bit GHC, where each Int64 operation would incur the cost of a C-call while the computations with Integers fitting in signed 32-bit integers don't need a foreign call (since only few operations exceed that range here, Integer is the better choice on a 32-bit GHC).

首先,我的盒子上的不同节目的时间。因为我在64位的linux系统上,他们表现出不同的特点:使用整数代替Int64不提高性能,因为它将与一个32位的GHC每个Int64操作会招致的成本c调用而签署的与整数计算拟合32位整数不需要外国电话(因为只有几个操作超过这个范围,在一个32位的整数是更好的选择GHC)。

  • C: 0.3 seconds
  • C:0.3秒
  • Original Haskell: 14.24 seconds, using Integer instead of Int64: 33.96 seconds
  • 原始Haskell: 14.24秒,使用整数代替Int64: 33.96秒
  • KennyTM's improved version: 5.55 seconds, using Int: 1.85 seconds
  • KennyTM的改进版本:5.55秒,使用Int: 1.85秒
  • Chris Kuklewicz's version: 5.73 seconds, using Int: 1.90 seconds
  • Chris Kuklewicz的版本:5.73秒,使用Int: 1.90秒
  • FUZxxl's version: 3.56 seconds, using quotRem instead of divMod: 1.79 seconds
  • FUZxxl版本:3.56秒,用quotRem代替divMod: 1.79秒

So what have we?

所以我们什么?

  1. Calculate the length with an accumulator so the compiler can transform it (basically) into a loop
  2. 使用累加器计算长度,这样编译器就可以(基本上)将其转换为循环
  3. Don't recalculate the sequence lengths for the comparisons
  4. 不要为比较重新计算序列长度
  5. Don't use div resp. divMod when it's not necessary, quot resp. quotRem are much faster
  6. 不要使用div职责。在不必要的时候离开," resp "。quotRem要快得多

What is still missing?

失踪是什么?

if (j % 2 == 0)
    j = j / 2;
else
    j = 3 * j + 1;

Any C compiler I have used transforms the test j % 2 == 0 into a bit-masking and doesn't use a division instruction. GHC does not (yet) do that. So testing even n or computing n `quotRem` 2 is quite an expensive operation. Replacing nextNumber in KennyTM's Integer version with

我使用的任何C编译器都将测试j % 2 == 0转换为位屏蔽,并且不使用除法指令。GHC还没有这么做。所以即使是测试n或计算n ' quotRem ' 2都是非常昂贵的操作。在KennyTM的整数版本中替换nextNumber。

nextNumber :: Integer -> Integer
nextNumber n
    | fromInteger n .&. 1 == (0 :: Int) = n `quot` 2
    | otherwise = 3*n+1

reduces its running time to 3.25 seconds (Note: for Integer, n `quot` 2 is faster than n `shiftR` 1, that takes 12.69 seconds!).

将其运行时间减少到3.25秒(注意:对于整数,n ' ' ' 2比n ' shiftR ' 1快,这需要12.69秒!)

Doing the same in the Int version reduces its running time to 0.41 seconds. For Ints, the bit-shift for division by 2 is a bit faster than the quot operation, reducing its running time to 0.39 seconds.

在Int版本中执行相同操作将运行时间减少到0.41秒。对于Ints,除2的位移比“操作”快一点,将其运行时间减少到0.39秒。

Eliminating the construction of the list (that doesn't appear in the C version either),

消除列表的构造(在C版本中也没有出现),

module Main (main) where

import Data.Bits

result :: Int
result = findMax 0 0 1

findMax :: Int -> Int -> Int -> Int
findMax start len can
    | can > 1000000 = start
    | canlen > len = findMax can canlen (can+1)
    | otherwise = findMax start len (can+1)
      where
        canlen = findLen 1 can

findLen :: Int -> Int -> Int
findLen l 1 = l
findLen l n
    | n .&. 1 == 0  = findLen (l+1) (n `shiftR` 1)
    | otherwise     = findLen (l+1) (3*n+1)

main :: IO ()
main = print result

yields a further small speedup, resulting in a running time of 0.37 seconds.

产生进一步的小加速,导致运行时间为0.37秒。

So the Haskell version that's in close correspondence to the C version doesn't take that much longer, it's a factor of ~1.3.

所以Haskell版本和C版本很接近,不会花太长时间,它的因数是1。3。

Well, let's be fair, there's an inefficiency in the C version that's not present in the Haskell versions,

让我们公平地说,C版本有一个低效率在Haskell版本中没有,

if (this_terms > terms)
{
    terms = this_terms;
    longest = i;
}

appearing in the inner loop. Lifting that out of the inner loop in the C version reduces its running time to 0.27 seconds, making the factor ~1.4.

出现在内部循环中。在C版本中从内部循环中提升这个参数,可以将运行时间减少到0.27秒,使系数变为1.4。

#3


5  

The comparing may be recomputing sequenceLength too much. This is my best version:

比较可能是重新计算序列太多了。这是我最好的版本:

type I = Integer
data P = P {-# UNPACK #-} !Int {-# UNPACK #-} !I deriving (Eq,Ord,Show)

searchTo = 1000000

nextNumber :: I -> I
nextNumber n = case quotRem n 2 of
                  (n2,0) -> n2
                  _ -> 3*n+1

sequenceLength :: I -> Int
sequenceLength x = count x 1 where
  count 1 acc = acc
  count n acc = count (nextNumber n) (succ acc)

longestSequence = maximum . map (\i -> P (sequenceLength i) i) $ [1..searchTo]

main = putStrLn $ show $ longestSequence

The answer and timing are slower than C, but it does use arbitrary precision integers (through the Integer type):

答案和时间都比C慢,但它确实使用了任意精度整数(通过整数类型):

ghc -O2 --make euler14-fgij.hs
time ./euler14-fgij
P 525 837799

real 0m3.235s
user 0m3.184s
sys  0m0.015s

#4


4  

Haskell's lists are heap-based, whereas your C code is exceedingly tight and makes no heap use at all. You need to refactor to remove the dependency on lists.

Haskell的列表是基于堆的,而您的C代码非常紧凑,根本不使用堆。您需要重构以删除对列表的依赖关系。

#5


4  

Even if I'm a bit late, here is mine, I removed the dependency on lists and this solution uses no heap at all too.

即使我有点晚了,这是我的,我删除了对列表的依赖,这个解决方案也不使用堆。

{-# LANGUAGE BangPatterns #-}
-- Compiled with ghc -O2 -fvia-C -optc-O3 -Wall euler.hs
module Main (main) where

searchTo :: Int
searchTo = 1000000

nextNumber :: Int -> Int
nextNumber n = case n `divMod` 2 of
   (k,0) -> k
   _     -> 3*n + 1

sequenceLength :: Int -> Int
sequenceLength n = sl 1 n where
  sl k 1 = k
  sl k x = sl (k + 1) (nextNumber x)

longestSequence :: Int
longestSequence = testValues 1 0 0 where
  testValues number !longest !longestNum
    | number > searchTo     = longestNum
    | otherwise            = testValues (number + 1) longest' longestNum' where
    nlength  = sequenceLength number
    (longest',longestNum') = if nlength > longest
      then (nlength,number)
      else (longest,longestNum)

main :: IO ()
main = print longestSequence

I compiled this piece with ghc -O2 -fvia-C -optc-O3 -Wall euler.hs and it runs in 5 secs, compared to 80 of the beginning implementation. It doesn't uses Integer, but because I'm on a 64-bit machine, the results may be cheated.

我用ghc -O2 - fva - c -optc-O3 -Wall euler编写了这篇文章。hs和它在5秒内运行,相比之下,最初的实现是80秒。它不使用整数,但是因为我在64位机器上,结果可能会被欺骗。

The compiler can unbox all Ints in this case, resulting in really fast code. It runs faster than all other solutions I've seen so far, but C is still faster.

在这种情况下,编译器可以打开所有的Ints,从而产生非常快的代码。它比我到目前为止看到的所有解运行得都快,但是C还是更快。