如何减少Haskell应用程序中的内存使用量？

I am new to functional programming, and now learn Haskell. As an exercise I decided to implement the explicit Euler method for 1D linear diffusion equation. While the code below works correctly, I am not happy about its performance. In fact, I am concerned with memory consumption. I believe that it is related to lazy evaluation, but cannot figure out how I can reduce its memory usage.

我是函数式编程的新手,现在学习Haskell。作为练习,我决定实现一维线性扩散方程的显式欧拉方法。虽然下面的代码工作正常,但我对它的性能并不满意。事实上,我关心的是内存消耗。我相信它与懒惰的评估有关,但无法弄清楚如何减少其内存使用量。

The idea of the algorithm is really simple, to make it clear in imperative terms: it takes an `array', and to every inner point it adds a value, which is calculated as a combination of the values in the point itself and in its neighbors. Boundary points are special cases.

算法的想法非常简单,用命令性的术语表达:它采用“数组”,并且每个内部点都添加一个值,该值是作为点本身和其中的值的组合计算的。邻居。边界点是特殊情况。

So, this is my Euler1D.hs module:

所以,这是我的Euler1D.hs模块:

module Euler1D
( stepEuler
, makeu0
) where

-- impose zero flux condition
zeroflux :: (Floating a) => a -> [a] -> [a]
zeroflux mu (boundary:inner:xs) = [boundary+mu*2*(inner-boundary)]

-- one step of integration
stepEuler :: (Floating a) => a -> [a] -> [a]
stepEuler mu u@(x:xs) = (applyBC . (diffused mu)) u
    where
          diffused mu (left:x:[]) = []    -- ignore outer points
          diffused mu (left:x:right:xs) = -- integrate inner points
                   (x+mu*(left+right-2*x)) : diffused mu (x:right:xs)
          applyBC inner = (lbc u') ++ inner ++ (rbc u') -- boundary conditions
               where u' = [head u] ++ inner ++ [last u]
                     lbc = zeroflux mu             -- left boundary
                     rbc = (zeroflux mu) . reverse -- right boundary

-- initial condition
makeu0 :: Int -> [Double]
makeu0 n = [ ((^2) . sin . (pi*) . xi) x | x <- [0..n]]
    where xi x = fromIntegral x / fromIntegral n

And a simple Main.hs:

还有一个简单的Main.hs:

module Main where

import System ( getArgs )
import Euler1D

main = do
      args <- getArgs
      let n = read $ head args :: Int
      let u0 = makeu0 n
      let un = stepEuler 0.5 u0
      putStrLn $ show $ sum un

For comparison, I also wrote a pure C implementation.

为了比较,我还写了一个纯C实现。

Now, if I try to run Haskell implementation for a sufficiently large size of the array n, I have:

现在,如果我尝试为足够大的数组n运行Haskell实现,我有:

$ time ./eulerhs 200000
100000.00000000112

real    0m3.552s
user    0m3.304s
sys     0m0.128s

For comparison, C version is faster by almost two orders of magnitude:

相比之下,C版本的速度提高了近两个数量级:

$ time ./eulerc 200000
100000

real    0m0.088s
user    0m0.048s
sys     0m0.008s

EDIT: This comparison is not really fair, because Haskell version is compiled with profiling flags, and C is not. If I compile both programs with -O2 and both without profiling flags, I can increase n. In this case time ./eulerhs 1000000 takes 0m2.236s, while time ./eulerc 1000000 takes only 0m0.293s. So the problem still remains with all optimizations and without profiling, it is only offset.

编辑:这种比较并不公平,因为Haskell版本是使用分析标志编译的,而C则不是。如果我用-O2编译这两个程序并且两者都没有分析标志,我可以增加n。在这种情况下,时间./eulerhs 1000000需要0m2.236s,而时间./eulerc 1000000只需要0m0.293s。因此,问题仍然存在于所有优化和没有分析,它只是抵消。

I would like also to note, that memory allocation of the Haskell program seems to grow lineary with n. This is probably OK.

我还要注意,Haskell程序的内存分配似乎与n一致。这可能没关系。

But the worst are memory requirements. My Haskell version requires over 100MB (my estimation of the bare minimum in C is 4MB). I think this may be the source of the problem. According to profiling report the program spends 85% of time in GC, and

但最糟糕的是内存要求。我的Haskell版本需要超过100MB(我估计C中的最小值是4MB)。我认为这可能是问题的根源。根据分析报告,该程序花费85%的时间用于GC,并且

        total time  =        0.36 secs   (18 ticks @ 20 ms)
        total alloc = 116,835,180 bytes  (excludes profiling overheads)

COST CENTRE                    MODULE               %time %alloc

makeu0                         Euler1D               61.1   34.9
stepEuler                      Euler1D               33.3   59.6
CAF:sum                        Main                   5.6    5.5

I was surprized to see that makeu0 is so expensive. I decided that this is due to its lazy evaluation (if its thunks remain in the memory until the end of stepEuler).

我很惊讶看到makeu0太贵了。我之所以认为这是因为它的懒惰评估(如果它的thunk保留在内存中,直到stepEuler结束)。

I tried this change in Main.hs:

我在Main.hs中尝试了这个更改:

   let un = u0 `seq` stepEuler 0.5 u0

but did not notice any difference. I have no idea how to reduce memory usage in stepEuler. So, my questions are:

但没有发现任何差异。我不知道如何减少stepEuler中的内存使用量。所以,我的问题是:

Is there a way in Haskell to build lists / do list comprehensions strictly? In this case there is no benefit to keep it lazy.

在Haskell中有一种方法可以严格地构建列表/列表推导吗?在这种情况下,保持懒惰是没有好处的。

How can I reduce overall memory usage in this case? I suppose, I have to make something strict, but fail to see what. In other words, if I have to put some seqs and bangs, where and why?

在这种情况下,如何减少总体内存使用量?我想,我必须做一些严格的事情,但却没有看到什么。换句话说,如果我必须放一些seqs和刘海,在哪里以及为什么?

And finally, the most important, what is the best strategy to identify such costly constructs?

最后,最重要的是,识别这种昂贵结构的最佳策略是什么?

I did read a chapter on profiling and optimization in Real World Haskell, but it remains unclear how exactly I can decide what should be strict and what not.

我确实阅读了关于Real World Haskell中的分析和优化的章节,但我仍然不清楚我究竟能够确定什么应该是严格的,什么不是。

Please forgive me such a long post.

请原谅我这么长的帖子。

EDIT2: As suggested by A. Rex in comments, I tried running both programs in valgrind. And this is what I observed. For Haskell program (n=200000) it found:

EDIT2:正如A. Rex在评论中所建议的那样,我尝试在valgrind中运行这两个程序。这就是我观察到的。对于Haskell程序(n = 200000),它发现:

malloc/free: 33 allocs, 30 frees, 84,109 bytes allocated. ... checked 55,712,980 bytes.

malloc / free:分配33个allocs,30个frees,84,109个字节。 ...检查了55,712,980字节。

And for C program (after a small fix):

对于C程序(经过小修复):

malloc/free: 2 allocs, 2 frees, 3,200,000 bytes allocated.

malloc / free:分配2个allocs,2个frees,3,200,000个字节。

So, it appears that while Haskell allocates much smaller memory blocks, it does it often, and due to delay in garbage collection, they accumulate and remain in memory. So, I have another question:

因此,看起来虽然Haskell分配了更小的内存块,但它经常这样做,并且由于垃圾收集的延迟,它们会累积并保留在内存中。所以,我有另一个问题:

Is it possible to avoid a lot of small allocations in Haskell? Basically, to declare, that I need to process the whole data structure rather than only its fragments on demand.

是否可以避免Haskell中的大量小分配?基本上,要声明,我需要处理整个数据结构而不是仅按需处理它的片段。

7 个解决方案

#1

Lists are not the best datastructure for this type of code (with lots of (++), and (last)). You lose a lot of time constucting and deconstructing lists. I'd use Data.Sequence or arrays, as in C versions.

列表不是这类代码的最佳数据结构(有很多(++)和(最后))。你失去了大量的时间来构建和解构列表。我在C版本中使用Data.Sequence或数组。
There is no chance for thunks of makeu0 to be garbage-collected, since you need to retain all of them (well, all of the results of "diffuse", to be exact) all the way till the end of computation in order to be able to do "reverse" in applyBC. Which is very expensive thing, considering that you only need two items from the tail of the list for your "zeroflux".

makeu0的thunks没有机会被垃圾收集,因为你需要保留所有这些(好吧,所有“漫反射”的结果,确切地说)直到计算结束才能成为能够在applyBC中做“逆转”。这是非常昂贵的事情,考虑到您只需要列表尾部的两个项目为您的“zeroflux”。

Here is fast hack of you code that tries to achieve better list fusion and does less list (de)constructing:

这是快速破解您的代码,试图实现更好的列表融合,并减少列表(de)构建:

module Euler1D
( stepEuler
) where

-- impose zero flux condition
zeroflux mu (boundary:inner:xs) = boundary+mu*2*(inner-boundary)

-- one step of integration
stepEuler mu n = (applyBC . (diffused mu)) $ makeu0 n
    where
          diffused mu (left:x:[]) = []    -- ignore outer points
          diffused mu (left:x:right:xs) = -- integrate inner points
                   let y = (x+mu*(left+right-2*x))
                       in y `seq` y : diffused mu (x:right:xs)
          applyBC inner = lbc + sum inner + rbc -- boundary conditions
               where
                     lbc = zeroflux mu ((f 0 n):inner)             -- left boundary
                     rbc = zeroflux mu ((f n n):(take 2 $ reverse inner)) -- right boundary

-- initial condition
makeu0 n = [ f x n | x <- [0..n]]

f x n = ((^2) . sin . (pi*) . xi) x
    where xi x = fromIntegral x / fromIntegral n

For 200000 points, it completes in 0.8 seconds vs 3.8 seconds for initial version

对于200000点,它在0.8秒内完成,初始版本为3.8秒

#2

On my 32-bit x86 system, your program uses only about 40 MB of memory.

在我的32位x86系统上,您的程序仅使用大约40 MB的内存。

Are you perhaps confusing the the "total alloc = 116,835,180 bytes" line from your profiling output with how much memory is actually used by the program at any one time? The total alloc is how much memory was allocated over the entire program run; much of this is freed by the garbage collector as you go along. You can expect that number to get very large in a Haskell program; I have programs that allocate many terrabytes of memory over the course of their entire run, though they actually have a maximum virtual memory size of a hundred megabytes or so.

您是否可能会混淆分析输出中的“total alloc = 116,835,180 bytes”行,以及程序在任何时候实际使用了多少内存?总alloc是在整个程序运行中分配的内存量;随着你的进展,垃圾收集器释放了大部分内容。你可以期望这个数字在Haskell程序中变得非常大;我有程序在整个运行过程中分配了许多TB的内存,尽管它们实际上有一个最大的虚拟内存大小为一百兆左右。

I wouldn't worry too much about large total allocations over the course of a program run; that's the nature of a pure language, and GHC's runtime has a very good garbage collector to help compensate for this.

在程序运行过程中,我不会过分担心大量的总分配;这是纯语言的本质,GHC的运行时有一个非常好的垃圾收集器来帮助弥补这一点。

#3

More generally, you can find out where your memory is going using GHC's heap profiling tools. In my experience, they won't necessarily tell you why your data is being leaked, but can at least narrow down the potential causes.

更一般地说,您可以使用GHC的堆分析工具找出内存的位置。根据我的经验,他们不一定会告诉你为什么你的数据被泄露,但至少可以缩小潜在的原因。

You may also find illuminating this excellent blog post by Don Stewart about understanding strictness, how it interacts with garbage collection, and how to diagnose and fix problems.

您还可以找到Don Stewart关于理解严格性,如何与垃圾收集交互以及如何诊断和修复问题的优秀博客文章。

#4

Does forcing "un-laziness" using $! help? as per this answer.

使用$强制“不懒惰”!救命?根据这个答案。

#5

Per Harleqin's request: Have you tried setting optimization flags? For example, with ghc, you can use add "-O2" just like you would with gcc. (Although I'm not sure what optimization levels exist in ghc; the man page doesn't exactly say ...)

Per Harleqin的要求:您是否尝试过设置优化标志?例如,使用ghc,您可以使用添加“-O2”,就像使用gcc一样。 (虽然我不确定ghc中存在哪些优化级别;手册页并没有确切地说......)

In my past experience, setting this flag has made a tremendous difference. As far as I can tell, runhugs and unoptimized ghc use the most basic, obvious implementation of Haskell; unfortunately, this sometimes isn't very efficient.

根据我过去的经验,设置这个标志已经产生了巨大的变化。 runkugs和unoptimized ghc使用Haskell最基本,最明显的实现;不幸的是,这有时效率不高。

But that's just a guess. As I said in my comment, I hope that someone answers your question well. I often have problems analyzing and fixing Haskell's memory usage.

但这只是猜测。正如我在评论中所说,我希望有人能很好地回答你的问题。我经常在分析和修复Haskell的内存使用方面遇到问题。

#6

Use switch -fvia-C also.

也可以使用switch -fvia-C。

#7

One thing that jumped to my eye now is that the Haskell output is a float, while the C output seems to be integer. I have not yet come to grips with Haskell code, but is there perhaps some place where you have floating point arithmetic in Haskell while C uses integers?

现在跳到我眼前的一件事是Haskell输出是一个浮点数,而C输出似乎是整数。我还没有掌握Haskell代码,但是在Cask使用整数时,是否有一些地方你在Haskell中有浮点运算?

#1

Lists are not the best datastructure for this type of code (with lots of (++), and (last)). You lose a lot of time constucting and deconstructing lists. I'd use Data.Sequence or arrays, as in C versions.

列表不是这类代码的最佳数据结构(有很多(++)和(最后))。你失去了大量的时间来构建和解构列表。我在C版本中使用Data.Sequence或数组。
There is no chance for thunks of makeu0 to be garbage-collected, since you need to retain all of them (well, all of the results of "diffuse", to be exact) all the way till the end of computation in order to be able to do "reverse" in applyBC. Which is very expensive thing, considering that you only need two items from the tail of the list for your "zeroflux".

makeu0的thunks没有机会被垃圾收集,因为你需要保留所有这些(好吧,所有“漫反射”的结果,确切地说)直到计算结束才能成为能够在applyBC中做“逆转”。这是非常昂贵的事情,考虑到您只需要列表尾部的两个项目为您的“zeroflux”。