
Time: 2022-01-08 08:59:49

I have to find a memory leak in a Java application. I have some experience with this but would like advice on a methodology/strategy for this. Any reference and advice is welcome.


About our situation:


  1. Heap dumps are larger than 1 GB.

  2. We have heap dumps from 5 occasions.

  3. We don't have any test case to provoke this. It only happens in the (massive) system test environment after at least a week's usage.

  4. The system is built on an internally developed legacy framework with so many design flaws that it is impossible to count them all.

  5. Nobody understands the framework in depth. It has been transferred to one guy in India who barely keeps up with answering e-mails.

  6. We have done snapshot heap dumps over time and concluded that there is not a single component increasing over time. It is everything that grows slowly.

  7. The above points us in the direction that it is the framework's homegrown ORM system that increases its usage without limits. (This system maps objects to files?! So not really an ORM.)

Question: What is the methodology that helped you succeed with hunting down leaks in an enterprise-scale application?


7 solutions



It's almost impossible without some understanding of the underlying code. If you understand the underlying code, then you can better sort the wheat from chaff of the zillion bits of information you are getting in your heap dumps.


Also, you can't know if something is a leak or not without knowing why the class is there in the first place.


I just spent the past couple of weeks doing exactly this, and I used an iterative process.


First, I found the heap profilers basically useless. They can't analyze the enormous heaps efficiently.


Rather, I relied almost solely on jmap histograms.


I imagine you're familiar with these, but for those not:


jmap -histo:live <pid> > dump.out

creates a histogram of the live heap. In a nutshell, it tells you the class names, and how many instances of each class are in the heap.


I was dumping out heap regularly, every 5 minutes, 24hrs a day. That may well be too granular for you, but the gist is the same.
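A regular sampling loop like the one described could be sketched as follows. This is only an illustration, not the script the answer's author used: the function names, the output file naming, and the 300-second default are assumptions; only the jmap invocation comes from the text above.

```python
import subprocess
import time
from datetime import datetime
from pathlib import Path


def histogram_command(pid):
    """Build the jmap call quoted above for one JVM process id."""
    return ["jmap", "-histo:live", str(pid)]


def snapshot_path(out_dir, when):
    """Name each histogram after its capture time so files sort chronologically."""
    return Path(out_dir) / f"histo.{when:%Y%m%d-%H%M%S}.txt"


def collect(pid, out_dir, interval_seconds=300, samples=None):
    """Write one histogram per interval; samples=None keeps going until interrupted."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    taken = 0
    while samples is None or taken < samples:
        target = snapshot_path(out_dir, datetime.now())
        with open(target, "w") as out:
            # check=True raises if jmap fails (for example, the JVM has exited).
            subprocess.run(histogram_command(pid), stdout=out, check=True)
        taken += 1
        if samples is None or taken < samples:
            time.sleep(interval_seconds)
```

Calling collect(12345, "histograms") would sample a JVM with the hypothetical pid 12345 every five minutes. Keeping the capture time in the file name means the later analyses do not have to rely on file modification times.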


I ran several different analyses on this data.


I wrote a script to take two histograms, and dump out the difference between them. So, if java.lang.String was 10 in the first dump, and 15 in the second, my script would spit out "5 java.lang.String", telling me it went up by 5. If it had gone down, the number would be negative.
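The original diff script is not shown, so here is a minimal Python sketch of the same idea. The parsing pattern assumes the usual jmap row layout (rank, instance count, byte count, class name); parse_histogram and diff_histograms are hypothetical names.

```python
import re

# One histogram row looks like: "   3:       4615599      110774376  java.lang.String"
ROW = re.compile(r"^\s*\d+:\s+(\d+)\s+(\d+)\s+(\S+)")


def parse_histogram(text):
    """Return {class name: instance count} for every data row in a jmap histogram."""
    counts = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:  # header, total and blank lines do not match and are skipped
            instances, _size, cls = match.groups()
            counts[cls] = int(instances)
    return counts


def diff_histograms(before, after):
    """Return {class name: change in instance count}, omitting unchanged classes."""
    classes = set(before) | set(after)
    deltas = {c: after.get(c, 0) - before.get(c, 0) for c in classes}
    return {c: d for c, d in deltas.items() if d != 0}


def format_diff(deltas):
    """Render lines such as '5 java.lang.String', largest growth first."""
    ordered = sorted(deltas.items(), key=lambda item: item[1], reverse=True)
    return [f"{delta} {cls}" for cls, delta in ordered]
```

A class that is missing from one of the two histograms is treated as having zero instances there, so newly appearing classes show up as growth.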


I would then take several of these differences, strip out all classes that went down from run to run, and take a union of the result. At the end, I'd have a list of classes that continually grew over a specific time span. Obviously, these are prime candidates for leaking classes.
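One way to read that filtering step is sketched below. It is an interpretation rather than the author's script: a class is kept only if its instance count never dropped between consecutive snapshots and ended higher than it started. The function name and the input shape (a chronological list of {class: instances} dictionaries) are assumptions.

```python
def continually_growing(snapshots):
    """Return {class: total growth} for classes that never shrank and grew overall.

    snapshots is a chronological list of {class name: instance count} dicts.
    """
    if len(snapshots) < 2:
        return {}
    classes = set().union(*snapshots)  # every class seen in any snapshot
    growing = {}
    for cls in classes:
        series = [snap.get(cls, 0) for snap in snapshots]
        never_fell = all(later >= earlier for earlier, later in zip(series, series[1:]))
        if never_fell and series[-1] > series[0]:
            growing[cls] = series[-1] - series[0]
    return growing
```

For example, given the counts A: 1, 2, 4 and B: 5, 4, 9 over three snapshots, only A is reported, because B dropped once even though it ended higher.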


However, some classes have some instances retained while others are GC'd. Their counts can easily go up and down overall, yet still leak, so they can fall out of the "always rising" category of classes.


To find these, I converted the data into a time series and loaded it into a database, Postgres specifically. Postgres is handy because it offers statistical aggregate functions, so you can do simple linear regression analysis on the data and find classes that trend up, even if they aren't always at the top of the charts. I used the regr_slope function, looking for classes with a positive slope.
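For readers without a database at hand, the same least-squares slope can be computed directly. The sketch below is a Python stand-in for the regression idea, not the Postgres query itself; the function names and the input shape (a list of (seconds, count) pairs per class) are assumptions.

```python
def slope(points):
    """Least-squares slope of y over x for (x, y) pairs, or None if undefined."""
    n = len(points)
    if n < 2:
        return None
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:  # all samples share one timestamp, so no trend can be fitted
        return None
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


def trending_up(series_by_class):
    """Return (class, slope) pairs with a positive slope, steepest first."""
    slopes = {cls: slope(points) for cls, points in series_by_class.items()}
    rising = [(cls, s) for cls, s in slopes.items() if s is not None and s > 0]
    return sorted(rising, key=lambda item: item[1], reverse=True)
```

A sawtooth series such as 10, 30, 20, 40 still gets a positive slope here, which is exactly the kind of class the "always rising" filter would miss.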


I found this process very successful, and really efficient. The histogram files aren't insanely large, and it was easy to download them from the hosts. They weren't super expensive to run on the production system (they do force a large GC, and may block the VM for a bit). I was running this on a system with a 2G Java heap.

Now, all this can do is identify potentially leaking classes.


This is where understanding how the classes are used, and whether they should or should not be there, comes into play.


For example, you may find that you have a lot of Map.Entry classes, or some other system class.


Unless you're simply caching String, the fact is these system classes, while perhaps the "offenders", are not the "problem". If you're caching some application class, THAT class is a better indicator of where your problem lies. If you don't cache com.app.yourbean, then you won't have the associated Map.Entry tied to it.


Once you have some classes, you can start crawling the code base looking for instances and references. Since you have your own ORM layer (for good or ill), you can at least readily look at its source code. If your ORM is caching stuff, it's likely caching ORM classes wrapping your application classes.


Finally, once you know the classes, you can start up a local instance of the server with a much smaller heap and a smaller dataset, and use one of the profilers against that.


In this case, you can run a unit test that affects only one (or a small number) of the things you think may be leaking. For example, you could start up the server, run a histogram, perform a single action, and run the histogram again. Your leaking class should have increased by 1 (or whatever your unit of work is).


A profiler may be able to help you track the owners of that "now leaked" class.


But, in the end, you're going to have to have some understanding of your code base to better understand what's a leak, and what's not, and why an object exists in the heap at all, much less why it may be being retained as a leak in your heap.




Take a look at Eclipse Memory Analyzer. It's a great tool (and self-contained; it does not require Eclipse itself to be installed) which 1) can open very large heaps very fast and 2) has some pretty good automatic detection tools. The latter isn't perfect, but it provides a lot of really nice ways to navigate through and query the objects in the dump to find any possible leaks.


I've used it in the past to help hunt down suspicious leaks.




This answer expands upon @Will-Hartung's. I applied the same process to diagnose one of my memory leaks and thought that sharing the details would save other people time.


The idea is to have postgres 'plot' time vs. memory usage of each class, draw a line that summarizes the growth and identify the objects that are growing the fastest:


s   |  Legend:
i   |  *  - data point
z   |  -- - trend
e   |
(   |
b   |                 *
y   |                     --
t   |                  --
e   |             * --    *
s   |           --
)   |       *--      *
    |     --    *
    |  -- *

Convert your heap dumps (you need several) into a format that is convenient for Postgres to consume. Start from the heap histogram format:


 num     #instances         #bytes  class name 
   1:       4632416      392305928  [C
   2:       6509258      208296256  java.util.HashMap$Node
   3:       4615599      110774376  java.lang.String
   5:         16856       68812488  [B
   6:        278914       67329632  [Ljava.util.HashMap$Node;
   7:       1297968       62302464  

Convert it to a CSV file with the datetime of each heap dump:


2016.09.20 17:33:40,[C,4632416,392305928
2016.09.20 17:33:40,java.util.HashMap$Node,6509258,208296256
2016.09.20 17:33:40,java.lang.String,4615599,110774376
2016.09.20 17:33:40,[B,16856,68812488

Using this script:


#!/usr/bin/perl
# Example invocation: convert.heap.hist.to.csv.pl -f heap.2016. -dt "2016.09.20 17:33:40"  >> heap.csv
use strict;
use warnings;
use Getopt::Long;

my $file;
my $dt;
GetOptions(
    "f=s"  => \$file,
    "dt=s" => \$dt
) or die "Error in command line arguments\n";
open my $fh, '<', $file or die $!;

while (my $line = <$fh>) {
    $line =~ s/\R//g;    # remove newlines
    # Matches rows such as:    1:       4442084      369475664  [C
    # Object-array names that end in ';' do not match this pattern and are skipped.
    my ($instances, $size, $class) = ($line =~ /^\s*\d+:\s+(\d+)\s+(\d+)\s+([\$\[\w\.]+)\s*$/);
    if ($instances) {
        print "$dt,$class,$instances,$size\n";
    }
}
close $fh;


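Create a table to put the data in:


CREATE TABLE heap_histogram (
    histwhen timestamp without time zone NOT NULL,
    class character varying NOT NULL,
    instances integer NOT NULL,
    -- bigint, because the bytes held by one class can exceed the integer limit (about 2 GB) on large heaps
    bytes bigint NOT NULL
);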

Copy the data into your new table:


\COPY heap_histogram FROM 'heap.csv'  WITH DELIMITER ',' CSV ;

Run the slope query against size (number of bytes):

SELECT class, REGR_SLOPE(bytes,extract(epoch from histwhen)) as slope
    FROM public.heap_histogram
    GROUP BY class
    HAVING REGR_SLOPE(bytes,extract(epoch from histwhen)) > 0
    ORDER BY slope DESC

Interpret the results:


         class             |        slope         
 java.util.ArrayList       |     71.7993806279174
 java.util.HashMap         |     49.0324576155785
 java.lang.String          |     31.7770770326123
 joe.schmoe.BusinessObject |     23.2036817108056
 java.lang.ThreadLocal     |     20.9013528767851

The slope is bytes added per second (since the unit of epoch is in seconds). If you use instances instead of size, then that's the number of instances added per second.
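To make the unit concrete, a per-second slope can be scaled to a daily figure. The sketch below is plain arithmetic on the joe.schmoe.BusinessObject slope from the table above; the function names and the headroom figure in the usage note are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400


def bytes_per_day(slope_bytes_per_second):
    """Scale a regression slope in bytes per second to bytes per day."""
    return slope_bytes_per_second * SECONDS_PER_DAY


def days_to_consume(headroom_bytes, slope_bytes_per_second):
    """Days until a class growing at this slope uses up the given headroom."""
    if slope_bytes_per_second <= 0:
        raise ValueError("slope must be positive")
    return headroom_bytes / bytes_per_day(slope_bytes_per_second)


# Slope reported above for joe.schmoe.BusinessObject, in bytes per second.
business_object_slope = 23.2036817108056
print(f"{bytes_per_day(business_object_slope) / 1e6:.2f} MB per day")
```

At roughly 2 MB per day for that one class, a hypothetical 100 MB of free heap would last about 50 days if nothing else grew. In practice several classes grow together, so the slopes of the related classes need to be considered as a group.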


One of the lines of code creating this joe.schmoe.BusinessObject was responsible for the memory leak. It was creating the object and appending it to an array without checking whether it already existed. The other objects were created along with the BusinessObject near the leaking code.




Can you accelerate time? That is, can you write a dummy test client that forces it to do a week's worth of calls/requests in a few minutes or hours? Such clients are your best friend, and if you don't have one, write one.

We used NetBeans a while ago to analyse heap dumps. It can be a bit slow but it was effective. Eclipse just crashed and the 32-bit Windows tools did as well.

If you have access to a 64-bit system or a Linux system with 3 GB or more you will find it easier to analyse the heap dumps.


Do you have access to change logs and incident reports? Large scale enterprises will normally have change management and incident management teams and this may be useful in tracking down when problems started happening.


When did it start going wrong? Talk to people and try and get some history. You may get someone saying, "Yeah, it was after they fixed XYZ in patch 6.43 that we got weird stuff happening".




I've had success with IBM Heap Analyzer. It offers several views of the heap, including largest drop-off in object size, most frequently occurring objects, and objects sorted by size.




If it's happening after a week's usage, and your application is as byzantine as you describe, perhaps you're better off restarting it every week?


I know it's not fixing the problem, but it may be a time-effective solution. Are there time windows when you can have outages? Can you load balance and fail over one instance whilst keeping the second up? Perhaps you can trigger a restart when memory consumption breaches a certain limit (perhaps monitoring via JMX or similar).




I've used jhat. It is a bit harsh, but it depends on the kind of framework you have.



