Understanding the Linux oom-killer's logs

Date: 2021-05-24 20:59:04

My app was killed by the oom-killer. It is Ubuntu 11.10 running from a live USB with no swap, and the PC has 1 GB of RAM. The only app running (other than all the built-in Ubuntu stuff) is my program flasherav. Note that /tmp is memory-mapped and at the time of the crash held about 200MB of files (so it was taking up ~200MB of RAM).

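A quick way to check how much RAM the /tmp filesystem is holding (the mount point may differ on other live images; /tmp is what I have):

df -h /tmp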

I'm trying to understand how to analyze the oom-killer log so that I can see exactly where all the memory is being used, i.e. what the different chunks are that add up to ~1 GB and caused the oom-killer to kick in. Once I understand that, I can work on reducing the offender's usage so the app will run on a machine with 1 GB of RAM. My specific questions are:

To try to analyze the situation, I summed up the "total_vm" column and got only 609342KB (which, when added to the 200MB in /tmp, is still only 809MB). Maybe I'm wrong about what the "total_vm" column is: does it include allocated-but-unused memory plus shared memory? If so, shouldn't it far overstate the memory actually used (and therefore I shouldn't be out of memory)? Are there other chunks of memory in use that aren't accounted for in the list below?

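In case it helps, this is roughly how I extracted the sums (a quick sketch over the saved report; it assumes the log below is in oom.log, and counts fields from the end of each process-table line so the bracketed pid column doesn't throw the field positions off):

awk '/\] \[ *[0-9]+\]/ { vm += $(NF-5); rss += $(NF-4) }
     END { printf "total_vm: %d pages (%d kB), rss: %d pages (%d kB)\n", vm, vm*4, rss, rss*4 }' oom.log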

[11686.040460] flasherav invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0, oom_score_adj=0
[11686.040467] flasherav cpuset=/ mems_allowed=0
[11686.040472] Pid: 2859, comm: flasherav Not tainted 3.0.0-12-generic #20-Ubuntu
[11686.040476] Call Trace:
[11686.040488]  [<c10e1c15>] dump_header.isra.7+0x85/0xc0
[11686.040493]  [<c10e1e6c>] oom_kill_process+0x5c/0x80
[11686.040498]  [<c10e225f>] out_of_memory+0xbf/0x1d0
[11686.040503]  [<c10e6123>] __alloc_pages_nodemask+0x6c3/0x6e0
[11686.040509]  [<c10e78d3>] ? __do_page_cache_readahead+0xe3/0x170
[11686.040514]  [<c10e0fc8>] filemap_fault+0x218/0x390
[11686.040519]  [<c1001c24>] ? __switch_to+0x94/0x1a0
[11686.040525]  [<c10fb5ee>] __do_fault+0x3e/0x4b0
[11686.040530]  [<c1069971>] ? enqueue_hrtimer+0x21/0x80
[11686.040535]  [<c10fec2c>] handle_pte_fault+0xec/0x220
[11686.040540]  [<c10fee68>] handle_mm_fault+0x108/0x210
[11686.040546]  [<c152fa00>] ? vmalloc_fault+0xee/0xee
[11686.040551]  [<c152fb5b>] do_page_fault+0x15b/0x4a0
[11686.040555]  [<c1069a90>] ? update_rmtp+0x80/0x80
[11686.040560]  [<c106a7b6>] ? hrtimer_start_range_ns+0x26/0x30
[11686.040565]  [<c106aeaf>] ? sys_nanosleep+0x4f/0x60
[11686.040569]  [<c152fa00>] ? vmalloc_fault+0xee/0xee
[11686.040574]  [<c152cfcf>] error_code+0x67/0x6c
[11686.040580]  [<c1520000>] ? reserve_backup_gdb.isra.11+0x26d/0x2c0
[11686.040583] Mem-Info:
[11686.040585] DMA per-cpu:
[11686.040588] CPU    0: hi:    0, btch:   1 usd:   0
[11686.040592] CPU    1: hi:    0, btch:   1 usd:   0
[11686.040594] Normal per-cpu:
[11686.040597] CPU    0: hi:  186, btch:  31 usd:   5
[11686.040600] CPU    1: hi:  186, btch:  31 usd:  30
[11686.040603] HighMem per-cpu:
[11686.040605] CPU    0: hi:   42, btch:   7 usd:   7
[11686.040608] CPU    1: hi:   42, btch:   7 usd:  22
[11686.040613] active_anon:113150 inactive_anon:113378 isolated_anon:0
[11686.040615]  active_file:86 inactive_file:1964 isolated_file:0
[11686.040616]  unevictable:0 dirty:0 writeback:0 unstable:0
[11686.040618]  free:13274 slab_reclaimable:2239 slab_unreclaimable:2594
[11686.040619]  mapped:1387 shmem:4380 pagetables:1375 bounce:0
[11686.040627] DMA free:4776kB min:784kB low:980kB high:1176kB active_anon:5116kB inactive_anon:5472kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15804kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:80kB slab_unreclaimable:168kB kernel_stack:96kB pagetables:64kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:6 all_unreclaimable? yes
[11686.040634] lowmem_reserve[]: 0 865 1000 1000
[11686.040644] Normal free:48212kB min:44012kB low:55012kB high:66016kB active_anon:383196kB inactive_anon:383704kB active_file:344kB inactive_file:7884kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:885944kB mlocked:0kB dirty:0kB writeback:0kB mapped:5548kB shmem:17520kB slab_reclaimable:8876kB slab_unreclaimable:10208kB kernel_stack:1960kB pagetables:3976kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:930 all_unreclaimable? yes
[11686.040652] lowmem_reserve[]: 0 0 1078 1078
[11686.040662] HighMem free:108kB min:132kB low:1844kB high:3560kB active_anon:64288kB inactive_anon:64336kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:138072kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:1460kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:61 all_unreclaimable? yes
[11686.040669] lowmem_reserve[]: 0 0 0 0
[11686.040675] DMA: 20*4kB 24*8kB 34*16kB 26*32kB 19*64kB 13*128kB 1*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 4784kB
[11686.040690] Normal: 819*4kB 607*8kB 357*16kB 176*32kB 99*64kB 49*128kB 23*256kB 4*512kB 0*1024kB 0*2048kB 2*4096kB = 48212kB
[11686.040704] HighMem: 16*4kB 0*8kB 1*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 80kB
[11686.040718] 14680 total pagecache pages
[11686.040721] 8202 pages in swap cache
[11686.040724] Swap cache stats: add 2191074, delete 2182872, find 1247325/1327415
[11686.040727] Free swap  = 0kB
[11686.040729] Total swap = 524284kB
[11686.043240] 262100 pages RAM
[11686.043244] 34790 pages HighMem
[11686.043246] 5610 pages reserved
[11686.043248] 2335 pages shared
[11686.043250] 240875 pages non-shared
[11686.043253] [ pid ]   uid  tgid total_vm      rss cpu oom_adj oom_score_adj name
[11686.043266] [ 1084]     0  1084      662        1   0       0             0 upstart-udev-br
[11686.043271] [ 1094]     0  1094      743       79   0     -17         -1000 udevd
[11686.043276] [ 1104]   101  1104     7232       42   0       0             0 rsyslogd
[11686.043281] [ 1149]   103  1149     1066      188   1       0             0 dbus-daemon
[11686.043286] [ 1165]     0  1165     1716       66   0       0             0 modem-manager
[11686.043291] [ 1220]   106  1220      861       42   0       0             0 avahi-daemon
[11686.043296] [ 1221]   106  1221      829        0   1       0             0 avahi-daemon
[11686.043301] [ 1255]     0  1255     6880      117   0       0             0 NetworkManager
[11686.043306] [ 1308]     0  1308     5988      144   0       0             0 polkitd
[11686.043311] [ 1334]     0  1334      723       85   0     -17         -1000 udevd
[11686.043316] [ 1335]     0  1335      730      108   0     -17         -1000 udevd
[11686.043320] [ 1375]     0  1375      663       37   0       0             0 upstart-socket-
[11686.043325] [ 1464]     0  1464     1333      120   1       0             0 login
[11686.043330] [ 1467]     0  1467     1333      135   1       0             0 login
[11686.043335] [ 1486]     0  1486     1333      135   1       0             0 login
[11686.043339] [ 1487]     0  1487     1333      136   1       0             0 login
[11686.043344] [ 1493]     0  1493     1333      134   1       0             0 login
[11686.043349] [ 1528]     0  1528      496       45   0       0             0 acpid
[11686.043354] [ 1529]     0  1529      607       46   1       0             0 cron
[11686.043359] [ 1549]     0  1549    10660      100   0       0             0 lightdm
[11686.043363] [ 1550]     0  1550      570       28   0       0             0 atd
[11686.043368] [ 1584]     0  1584      855       35   0       0             0 irqbalance
[11686.043373] [ 1703]     0  1703    17939     9653   0       0             0 Xorg
[11686.043378] [ 1874]     0  1874     7013      174   0       0             0 console-kit-dae
[11686.043382] [ 1958]     0  1958     1124       52   1       0             0 bluetoothd
[11686.043388] [ 2048]   999  2048     2435      641   1       0             0 bash
[11686.043392] [ 2049]   999  2049     2435      595   0       0             0 bash
[11686.043397] [ 2050]   999  2050     2435      587   1       0             0 bash
[11686.043402] [ 2051]   999  2051     2435      634   1       0             0 bash
[11686.043406] [ 2054]   999  2054     2435      569   0       0             0 bash
[11686.043411] [ 2155]     0  2155     1333      128   0       0             0 login
[11686.043416] [ 2222]     0  2222      684       67   1       0             0 dhclient
[11686.043420] [ 2240]   999  2240     2435      415   0       0             0 bash
[11686.043425] [ 2244]     0  2244     3631       58   0       0             0 accounts-daemon
[11686.043430] [ 2258]   999  2258    11683      277   0       0             0 gnome-session
[11686.043435] [ 2407]   999  2407      964       24   0       0             0 ssh-agent
[11686.043440] [ 2410]   999  2410      937       53   0       0             0 dbus-launch
[11686.043444] [ 2411]   999  2411     1319      300   1       0             0 dbus-daemon
[11686.043449] [ 2413]   999  2413     2287       88   0       0             0 gvfsd
[11686.043454] [ 2418]   999  2418     7867      123   1       0             0 gvfs-fuse-daemo
[11686.043459] [ 2427]   999  2427    32720      804   0       0             0 gnome-settings-
[11686.043463] [ 2437]   999  2437    10750      124   0       0             0 gnome-keyring-d
[11686.043468] [ 2442]   999  2442     2321      244   1       0             0 gconfd-2
[11686.043473] [ 2447]     0  2447     6490      156   0       0             0 upowerd
[11686.043478] [ 2467]   999  2467     7590       87   0       0             0 dconf-service
[11686.043482] [ 2529]   999  2529    11807      211   0       0             0 gsd-printer
[11686.043487] [ 2531]   999  2531    12162      587   0       0             0 metacity
[11686.043492] [ 2535]   999  2535    19175      960   0       0             0 unity-2d-panel
[11686.043496] [ 2536]   999  2536    19408     1012   0       0             0 unity-2d-launch
[11686.043502] [ 2539]   999  2539    16154     1120   1       0             0 nautilus
[11686.043506] [ 2540]   999  2540    17888      534   0       0             0 nm-applet
[11686.043511] [ 2541]   999  2541     7005      253   0       0             0 polkit-gnome-au
[11686.043516] [ 2544]   999  2544     8930      430   0       0             0 bamfdaemon
[11686.043521] [ 2545]   999  2545    11217      442   1       0             0 bluetooth-apple
[11686.043525] [ 2547]   999  2547      510       16   0       0             0 sh
[11686.043530] [ 2548]   999  2548    11205      301   1       0             0 gnome-fallback-
[11686.043535] [ 2565]   999  2565     6614      179   1       0             0 gvfs-gdu-volume
[11686.043539] [ 2567]     0  2567     5812      164   1       0             0 udisks-daemon
[11686.043544] [ 2571]     0  2571     1580       69   0       0             0 udisks-daemon
[11686.043549] [ 2579]   999  2579    16354     1035   0       0             0 unity-panel-ser
[11686.043554] [ 2602]     0  2602     1188       47   0       0             0 sudo
[11686.043559] [ 2603]     0  2603   374634   181503   0       0             0 flasherav
[11686.043564] [ 2607]   999  2607    12673      189   0       0             0 indicator-appli
[11686.043569] [ 2609]   999  2609    19313      311   1       0             0 indicator-datet
[11686.043573] [ 2611]   999  2611    15738      225   0       0             0 indicator-messa
[11686.043578] [ 2615]   999  2615    17433      237   1       0             0 indicator-sessi
[11686.043583] [ 2627]   999  2627     2393      132   0       0             0 gvfsd-trash
[11686.043588] [ 2640]   999  2640     1933       85   0       0             0 geoclue-master
[11686.043592] [ 2650]     0  2650     2498     1136   1       0             0 mount.ntfs
[11686.043598] [ 2657]   999  2657     6624      128   1       0             0 telepathy-indic
[11686.043602] [ 2659]   999  2659     2246      112   0       0             0 mission-control
[11686.043607] [ 2662]   999  2662     5431      346   1       0             0 gdu-notificatio
[11686.043612] [ 2664]     0  2664     3716     2392   0       0             0 mount.ntfs
[11686.043617] [ 2679]   999  2679    12453      197   1       0             0 zeitgeist-datah
[11686.043621] [ 2685]   999  2685     5196     1581   1       0             0 zeitgeist-daemo
[11686.043626] [ 2934]   999  2934    16305      710   0       0             0 gnome-terminal
[11686.043631] [ 2938]   999  2938      553        0   0       0             0 gnome-pty-helpe
[11686.043636] [ 2939]   999  2939     1814      406   0       0             0 bash
[11686.043641] Out of memory: Kill process 2603 (flasherav) score 761 or sacrifice child
[11686.043647] Killed process 2603 (flasherav) total-vm:1498536kB, anon-rss:721784kB, file-rss:4228kB

4 Answers

#1


Memory management in Linux is a bit tricky to understand, and I can't say I fully understand it yet, but I'll try to share a little bit of my experience and knowledge.

Short answer to your question: yes, there is other memory in use besides what's in the list.

What's shown in your list is applications running in userspace. The kernel uses memory for itself and its modules, and on top of that it also keeps a floor of free memory that you can't go below. Once that level is reached, it will try to free up resources, and when it can't do that anymore you end up with an OOM condition.

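If you want to see that floor on a running system, the kernel exposes it as a tunable and as per-zone watermarks (reading them is harmless; shown here purely for illustration):

sysctl vm.min_free_kbytes
grep min /proc/zoneinfo      # per-zone 'min' watermark, in pages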

From the last line of your list you can read that the kernel reports a total-vm usage of 1498536kB (~1.5GB) for the killed process, where total-vm includes both your physical RAM and swap space. You stated you don't have any swap, but the kernel seems to think otherwise, since your swap space is reported to be full (Total swap = 524284kB, Free swap = 0kB), and the reported total-vm of ~1.5GB is larger than your 1GB of RAM.

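You can cross-check what the kernel thinks about swap with the usual tools, for example:

free -m
swapon -s

Live images sometimes activate an existing swap partition found on the hard disk automatically, which could explain the discrepancy.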

Another thing that can complicate matters further is memory fragmentation. You can hit the OOM killer when the kernel tries to allocate, let's say, 4096kB of contiguous memory and no free block of that size is available.

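The per-order free lists are actually visible in the report above (the DMA:, Normal: and HighMem: lines), and on a running system you can watch them in /proc/buddyinfo:

cat /proc/buddyinfo    # one column per allocation order, from 4kB blocks upward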

That alone probably won't solve the actual problem. I don't know whether it's normal for your program to require that amount of memory, but I would recommend trying a static code analyzer like cppcheck to check for memory leaks or file descriptor leaks. You could also run it through Valgrind to get more information about memory usage.

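For example (the source path is a placeholder; flasherav is the binary name from your log):

cppcheck --enable=all src/
valgrind --leak-check=full ./flasherav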

#2


The sum of total_vm is 847170 and the sum of rss is 214726. Both values are counted in 4kB pages, which means that when the oom-killer ran, you had used 214726*4kB = 858904kB of physical memory and swap space.

Since your physical memory is 1GB and ~200MB of it was used for the /tmp memory mapping, invoking the oom-killer once 858904kB was in use is reasonable.

The rss of process 2603 is 181503 pages, i.e. 181503*4kB = 726012kB, which equals the sum of anon-rss and file-rss:

[11686.043647] Killed process 2603 (flasherav) total-vm:1498536kB, anon-rss:721784kB, file-rss:4228kB

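As a quick sanity check of that arithmetic:

echo $(( 721784 + 4228 ))   # 726012 kB, anon-rss + file-rss from the line above
echo $(( 181503 * 4 ))      # 726012 kB, the rss column converted from 4kB pages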

#3


This webpage has an explanation and a solution.

The solution is:

To fix this problem the behavior of the kernel has to be changed, so that it no longer overcommits memory for application requests. I finally put the values below into the /etc/sysctl.conf file, so they get applied automatically on start-up:

vm.overcommit_memory = 2

vm.overcommit_ratio = 80
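
With vm.overcommit_memory = 2 the kernel stops granting allocations optimistically and caps committed memory at swap plus vm.overcommit_ratio percent of physical RAM. To apply the settings without rebooting, reload the file and verify (standard sysctl usage):

sudo sysctl -p                                      # reload /etc/sysctl.conf
sysctl vm.overcommit_memory vm.overcommit_ratio     # confirm the new values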

#4


I'm running hundreds of VMs, and it's pretty hard for us to read through the log on every VM where a process got killed.

For this kind of situation, this will help you.

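For a single machine, grepping the kernel log already goes a long way, e.g.:

dmesg -T | grep -i 'killed process'
journalctl -k | grep -i 'out of memory'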
