Linux: no amount of running tasks can get past a lock

Date: 2023-03-08 21:54:53

Continuing from the earlier post "linux 3.10 一次softlock排查" (a softlockup investigation on Linux 3.10), look at how many tasks are in the running state on this machine:

crash> mach
MACHINE TYPE: x86_64
MEMORY SIZE: GB
CPUS: 24
HYPERVISOR: KVM
PROCESSOR SPEED: Mhz
HZ:
PAGE SIZE:
KERNEL VIRTUAL BASE: ffff880000000000
KERNEL VMALLOC BASE: ffffc90000000000
KERNEL VMEMMAP BASE: ffffea0000000000
KERNEL START MAP: ffffffff80000000
KERNEL MODULES BASE: ffffffffc0000000
KERNEL STACK SIZE:
IRQ STACK SIZE:
IRQ STACKS:
CPU 0: ffff880b52a00000
CPU 1: ffff880b52a40000
CPU 2: ffff880b52a80000
CPU 3: ffff880b52ac0000
CPU 4: ffff880b52b00000
CPU 5: ffff880b52b40000
CPU 6: ffff880b52b80000
CPU 7: ffff880b52bc0000
CPU 8: ffff880b52c00000
CPU 9: ffff880b52c40000
CPU 10: ffff880b52c80000
CPU 11: ffff880b52cc0000
CPU 12: ffff881692200000
CPU 13: ffff881692240000
CPU 14: ffff881692280000
CPU 15: ffff8816922c0000
CPU 16: ffff881692300000
CPU 17: ffff881692340000
CPU 18: ffff881692380000
CPU 19: ffff8816923c0000
CPU 20: ffff881692400000
CPU 21: ffff881692440000
CPU 22: ffff881692480000
CPU 23: ffff8816924c0000
DOUBLEFAULT STACK SIZE:

But how many of them are actually in the running state? Let's look:

crash> ps |grep -i RU
> ffffffff81a09480 RU 0.0 [swapper/]
ffff88016e798fd0 RU 0.0 [swapper/]
ffff88016e799fa0 RU 0.0 [swapper/]
ffff88016e79af70 RU 0.0 [swapper/]
ffff88016e79bf40 RU 0.0 [swapper/]
ffff88016e79cf10 RU 0.0 [swapper/]
ffff88016e79dee0 RU 0.0 [swapper/]
ffff88016e79eeb0 RU 0.0 [swapper/]
> ffff88016e008000 RU 0.0 [swapper/]
ffff88016e008fd0 RU 0.0 [swapper/]
ffff88016e009fa0 RU 0.0 [swapper/]
> ffff88016e00af70 RU 0.0 [swapper/]
> ffff880beeaceeb0 RU 0.0 [swapper/]
ffff880beeacdee0 RU 0.0 [swapper/]
ffff880beeaccf10 RU 0.0 [swapper/]
> ffff880beeacbf40 RU 0.0 [swapper/]
ffff880beeacaf70 RU 0.0 [swapper/]
ffff880beeac9fa0 RU 0.0 [swapper/]
ffff880beeb10000 RU 0.0 [swapper/]
ffff880beeb10fd0 RU 0.0 [swapper/]
ffff880beeb11fa0 RU 0.0 [swapper/]
ffff880beeb12f70 RU 0.0 [swapper/]
ffff880beeb13f40 RU 0.0 [swapper/]
ffff880beeb14f10 RU 0.0 [swapper/]
ffff880beeac8000 RU 0.0 systemd
ffff88016e726eb0 RU 0.0 [rcu_sched]
ffff88016e00eeb0 RU 0.0 [watchdog/]
ffff88016e00cf10 RU 0.0 [ksoftirqd/]
ffff88016e040fd0 RU 0.0 [watchdog/]
ffff88016e045ee0 RU 0.0 [watchdog/]
ffff88016e098000 RU 0.0 [ksoftirqd/]
ffff88016e09af70 RU 0.0 [watchdog/]
ffff88016e111fa0 RU 0.0 [watchdog/]
ffff88016e163f40 RU 0.0 [watchdog/]
ffff88016e198fd0 RU 0.0 [watchdog/]
ffff880beeb3bf40 RU 0.0 [watchdog/]
ffff880beeb7dee0 RU 0.0 [watchdog/]
ffff880beebaaf70 RU 0.0 [watchdog/]
ffff880beebf0000 RU 0.0 [watchdog/]
ffff880beebf4f10 RU 0.0 [watchdog/]
ffff880bee021fa0 RU 0.0 [watchdog/]
ffff880bee026eb0 RU 0.0 [watchdog/]
ffff880bee050fd0 RU 0.0 [ksoftirqd/]
ffff880bee053f40 RU 0.0 [watchdog/]
ffff880bee078fd0 RU 0.0 [watchdog/]
ffff880b06e15ee0 RU 0.5 zte_uep1
ffff88077fe08000 RU 0.5 zte_uep1
ffff8814af3fbf40 RU 0.5 zte_uep1
ffff8807967d6eb0 RU 0.5 java
ffff880b4f5b5ee0 RU 0.0 systemd-journal
ffff8809a4aacf10 RU 0.5 zte_uep1
ffff880036a6cf10 RU 0.5 zte_uep1
ffff880036022f70 IN 0.0 [xfs_mru_cache]
...... far too many to list in full

Now count them:

crash> ps |grep -w RU |wc -l
1654

Anyone who regularly runs top knows that the number of running processes is normally at most the number of CPU cores, so why does this machine have so many tasks in the running state? To rule out an artifact of the crash tool itself, let's look at it from the runqueue side as well:

crash> p runqueues
PER-CPU DATA TYPE:
struct rq runqueues;
PER-CPU ADDRESSES:
[0]: ffff880b52a17c00
[1]: ffff880b52a57c00
[2]: ffff880b52a97c00
[3]: ffff880b52ad7c00
[4]: ffff880b52b17c00
[5]: ffff880b52b57c00
[6]: ffff880b52b97c00
[7]: ffff880b52bd7c00
[8]: ffff880b52c17c00
[9]: ffff880b52c57c00
[10]: ffff880b52c97c00
[11]: ffff880b52cd7c00
[12]: ffff881692217c00
[13]: ffff881692257c00
[14]: ffff881692297c00
[15]: ffff8816922d7c00
[16]: ffff881692317c00
[17]: ffff881692357c00
[18]: ffff881692397c00
[19]: ffff8816923d7c00
[20]: ffff881692417c00
[21]: ffff881692457c00
[22]: ffff881692497c00
[23]: ffff8816924d7c00
crash> struct rq.nr_running ffff880b52a17c00
nr_running = 7
crash> struct rq.nr_running ffff880b52a57c00
nr_running = 152

Query nr_running on each runqueue in turn, then total them with a script:

grep 'nr_running =' 1.txt | awk '{print $3; sum+=$3} END{printf "%d\n", sum}'
7
152
113
138
65
4
4
38
4
148
43
1
1
141
4
43
107
102
91
110
108
105
98
3
1630 ------------- total

Why doesn't this total match the number of RU tasks that ps reported? At first I suspected it was just a snapshot artifact of reading the per-CPU counters, but the difference between the two totals, 1654 - 1630 = 24, exactly matches the number of CPU cores.
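The arithmetic can be sanity-checked with a short script; the 24 per-runqueue values below are copied from the crash output above:

```python
# Per-CPU rq.nr_running values, copied from the crash output above.
nr_running = [7, 152, 113, 138, 65, 4, 4, 38, 4, 148, 43, 1,
              1, 141, 4, 43, 107, 102, 91, 110, 108, 105, 98, 3]

total = sum(nr_running)
print(total)            # sum of the per-runqueue counters: 1630

# `crash> ps` also lists one idle (swapper) task per CPU as RU,
# but idle tasks are never counted in rq.nr_running.
cpus = len(nr_running)
print(total + cpus)     # 1654, matching the ps RU count
```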

crash> ps |grep -w 'RU'|grep 'swapper'
> ffffffff81a09480 RU 0.0 [swapper/]
ffff88016e798fd0 RU 0.0 [swapper/]
ffff88016e799fa0 RU 0.0 [swapper/]
ffff88016e79af70 RU 0.0 [swapper/]
ffff88016e79bf40 RU 0.0 [swapper/]
ffff88016e79cf10 RU 0.0 [swapper/]
ffff88016e79dee0 RU 0.0 [swapper/]
ffff88016e79eeb0 RU 0.0 [swapper/]
> ffff88016e008000 RU 0.0 [swapper/]
ffff88016e008fd0 RU 0.0 [swapper/]
ffff88016e009fa0 RU 0.0 [swapper/]
> ffff88016e00af70 RU 0.0 [swapper/]
> ffff880beeaceeb0 RU 0.0 [swapper/]
ffff880beeacdee0 RU 0.0 [swapper/]
ffff880beeaccf10 RU 0.0 [swapper/]
> ffff880beeacbf40 RU 0.0 [swapper/]
ffff880beeacaf70 RU 0.0 [swapper/]
ffff880beeac9fa0 RU 0.0 [swapper/]
ffff880beeb10000 RU 0.0 [swapper/]
ffff880beeb10fd0 RU 0.0 [swapper/]
ffff880beeb11fa0 RU 0.0 [swapper/]
ffff880beeb12f70 RU 0.0 [swapper/]
ffff880beeb13f40 RU 0.0 [swapper/]
ffff880beeb14f10 RU 0.0 [swapper/]

It turns out the idle (swapper) task is in the running state by default, but it is not included in the nr_running count:

crash> ps |grep -w 'RU' |grep -w '12'
> 0 0 12 ffff880beeaceeb0 RU 0.0 0 0 [swapper/12]
12 2 1 ffff88016e00eeb0 RU 0.0 0 0 [watchdog/1] ----------- not a CPU-12 task; grep matched its PID, 12
22237 1 12 ffff881254b5eeb0 RU 1.0 14321692 912420 1_scheduler
crash> struct rq.nr_running ffff881692217c00
nr_running = 1
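The accounting can be pictured with a toy model (plain Python, not kernel code): nr_running is bumped only when a task is enqueued on a scheduling class, and the idle task is never enqueued — it is simply what the CPU falls back to when the queue is empty, which is why CPU 12 above shows nr_running = 1 even though ps lists both swapper/12 and 1_scheduler as RU:

```python
class RunQueue:
    """Toy model of a per-CPU rq: nr_running counts enqueued tasks only."""
    def __init__(self, cpu):
        self.idle = f"swapper/{cpu}"  # the idle task exists but is never enqueued
        self.queue = []
        self.nr_running = 0

    def enqueue(self, task):
        self.queue.append(task)
        self.nr_running += 1          # analogue of the kernel bumping rq->nr_running

    def current(self):
        # pick_next_task() falls back to idle when nothing is queued
        return self.queue[0] if self.queue else self.idle

rq = RunQueue(cpu=12)
print(rq.current(), rq.nr_running)    # swapper/12 runs, yet nr_running == 0
rq.enqueue("1_scheduler")
print(rq.current(), rq.nr_running)    # 1_scheduler runs, nr_running == 1
```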

So why are there so many running tasks? Examining the tasks in the running state, their stacks all sit in __schedule — in other words, the run queues are simply very long.

Pick a few and look at their stacks:

crash> bt
PID: TASK: ffff8809d42daf70 CPU: COMMAND: "java"
# [ffff88099eb93d58] __schedule at ffffffff816b6165
# [ffff88099eb93dc0] schedule at ffffffff816b6ae9
# [ffff88099eb93dd0] schedule_hrtimeout_range_clock at ffffffff816b5852
# [ffff88099eb93e68] schedule_hrtimeout_range at ffffffff816b5903
# [ffff88099eb93e78] ep_poll at ffffffff812526de
# [ffff88099eb93f30] sys_epoll_wait at ffffffff81253b9d
# [ffff88099eb93f80] system_call_fastpath at ffffffff816c2789
RIP: 00007ffa43a652c3 RSP: 00007ffa2595b818 RFLAGS:
RAX: 00000000000000e8 RBX: ffffffff816c2789 RCX: 00000000e14ef930
RDX: RSI: 00000000047a5000 RDI:
RBP: 00007ffa2595b630 R8: R9:
R10: R11: R12: 0000016547341bca
R13: R14: R15: 00000000047a5000
ORIG_RAX: 00000000000000e8 CS: SS: 002b
crash> bt
PID: TASK: ffff8809cbe03f40 CPU: COMMAND: "java"
# [ffff880a4eedbd58] __schedule at ffffffff816b6165
# [ffff880a4eedbdc0] schedule at ffffffff816b6ae9
# [ffff880a4eedbdd0] schedule_hrtimeout_range_clock at ffffffff816b5852
# [ffff880a4eedbe68] schedule_hrtimeout_range at ffffffff816b5903
# [ffff880a4eedbe78] ep_poll at ffffffff812526de
# [ffff880a4eedbf30] sys_epoll_wait at ffffffff81253b9d
# [ffff880a4eedbf80] system_call_fastpath at ffffffff816c2789
RIP: 00007f7bb9bb72c3 RSP: 00007f7b8f060510 RFLAGS:
RAX: 00000000000000e8 RBX: ffffffff816c2789 RCX: 00000000ffffffff
RDX: RSI: 0000000005be3000 RDI:
RBP: 00007f7b8f060310 R8: R9:
R10: 00000000000003e8 R11: R12:
R13: R14: 00000000000003e8 R15: 0000000005be3000
ORIG_RAX: 00000000000000e8 CS: SS: 002b
crash> task_struct.state ffff8809cbe03f40
state = 0

The state value macros:

#define TASK_RUNNING        0
#define TASK_INTERRUPTIBLE 1
#define TASK_UNINTERRUPTIBLE 2
#define __TASK_STOPPED 4
#define __TASK_TRACED 8
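A small decoder makes the mapping from state value to the two-letter column in crash's ps output concrete (the letter codes here are assumed from the RU/IN/UN abbreviations seen in the listings above):

```python
# State flags from the 3.10 kernel headers quoted above.
TASK_RUNNING         = 0
TASK_INTERRUPTIBLE   = 1
TASK_UNINTERRUPTIBLE = 2
__TASK_STOPPED       = 4
__TASK_TRACED        = 8

def ps_state(state):
    """Map a task_struct.state value to a ps-style two-letter code."""
    if state == TASK_RUNNING:        return "RU"
    if state & TASK_INTERRUPTIBLE:   return "IN"
    if state & TASK_UNINTERRUPTIBLE: return "UN"
    if state & __TASK_STOPPED:       return "ST"
    if state & __TASK_TRACED:        return "TR"
    return "??"

print(ps_state(0))   # RU -- the java task above, state = 0
print(ps_state(1))   # IN -- e.g. [xfs_mru_cache] in the ps listing
```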

Now look at the stacks of the tasks actually holding the CPUs: they are basically all waiting on locks. Normally each task would finish its work quickly and go back onto a wait queue, but because some task keeps hold of the CPU and will not let go, the tasks that were woken when their resource became available just sit in the run queue and never get scheduled. This is the nature of a non-preemptive kernel: even a high-priority task must wait for a low-priority one to give up the CPU voluntarily. You may already be on the run queue, your state may already say running, but you wait all the same.
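That last point can be illustrated with a toy simulation (made-up task names and priorities, nothing kernel-specific): on a non-preemptive CPU, whoever currently holds the CPU finishes first, and only then does priority decide who runs next:

```python
def run(current, ready):
    """current: (name, prio) already on the CPU; ready: woken runnable tasks.
    Non-preemptive: the current task runs to completion before anyone in
    the run queue starts, no matter how high their priority is."""
    order = [current[0]]
    # Priority only matters when the CPU is voluntarily given up.
    for name, prio in sorted(ready, key=lambda task: -task[1]):
        order.append(name)
    return order

# A low-priority task hogs the CPU (e.g. spinning on a lock) while
# higher-priority tasks sit woken and "running" in the run queue.
order = run(current=("lock_hog", 1),
            ready=[("high_prio", 99), ("mid_prio", 50)])
print(order)   # ['lock_hog', 'high_prio', 'mid_prio']
```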