Linux 性能优化和内核观察 - CPU 篇（一）

简介

*处理器（central processing unit，简称 CPU）作为计算机系统的运算和控制核心，是信息处理、程序运行的最终执行单元。所有的计算机程序都运行在CPU之上，在大多数情况下CPU都是性能分析的重要关注点。

CPU 基础知识

CPU 的运行模式

CPU 和其它硬件资源都是由系统内核（Kernel）管理的，系统内核运行在一个特殊的模式系统模式（System mode）下。用户态应用程序运行于用户模式（User mode）下，运行在用户模式下的应用程序只能访问受限的资源，不能直接访问内存等硬件设备，只能向系统内核发一些请求来访问这些特权资源。这些系统请求包括显式请求，例如，系统调用（Syscall）和隐式请求，例如：由访问内存导致的缺页中断（Page fault）。

CPU 上下文切换

现代操作系统可以同时执行多个任务，但这些任务并不是同时执行的，而是系统内核在很短的时间内将 CPU 轮流分配给每个任务，由于任务轮流执行依赖于 CPU 寄存器和程序计数器，通过它们暂存指令，数据和地址，以及程序运行的下一条地址，因此也被称为 CPU 上下文。

CPU 上下文切换就是把上一个任务的 CPU 上下文保存起来，然后加载新任务的上下文到寄存器与程序计数器，加载完成后跳转到程序计数器所指的新位置，运行新任务。被保运起来的上下文会存储在系统内核中，并在任务执行时再次加载进来。

每次上下文切换需要几十纳秒到几微秒的时间，频繁上下文会导致 CPU 将大量时间耗费在寄存器，内核栈及虚拟内存等资源保存恢复上，缩短真正运行时间。当上下文次数超过1万次或出现爆发式增涨时，系统会出现性能问题。

进程上下文切换

进程调度时会触发上下文切换，Linux 为每个 CPU 都为维护了一个就绪队列，将活跃进程（正在运行和正在等待 CPU 的进程）按照优先级和等待 CPU 的时间排序，然后选择最需要 CPU 的进程，也就是优先级最高和等待 CPU 时间最长的进程来运行。

触发进程上下文切换的事件：

为了保证所有进程可以得到公平的调度，CPU 时间被划分为一段段的时间片，这些时间片再被轮流分配给各个进程。这样，当某个进程的时间片耗尽，就会被系统挂起，切换到其它正在等待 CPU 的进程运行。（被动脱离）
保证进程在系统资源不足（如内存不足）时，要等到资源满足之后才可以运行，这个时候进程也会被挂起，并由系统调度其它进程运行。（主动脱离）
当进程通过睡眠函数（sleep）这样的方法将自己主动挂起时，系统重新调度其它进程运行。（主动脱离）
当有更高优先级的进程需要运行时，为了保证高优先级进程的运行，当前进程会被挂起，系统调度高优先级进程运行。（被动脱离）
发生中断时，CPU 上的进程会被中断挂起，转而执行中断服务程序。（被动脱离）

线程上下文切换

线程是系统调度的基本单位，进程是资源拥有的基本单位，为线程提供虚拟内存，全局变量等资源。线程上下文切换分为两种情况：

两个线程属于不同进程，此时进程资源不共享，线程上下文切换等于进程上下文切换。
两个线程属于同个进程，此时进程资源共享，在线程切换时进程资源不变，只切换线程私有数据，此时线程上下文切换消耗资源比进程少。

中断上下文切换

由于硬件（鼠标、键盘）优先级高于进程优先级，收到硬件指令后会打断正在执行的进程，转而执行硬件指令，这种情况被称为中断上下文切换。

中断上下文切换不涉及进程用户态，发生中断上下文切换时不用保存和恢复进程的虚拟内存，全局变量等，性能消耗会比进程上下文切换低。

中断

中断是指计算机运行过程中，出现某些意外情况需系统干预时，内核能够自动停止正在运行的程序并转入处理新情况的程序，处理完毕后又返回原被暂停的程序继续运行。中断包括硬中断和软中断。

整个操作系统可理解为一个由中断驱动的死循环。

中断分为硬中断和软中断，其分类依据是实现机制，而不是触发机制，比如 CPU 硬中断，它是由 CPU 这个硬件实现的中断机制，但它的触发可以通过外部硬件，也可以通过软件的 INT 指令。

类似地，软中断是由软件实现的中断，是纯粹由软件实现的一种类似中断的机制，实际上是模仿硬件，在内存中存储着一组软中断的标志位，然后由内核的一个守护线程不断轮询这些标志位，如果有哪个标志位有效，则再去执行这个软中断对应的中断处理程序。

平均负载

系统处于可运行状态和不可中断状态的平均进程数，也就是平均活跃进程数，它和 CPU 使用率并没有直接关系。

可运行状态的进程，是指正在使用 CPU 或者正在等待 CPU 的进程，也就是我们常用 ps 命令看到的，处于 R 状态（Running 或 Runnable）的进程。

不可中断状态的进程则是正处于内核态关键流程中的进程，并且这些流程是不可打断的，比如最常见的是等待硬件设备的 I/O 响应，也就是我们在 ps 命令中看到的 D 状态进程。

进程状态

状态	说明
R	Running 或 Runnable 的缩写，表示进程在 CPU 的就绪队列中，正在运行或正在等待运行
D	Disk Sleep 的缩写，表示不可中断睡眠（Uninterruptible Sleep），一般是进程正在跟硬件交互，并且交互工程中不允许被其它进程或中断打断
Z	Zombie 的缩写，表示僵尸进程，也就是进程实际已经结束，但是父进程还没有回收它的资源
S	Interruptible Sleep 的缩写，表示可中断睡眠，通常是进程因为等待某个事件而被系统挂起。当进程等待的事件发生时，它会被唤醒并进入 R 状态
I	Idle 的缩写，表示空闲状态，用在不可中断睡眠的内核线程上。硬件交互导致的不可中断进程使用 D 表示，但对某些内核线程来说，它们可能没有任何负载，用 Idle 就是为了区分这种情况 `注意：D 状态的进程会导致平均负载升高，I 状态的进程不会`
T/t	Stopped 或 Traced 的缩写，表示进程厨艺暂停或者跟踪状态。向一个进程发生 SIGSTOP 信号，它就会进入暂停状态（Stopped）。使用 DEBUG 工具（如gdb）调试一个进程时它就会进入跟踪状态（Traced），这其实也是一种特殊的暂停状态，只不过你可以使用调试器来跟踪并按需要控制进程运行
X	Dead 的缩写，表示进程已经消亡，所以不会在 ps 或者 top 中看到这个状态的进程

CPU 性能分析和优化策略

性能分析策略

在花时间使用分析工作之前，先保证待分析对象处于 CPU 运行状态。检查系统中整体的 CPU 利用率，并保证每个 CPU 都处于在线状态。
确保系统负载确实受限于 CPU。

系统中所有 CPU 的使用率是否都很高，还是仅某个 CPU 使用率高。
检查系统中运行队列的延迟。

先量化整个系统中的 CPU 使用量的百分比，然后再按进程，CPU 模式，CPU ID 来分解，可以通过某种模式或者某个 CPU 的高使用率情况查找某个进程。

如果系统时间占用高，那么可以按照进程和系统调用类型来统计系统调用的频率和数量，同时检查系统调用的参数来识别值得优化的地方。

使用性能剖析器来采样应用程序的调用栈信息，在利用 CPU 火焰图来展示。很多 CPU 问题都可以通过检查火焰图来分析。
针对某个 CPU 使用率高的任务，可以考虑开发一些定制工具来获取更多的上下文信息。性能剖析器通常可以展示哪些函数正在运行。但是不能展示调用参数和函数内部信息，理解 CPU 用量时可能需要这些信息。

内核模式：如果某个文件系统针对文件进行 stat() 消耗了很多 CPU 资源，那么文件名是什么？
用户模式：如果某个应用程序忙于处理请求，那么这些请求到底是什么？

测量硬中断的资源消耗，这些信息可能对基于定时器的分析工具不可见。
选择合适的工具进行对系统进行分析。
利用 PMC 来测量每时钟周期内 CPU 指令执行量（IPC），以理解宏观层面 CPU 的阻塞情况。

PMC

性能监控计数器(Performance monitoring counter, PMC)，是处理器上的硬件可编程计数器。PMC 数量众多，是性能分析领域至关重要的资源。只有通过 PMC 才能测量 CPU 指令执行的效率、CPU 缓存的命中率、内存/数据互联和设备总线的利用率，以及阻塞的指令周期等。在性能分析方面使用这些方法可以进行各种细微的性能优化。

尽管有数百个可用的 PMC，但在同一时刻 CPU 只允许固定数量的寄存器(可能只有6个)进行读取。在实现中需要选择通过这6个寄存器来读取哪些 PMC，或者可以以循环采样的方式覆盖多个 PMC 集合(Linux 中的perf()工具可以自动支持这种循环采样)。其它软件类计数器则没有这种限制。

PMC 工作模式

计数：在此模式下，PMC 能够跟踪事件发生的频率。只要内核有需要，就可以随时读取，比如每秒获取1次。这种模式的开销几乎为零。
溢出采样：在此模式下，PMC 在所监控的事件发生到一定次数时通知内核，这样内核可以获取额外的状态。监控的事件可能会以每秒百万、亿级别的频率发生，如果每次事件都进行中断会导致系统性能下降到不可用。解决方案是利用一个可编程的计数器进行采样，具体来说，是当计数器溢出时就向内核发信号(比如，每 10000 次 LLC 缓存未命中事件，或者每 100 万次阻塞的指令时钟周期)。

采样模式对 BPF 跟踪来说更值得关注，因为它产生的事件给 BPF 程序提供了执行的时机。BCC 和 bpfrace 都支持 PMC 事件跟踪。

PEBS

由于存在中断延迟（通常被称为打滑）或者乱序执行，溢出采样可能不能正确地记录触发事件发生时的指令指针。对于 CPU 周期性能分析来说，这类打滑可能不是什么问题，而且有些性能分析器会故意在采样周期中引入一些微小的不规则性，避免锁步采样(lockstep sampling) (或者使用一个自带偏移量的采样频率，例如，99Hz)。但是对于测量另外一些事件来说，比如 LLC 的未命中率，这些采样的指令指针就必须是精确的。

Intel 开发了一种解决方案，叫作精确事件采样(precise event-based sampling,PEBS)。PEBS 使用硬件缓冲区来记录 PMC 事件发生时正确的指令指针。Linux 的 perf_events 框架机制支持PEBS，perf_events 是 perf 命令所依赖的采样和跟踪机制。

可以使用perf stat -e LLC-loads,LLC-load-misses,LLC-stores,LLC-store-misses 利用 PMC 查看 CPU LLC 命中率。

性能优化策略

中断负载均衡

把中断均衡到不同的 CPU，可以分摊中断处理的 CPU 压力，提升 CPU 的总体利用率，也避免了某个 CPU 上的性能瓶颈。

应用程序的 CPU 绑定

把进程绑定到一个或多个cpu，可以减少跨 CPU 调度带来的上下文切换带来的性能问题。

调整进程优先级

从某个进程视角来看，该进程的优先级越高，那么它的 CPU 调度越及时，执行延迟也就越少。

尽量用多线程，而不是多进程

进程是资源拥有的基本单位，进程之间的上下文切换开销不可小觑。进程上下文切换除了 CPU 通用寄存器、PC 指针外，还包含了内核栈、全局变量、虚拟地址空间、TLB 缓存切换（失效）等。而同一个进程的线程们由于共享进程的虚拟地址空间，所以调度时的切换开销就比较低，这样也更能提升性能。

使用异步处理，而不是同步处理

同步往往伴随着阻塞，阻塞意味着当前进程要放弃 CPU，从而引发新一轮的进程调度，这不仅会让后面的代码不能及时执行，也会引入进程调度的开销；而异步则不会影响当前进程的执行过程，从而提升程序的运行效率和并发处理能力。

传统工具

工具	介绍
uptime	展示系统平均负载和系统运行时间
top	按进程展示系统资源使用情况
mpstat	按每个 CPU 展示 CPU 使用情况
vmstat	系统整体资源使用情况
pidstat	按进程展示 CPU 使用情况
perf	定时采样调用栈信息、事件统计、PMC 跟踪、跟踪点、USDT probes、kprobes 以及 uprobes 等
ftrace	汇报内核函数调用统计、kprobes 和 uprobes 事件跟踪

uptime

# uptime 展示了系统时间，系统启动以来运行的时间，当前登陆用户数，系统平静负载
$ uptime
20:04:02 up 1 day,  2:30,  1 user,  load average: 0.20, 0.10, 0.09

pidstat

# 每隔 1 秒展示进程使用 CPU 的情况，输出 5 次
$ pidstat 1 5
Linux 5.15.0-48-generic (ebpf)  11/08/22        _x86_64_        (8 CPU)

08:11:26 PM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
08:11:27 PM     0       545    0.00    0.98    0.00    0.98    0.98     6  systemd-udevd
08:11:27 PM     0      1066    0.98    0.00    0.00    0.00    0.98     0  libvirtd
08:11:27 PM     0      2456    0.98    0.00    0.00    0.00    0.98     7  orion-nvidia-mo
08:11:27 PM     0      2990    0.00    0.98    0.00    0.00    0.98     7  pidstat
08:11:27 PM     0      5803    0.00    0.98    0.98    0.00    1.96     4  qemu-system-x86
......

# 按进程输出上下文切换信息
$ pidstat -w 1
Linux 5.15.0-48-generic (ebpf)  11/08/22        _x86_64_        (8 CPU)

09:04:25 PM   UID       PID   cswch/s nvcswch/s  Command
09:04:26 PM     0         1      0.98      0.00  systemd
09:04:26 PM     0         8     23.53      0.00  rcu_sched
09:04:26 PM     0       129      0.98      0.00  kworker/3:1H
09:04:26 PM     0       136      0.98      0.00  kworker/7:1H
09:04:26 PM     0       528      1.96      0.00  systemd-journal

# 输出线程的上下文切换信息
$ pidstat -wt 1
Linux 5.15.0-48-generic (ebpf)  11/08/22        _x86_64_        (8 CPU)

09:04:42 PM   UID      TGID       TID   cswch/s nvcswch/s  Command
09:04:43 PM     0      5803         -      0.97      0.00  qemu-system-x86
09:04:43 PM     0         -      5803      0.97      0.00  |__qemu-system-x86
09:04:43 PM     0         -      5819      0.97      0.00  |__CPU 0/KVM
09:04:43 PM     0         -      5821     11.65      0.00  |__CPU 1/KVM
09:04:43 PM     0         -      5822      6.80      0.00  |__CPU 2/KVM
09:04:43 PM     0         -      5823      8.74      3.88  |__CPU 3/KVM

mpstat

# 展示所有 CPU 的使用情况, 每隔 1 秒输出一次
$ mpstat -P ALL 1
Linux 5.15.0-48-generic (ebpf)  11/08/22        _x86_64_        (8 CPU)

08:35:44 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
08:35:45 PM  all    0.12    0.00    0.25    0.00    0.00    0.00    0.00    0.00    0.00   99.62
08:35:45 PM    0    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
08:35:45 PM    1    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
08:35:45 PM    2    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
08:35:45 PM    3    0.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00    0.00   99.00
08:35:45 PM    4    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
08:35:45 PM    5    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
08:35:45 PM    6    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
08:35:45 PM    7    1.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   99.00
......

top

$ top
top - 20:07:03 up 1 day,  2:33,  1 user,  load average: 0.37, 0.17, 0.11
Tasks: 203 total,   1 running, 103 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.5 us,  0.4 sy,  0.0 ni, 99.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 32939204 total, 27652956 free,  1758876 used,  3527372 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 30752904 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                                                                                                                                                               
32565 root      20   0   42952   4132   3388 R   6.2  0.0   0:00.02 top                                                                                                                                                                                                                                   
    1 root      20   0  225544   9324   6656 S   0.0  0.0   3:09.49 systemd                                                                                                                                                                                                                               
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.20 kthreadd                                                                                                                                                                                                                              
    4 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 kworker/0:0H                                                                                                                                                                                                                          
    6 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 mm_percpu_wq                                                                                                                                                                                                                          
    7 root      20   0       0      0      0 S   0.0  0.0   0:00.95 ksoftirqd/0                                                                                                                                                                                                                           
    8 root      20   0       0      0      0 I   0.0  0.0   0:36.17 rcu_sched

vmstat

# -Sm 输出对其，每 1 秒输出一次
$ vmstat -Sm 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0  27861    888   2921    0    0     3    21    5   18  0  0 99  0  0
 0  0      0  27861    888   2921    0    0     0     0  300  499  0  0 100  0  0
 0  0      0  27861    888   2921    0    0     0     0  363  660  0  0 100  0  0
 0  0      0  27860    888   2921    0    0     0  1672 1911 3871  1  1 98  0  0
 ....

工具输出单位解释

单位	来源	描述
%usr	`pidstat`	用户态 CPU 使用率
%system	pidstat	内核态 CPU 使用率
%guest	pidstat	运行虚拟机 CPU 使用率
%wait	pidstat	等待 CPU 使用率
%CPU	pidstat	总的 CPU 使用率
CPU	pidstat	CPU 编号
cswch/s	pidstat	每秒自愿上下文切换的次数
nvcswch/s	pidstat	每秒非自愿上下文切换的次数
TGID	pidstat	thread group id 的缩写，用来在 linux 内核层面区分进程和线程。任何一个进程，如果只有主线程，那 pid 就等于 tgid；但是，如果一个进程 A 创建了其它线程 B，那么线程 B 有自己的 pid，而线程 B 的 tgid 就是进程Ａ的 pid
TID	pidstat	thread id 的缩写，表示线程 id

us	`top`	user 的缩写，表示用户态 CPU 时间，注意，它不包括的 nice 时间，但是包括了 guest 时间
sy	top	system 的缩写，表示内核态 CPU 时间
ni	top	nice 的缩写，表示低优先级用户态 CPU 时间，也就是进程的 nice 值被调整为 1-19之间的 CPU 时间。nice 的取值为 -20 到 19，数字越到优先级越低
id	top	idle 的缩写，表示空闲时间。注意：它不包括等待 I/O 的时间
wa	top	iowait 的缩写，表示等待 I/O 的时间
hi	top	irq 的缩写，表示处理硬中断的时间
si	top	softirq 的缩写，表示处理软中断的时间
st	top	steal 的缩写，表示当系统运行在虚拟机中的时候，被其它虚拟机占用的 CPU 时间（也称之为被偷取的时间）
PR	top	priority 的缩写，表示调度优先级，PR 的值等于 NI 的值加 20，即 PR=NI+20
NI	top	nice 的缩写，表示切换优先级，数字越小优先级越高，取值为 -20 到 19，0 表示正常任务
VIRT	top	虚拟内存大小（申请的内存大小）
RES	top	常驻内存大小（实际使用的物理内存）
SHR	top	共享内存大小
S	top	进程状态
CPU	top	CPU 使用率
MEM	top	内存使用率
TIME	top	进程启动时间
COMMAND	top	启动命令

r	`vmstat`	就绪队列的长度
b	vmstat	处于不可中断睡眠状态的进程数
si	vmstat	Swap 换入的内存页
so	vmstat	Swap 换出的内存页
bi	vmstat	收到块设备的 block 数目
Bo	vmstat	发送到块设备的 block 数目
in	vmstat	每秒中断的次数，IO繁忙的系统（如数据库），进程经常性等待 IO 中断进入内核态运行，所以设备中断比较高
cs	vmstat	每秒上下文切换的次数，CPU 繁忙的系统（如大数据系统）上下文切换较多，可以调整进程/线程数比或者通过进程复用的方法来降低上下文切换总量

gust	`mpstat`	代表通过虚拟化运行其它操作系统的时间，也就是运行虚拟机的 CPU 时间
guest_nice	mpstat	代表以低优先级运行的虚拟机的事件

bpf 工具

工具	来源	目标	描述
execsnoop	BCC	调度	列出新进程的运行信息
exitsnoop	BCC	调度	列出进程运行时长和退出原因
runqlat	BCC	调度	统计 CPU 运行队列的延迟信息
runqlen	BCC	调度	统计 CPU 运行队列的长度
runqslower	BCC	调度	当运行队列中等待时长超过阈值时打印
cpudist	BCC	调度	统计在 CPU 上运行的时间
profile	BCC	CPU	采样 CPU 运行的调用栈信息
offcputime	BCC	调度	统计线程脱离 CPU 时的跟踪信息和等待时常
syscount	BCC	系统调用	按类型和进程统计系统调用次数
argdist	BCC	系统调用	可以用来进行系统调用分析
trace	BCC	系统调用	可以用来进行系统调用分析
funccount	BCC	软件	统计函数调用次数
softirqs	BCC	中断	统计软中断时间
hardirqs	BCC	中断	统计硬中断时间
llcstat	BCC	PMC	按进程统计 LLC 命中率

execsnoop

$ execsnoop-bpfcc --help
usage: execsnoop-bpfcc [-h] [-T] [-t] [-x] [--cgroupmap CGROUPMAP] [--mntnsmap MNTNSMAP] [-u USER] [-q] [-n NAME] [-l LINE] [-U] [--max-args MAX_ARGS]

Trace exec() syscalls

options:
  -h, --help            show this help message and exit
  -T, --time            include time column on output (HH:MM:SS)
  -t, --timestamp       include timestamp on output
  -x, --fails           include failed exec()s
  --cgroupmap CGROUPMAP
                        trace cgroups in this BPF map only
  --mntnsmap MNTNSMAP   trace mount namespaces in this BPF map only
  -u USER, --uid USER   trace this UID only
  -q, --quote           Add quotemarks (") around arguments.
  -n NAME, --name NAME  only print commands matching this name (regex), any arg
  -l LINE, --line LINE  only print commands where arg contains this line (regex)
  -U, --print-uid       print UID column
  --max-args MAX_ARGS   maximum number of arguments parsed and displayed, defaults to 20

# 案例1: 跟踪 docker 容器创建过程
$ execsnoop-bpfcc -t -U
TIME(s) UID   PCOMM            PID    PPID   RET ARGS
33.147  0     docker           3750   2907     0 /usr/bin/docker run -itd --name test 10.10.100.228/base/ubuntu:18.04 bash
33.290  0     networkctl       3761   573      0 /usr/bin/networkctl list --no-pager --no-legend
33.300  0     networkctl       3763   573      0 /usr/bin/networkctl status --no-pager --no-legend -- veth15892c8
33.308  0     net-interface-h  3767   3766     0 /lib/open-iscsi/net-interface-handler start
33.309  0     net-interface-h  3768   3765     0 /lib/open-iscsi/net-interface-handler start
33.311  0     systemd-sysctl   3769   3766     0 /lib/systemd/systemd-sysctl --prefix=/net/ipv4/conf/veth15892c8 --prefix=/net/ipv4/neigh/veth15892c8 --prefix=/net/ipv6/conf/veth15892c8 --prefix=/net/ipv6/neigh/veth15892c8
33.312  0     systemd-sysctl   3770   3765     0 /lib/systemd/systemd-sysctl --prefix=/net/ipv4/conf/veth21bc2bd --prefix=/net/ipv4/neigh/veth21bc2bd --prefix=/net/ipv6/conf/veth21bc2bd --prefix=/net/ipv6/neigh/veth21bc2bd
33.325  0     chrony-onofflin  3783   573      0 /usr/lib/networkd-dispatcher/off.d/chrony-onoffline
33.327  0     chronyc          3784   3783     0 /usr/bin/chronyc onoffline
33.369  0     containerd-shim  3786   589      0 /usr/bin/containerd-shim-runc-v2 -namespace moby -address /run/containerd/containerd.sock -publish-binary /usr/bin/containerd -id 536d570e5061c62f01bec7f8d4d46410d2d535c8ca195b27aee4a0c25c5c0c72 start
33.479  0     containerd-shim  3793   3786     0 /usr/bin/containerd-shim-runc-v2 -namespace moby -id 536d570e5061c62f01bec7f8d4d46410d2d535c8ca195b27aee4a0c25c5c0c72 -address /run/containerd/containerd.sock
33.494  0     runc             3802   3793     0 /usr/bin/runc --root /var/run/docker/runtime-runc/moby --log /run/containerd/io.containerd.runtime.v2.task/moby/536d570e5061c62f01bec7f8d4d46410d2d535c8ca195b27aee4a0c25c5c0c72/log.json --log-format json --systemd-cgroup create --bundle /run/containerd/io.containerd.runtime.v2.task/moby/536d570e5061c62f01bec7f8d4d46410d2d535c8ca195b27aee4a0c25c5c0c72 --pid-file /run/containerd/io.containerd.runtime.v2.task/moby/536d570e5061c62f01bec7f8d4d46410d2d535c8ca195b27aee4a0c25c5c0c72/init.pid --console-socket /tmp/pty4187296817/pty.sock 536d570e5061c62f01bec7f8d4d46410d2d535c8ca195b27aee4a0c25c5c0c72
33.509  0     exe              3811   3802     0 /proc/self/exe init
33.559  0     exe              3819   3802     0 /proc/826/exe -exec-root=/var/run/docker 536d570e5061c62f01bec7f8d4d46410d2d535c8ca195b27aee4a0c25c5c0c72 62444a2d7a3a
33.650  0     exe              3827   826      0 /proc/self/exe /var/run/docker/netns/008853cad115 all false
33.757  0     net-interface-h  3835   3776     0 /lib/open-iscsi/net-interface-handler stop
33.775  0     networkctl       3836   573      0 /usr/bin/networkctl status --no-pager --no-legend -- docker0
33.797  0     chrony-onofflin  3837   573      0 /usr/lib/networkd-dispatcher/routable.d/chrony-onoffline
33.799  0     chronyc          3838   3837     0 /usr/bin/chronyc onoffline
33.817  0     runc             3841   3793     0 /usr/bin/runc --root /var/run/docker/runtime-runc/moby --log /run/containerd/io.containerd.runtime.v2.task/moby/536d570e5061c62f01bec7f8d4d46410d2d535c8ca195b27aee4a0c25c5c0c72/log.json --log-format json --systemd-cgroup start 536d570e5061c62f01bec7f8d4d46410d2d535c8ca195b27aee4a0c25c5c0c72
33.835  0     bash             3813   3793     0 /bin/bash
33.891  0     groups           3848   3847     0 /usr/bin/groups
33.903  0     dircolors        3850   3849     0 /usr/bin/dircolors -b

# 在另一个终端启动一个 docker 容器
$ docker run -itd --name test ubuntu:18.04 bash

exitsnoop

$ exitsnoop-bpfcc --help
usage: exitsnoop-bpfcc [-h] [-t] [--utc] [-p PID] [--label LABEL] [-x] [--per-thread]

Trace all process termination (exit, fatal signal)

options:
  -h, --help         show this help message and exit
  -t, --timestamp    include timestamp (local time default)
  --utc              include timestamp in UTC (-t implied)
  -p PID, --pid PID  trace this PID only
  --label LABEL      label each line
  -x, --failed       trace only fails, exclude exit(0)
  --per-thread       trace per thread termination

# 删除刚才创建的容器 docker rm -f test
$ exitsnoop-bpfcc
PCOMM            PID    PPID   TID    AGE(s)  EXIT_CODE 
runc             3905   3793   3905   0.01    0
bash             3813   3793   3813   544.27  signal 9 (KILL)
runc             3912   3793   3912   0.01    0
containerd-shim  3793   1      3793   544.36  0
runc             3928   3919   3928   0.01    0
containerd-shim  3919   589    3919   0.01    0
net-interface-h  3936   3934   3936   0.00    0
systemd-sysctl   3937   3934   3937   0.01    0
modprobe         3939   3703   3939   0.00    code 1
modprobe         3940   3703   3940   0.00    code 1
networkctl       3935   573    3935   0.04    0
net-interface-h  3945   3942   3945   0.00    0
net-interface-h  3946   3943   3946   0.00    0
networkctl       3944   573    3944   0.01    code 1
chronyc          3948   3947   3948   0.00    0
chrony-onofflin  3947   573    3947   0.00    0
networkctl       3950   573    3950   0.01    0
docker           3897   2907   3897   0.28    0
ls               3953   2907   3953   0.01    0
systemd-udevd    3942   420    3942   3.52    0
systemd-udevd    3941   420    3941   3.52    0
systemd-udevd    3934   420    3934   3.57    0
systemd-udevd    3943   420    3943   3.53    0

runqlat

# runqlat 用于统计运行队列延迟。在需求超过供给 CPU 资源处于饱和状态时，这个工具可以用了识别和量化问题的严重性
# runqlat 统计的信息是每个线程(任务)等待 CPU 的耗时
$ runqlat-bpfcc --help
usage: runqlat-bpfcc [-h] [-T] [-m] [-P] [--pidnss] [-L] [-p PID] [interval] [count]

Summarize run queue (scheduler) latency as a histogram

positional arguments:
  interval            output interval, in seconds
  count               number of outputs

options:
  -h, --help          show this help message and exit
  -T, --timestamp     include timestamp on output
  -m, --milliseconds  millisecond histogram
  -P, --pids          print a histogram per process ID
  --pidnss            print a histogram per PID namespace
  -L, --tids          print a histogram per thread ID
  -p PID, --pid PID   trace this PID only


# 10 1 表示每 10 秒输出一次，且仅输出一次
$ runqlat-bpfcc 10 1
     usecs               : count     distribution
         0 -> 1          : 1        |                                        |
         2 -> 3          : 1        |                                        |
         4 -> 7          : 3        |**                                      |
         8 -> 15         : 3        |**                                      |
        16 -> 31         : 9        |******                                  |
        32 -> 63         : 9        |******                                  |
        64 -> 127        : 42       |****************************            |   # 大多数线程需要等待 128 -> 255 微秒，在一个空闲的操作系统中显得不是很正常
       128 -> 255        : 58       |****************************************|   # 在一个空闲的操作系统中，线程等待的时间正常应该在 30 微秒以下
       256 -> 511        : 5        |***                                     |   # 不过这是在虚拟机之中执行的，受物理机影响较大


# 使用 sar 命令可以展示 CPU 利用率(-u) 和运行队列性能指标(-q)
$ sar -uq 1
Linux 5.15.0-48-generic (ebpf)  11/09/22        _x86_64_        (8 CPU)

16:36:43        CPU     %user     %nice   %system   %iowait    %steal     %idle
16:36:44        all      0.00      0.00      0.00      0.00      0.00    100.00

16:36:43      runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15   blocked
16:36:44            0       170      0.00      0.00      0.00         0

# 上面的 sar 命令输出表示 CPU 空闲时间 100%（%idle），平均队列长度为 0（runq-sz）

runqlen

# runqlen 用于采样 CPU 运行队列的长度信息，可以统计有多少线程正在等待运行，并以线性直方图的方式输出
$ runqlen-bpfcc --help
usage: runqlen-bpfcc [-h] [-T] [-O] [-C] [interval] [count]

Summarize scheduler run queue length as a histogram

positional arguments:
  interval         output interval, in seconds
  count            number of outputs

options:
  -h, --help       show this help message and exit
  -T, --timestamp  include timestamp on output
  -O, --runqocc    report run queue occupancy
  -C, --cpus       print output for each CPU separately


# 10 1 表示每 10 秒输出一次，且仅输出一次
$ runqlen 10 1
     runqlen       : count     distribution
        0          : 3954     |****************************************|  # 输出结果表示运行队列的长度为 0 (主要是这个机器上啥都没有)

# -C 参数可以列出每个 CPU 的运行队列长度
$ runqlen-bpfcc -C
cpu = 0
     runqlen       : count     distribution
        0          : 505      |****************************************|

cpu = 2
     runqlen       : count     distribution
        0          : 505      |****************************************|

cpu = 1
     runqlen       : count     distribution
        0          : 505      |****************************************|

cpu = 3
     runqlen       : count     distribution
        0          : 505      |****************************************|
....

runqlat 和 runqlen，运行队列延迟是直接的且按比例的影响系统性能，而运行队列长度却不一定。比如你在超市排队结账，那个指标对你来说更重要：是队伍的长度还是实际需要等待的时间呢？

runqslower

# runqslower 用于列出运行队列中等待延迟超过阈值(10000 微秒)的线程名，输出受影响的进程名和对应的延迟时长
$ runqslower-bpfcc --help
usage: runqslower-bpfcc [-h] [-p PID | -t TID] [min_us]

Trace high run queue latency

positional arguments:
  min_us             minimum run queue latency to trace, in us (default 10000)

options:
  -h, --help         show this help message and exit
  -p PID, --pid PID  trace this PID only
  -t TID, --tid TID  trace this TID only

$ runqslower-bpfcc
TIME     COMM             TID           LAT(us)
17:42:49 python3          4590            16345

cpudist

# cpudist 用于展示每次线程唤醒之后在 CPU 上执行的时长分布
$ cpudist-bpfcc --help
usage: cpudist-bpfcc [-h] [-O] [-T] [-m] [-P] [-L] [-p PID] [interval] [count]

Summarize on-CPU time per task as a histogram.

positional arguments:
  interval            output interval, in seconds
  count               number of outputs

options:
  -h, --help          show this help message and exit
  -O, --offcpu        measure off-CPU time
  -T, --timestamp     include timestamp on output
  -m, --milliseconds  millisecond histogram
  -P, --pids          print a histogram per process ID
  -L, --tids          print a histogram per thread ID
  -p PID, --pid PID   trace this PID only

# 这个输出展示了应用程序在 CPU 之上运行的时间
$ cpudist-bpfcc 10 1
     usecs               : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 1        |*                                       |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 1        |*                                       |
        64 -> 127        : 2        |**                                      |
       128 -> 255        : 11       |************                            |
       256 -> 511        : 26       |*****************************           |
       512 -> 1023       : 24       |***************************             |
      1024 -> 2047       : 5        |*****                                   |
      2048 -> 4095       : 5        |*****                                   |
      4096 -> 8191       : 9        |**********                              |
      8192 -> 16383      : 35       |****************************************|
     16384 -> 32767      : 29       |*********************************       |
     32768 -> 65535      : 25       |****************************            |
     65536 -> 131071     : 14       |****************                        |
    131072 -> 262143     : 15       |*****************                       |
    262144 -> 524287     : 11       |************                            |

# -m 以毫秒直方图展示 CPU 使用时长分布
$ cpudist-bpfcc 10 1 -m
     msecs               : count     distribution
         0 -> 1          : 77       |****************************************|
         2 -> 3          : 12       |******                                  |
         4 -> 7          : 6        |***                                     |
         8 -> 15         : 32       |****************                        |
        16 -> 31         : 22       |***********                             |
        32 -> 63         : 35       |******************                      |
        64 -> 127        : 28       |**************                          |
       128 -> 255        : 23       |***********                             |
       256 -> 511        : 3        |*                                       |

profile

# profile 用于定时采样调用栈信息并汇报调用栈出现频率，这是 BCC 工具中分析 CPU 占用信息最有用的工具之一，因为该工具可以同时记录几乎所有占用 CPU 的代码调用栈
$ profile-bpfcc --help
usage: profile-bpfcc [-h] [-p PID | -L TID] [-U | -K] [-F FREQUENCY | -c COUNT] [-d] [-a] [-I] [-f] [--stack-storage-size STACK_STORAGE_SIZE] [-C CPU] [--cgroupmap CGROUPMAP] [--mntnsmap MNTNSMAP] [duration]

Profile CPU stack traces at a timed interval

positional arguments:
  duration              duration of trace, in seconds

options:
  -h, --help            show this help message and exit
  -p PID, --pid PID     profile process with this PID only
  -L TID, --tid TID     profile thread with this TID only
  -U, --user-stacks-only
                        show stacks from user space only (no kernel space stacks)
  -K, --kernel-stacks-only
                        show stacks from kernel space only (no user space stacks)
  -F FREQUENCY, --frequency FREQUENCY
                        sample frequency, Hertz
  -c COUNT, --count COUNT
                        sample period, number of events
  -d, --delimited       insert delimiter between kernel/user stacks
  -a, --annotations     add _[k] annotations to kernel frames
  -I, --include-idle    include CPU idle stacks
  -f, --folded          output folded format, one line per stack (for flame graphs)
  --stack-storage-size STACK_STORAGE_SIZE
                        the number of unique stack traces that can be stored and displayed (default 16384)
  -C CPU, --cpu CPU     cpu number to run profile on
  --cgroupmap CGROUPMAP
                        trace cgroups in this BPF map only
  --mntnsmap MNTNSMAP   trace mount namespaces in this BPF map only

$ profile-bpfcc
    b'_copy_to_iter'
    b'_copy_to_iter'
    b'tty_read'
    b'new_sync_read'
    b'vfs_read'
    b'ksys_read'
    b'__x64_sys_read'
    b'do_syscall_64'
    b'entry_SYSCALL_64_after_hwframe'
    __GI___read
    -                sshd (2829)
        1

    b'filesystems_proc_show'
    b'filesystems_proc_show'
    b'seq_read_iter'
    b'proc_reg_read_iter'
    b'new_sync_read'
    b'vfs_read'
    b'ksys_read'
    b'__x64_sys_read'
    b'do_syscall_64'
    b'entry_SYSCALL_64_after_hwframe'
    [unknown]
    -                ls (4172)
        1

    b'tty_put_char'
    b'tty_put_char'
    b'do_output_char'
    b'n_tty_write'
    b'do_tty_write'
    b'file_tty_write.constprop.0'
    b'tty_write'
    b'new_sync_write'
    b'vfs_write'
    b'ksys_write'
    b'__x64_sys_write'
    b'do_syscall_64'
    b'entry_SYSCALL_64_after_hwframe'
    [unknown]
    -                cat (4175)
        1

 # 上面这段程序将调用栈信息按照函数作为一个列表输出，列表之后又一个"-"短横线以及进程名(PID)，输出的最后上调用栈信息的个数

offcputime

# offcputime 用于统计线程阻塞和脱离 CPU 运行的时间，同时输出调用栈信息。在 CPU 分析中，这个工具可以用来分析为什么线程没有在 CPU 上运行
$ offcputime-bpfcc --help
usage: offcputime-bpfcc [-h] [-p PID | -t TID | -u | -k] [-U | -K] [-d] [-f] [--stack-storage-size STACK_STORAGE_SIZE] [-m MIN_BLOCK_TIME] [-M MAX_BLOCK_TIME] [--state STATE] [duration]

Summarize off-CPU time by stack trace

positional arguments:
  duration              duration of trace, in seconds

options:
  -h, --help            show this help message and exit
  -p PID, --pid PID     trace this PID only
  -t TID, --tid TID     trace this TID only
  -u, --user-threads-only
                        user threads only (no kernel threads)
  -k, --kernel-threads-only
                        kernel threads only (no user threads)
  -U, --user-stacks-only
                        show stacks from user space only (no kernel space stacks)
  -K, --kernel-stacks-only
                        show stacks from kernel space only (no user space stacks)
  -d, --delimited       insert delimiter between kernel/user stacks
  -f, --folded          output folded format
  --stack-storage-size STACK_STORAGE_SIZE
                        the number of unique stack traces that can be stored and displayed (default 1024)
  -m MIN_BLOCK_TIME, --min-block-time MIN_BLOCK_TIME
                        the amount of time in microseconds over which we store traces (default 1)
  -M MAX_BLOCK_TIME, --max-block-time MAX_BLOCK_TIME
                        the amount of time in microseconds under which we store traces (default U64_MAX)
  --state STATE         filter on this thread state bitmask (eg, 2 == TASK_UNINTERRUPTIBLE) see include/linux/sched.h


$ offcputime-bpfcc
    finish_task_switch.isra.0
    schedule
    futex_wait_queue_me
    futex_wait
    do_futex
    __x64_sys_futex
    do_syscall_64
    entry_SYSCALL_64_after_hwframe
    runtime.futex.abi0
    runtime.notesleep
    runtime.stopm
    runtime.findrunnable
    runtime.schedule
    runtime.park_m
    runtime.mcall
    runtime.selectgo
    github.com/containerd/containerd/gc/scheduler.(*gcScheduler).run
    github.com/containerd/containerd/gc/scheduler.init.0.func1.1
    runtime.goexit.abi0
    -                containerd (652)
        2698831

    finish_task_switch.isra.0
    schedule
    worker_thread
    kthread
    ret_from_fork
    -                kworker/2:0 (3760)
        2720200

profile 和 offcputime 大多数情况下每秒输出几百个调用栈信息，通常都需要搭配 CPU 火焰图来进行分析

syscount

# syscount 用于统计系统中的系统调用数量，这个工具可以用来调查 CPU 占用时间长的问题
$ syscount-bpfcc --help
usage: syscount-bpfcc [-h] [-p PID] [-i INTERVAL] [-d DURATION] [-T TOP] [-x] [-e ERRNO] [-L] [-m] [-P] [-l]

Summarize syscall counts and latencies.

options:
  -h, --help            show this help message and exit
  -p PID, --pid PID     trace only this pid
  -i INTERVAL, --interval INTERVAL
                        print summary at this interval (seconds)
  -d DURATION, --duration DURATION
                        total duration of trace, in seconds
  -T TOP, --top TOP     print only the top syscalls by count or latency
  -x, --failures        trace only failed syscalls (return < 0)
  -e ERRNO, --errno ERRNO
                        trace only syscalls that return this error (numeric or EPERM, etc.)
  -L, --latency         collect syscall latency
  -m, --milliseconds    display latency in milliseconds (default: microseconds)
  -P, --process         count by process and not by syscall
  -l, --list            print list of recognized syscalls and exit

# 每秒输出一次系统调用信息
$ syscount -i 1
[19:40:58]
SYSCALL                   COUNT
clock_gettime                42
bpf                          19
epoll_pwait                  10
futex                         9
write                         6
rt_sigprocmask                4
ppoll                         4
nanosleep                     2
ioctl                         2
getpid                        2

argdist

# argdist 通过跟踪点或者内核函数来统计指定调用的参数和返回值
# 要使用跟踪点，还需要知道参数的名字，可以使用 tplist 工具的 -v 选项来输出
$ tplist-bpfcc -v syscalls:sys_enter_read
syscalls:sys_enter_read
    int __syscall_nr;
    unsigned int fd;
    char * buf;
    size_t count;

$ argdist-bpfcc --help
usage: argdist-bpfcc [-h] [-p PID] [-z STRING_SIZE] [-i INTERVAL] [-d DURATION] [-n COUNT] [-v] [-c] [-T TOP] [-H specifier] [-C specifier] [-I header]

Trace a function and display a summary of its parameter values.

options:
  -h, --help            show this help message and exit
  -p PID, --pid PID     id of the process to trace (optional)
  -z STRING_SIZE, --string-size STRING_SIZE
                        maximum string size to read from char* arguments
  -i INTERVAL, --interval INTERVAL
                        output interval, in seconds (default 1 second)
  -d DURATION, --duration DURATION
                        total duration of trace, in seconds
  -n COUNT, --number COUNT
                        number of outputs
  -v, --verbose         print resulting BPF program code before executing
  -c, --cumulative      do not clear histograms and freq counts at each interval
  -T TOP, --top TOP     number of top results to show (not applicable to histograms)
  -H specifier, --histogram specifier
                        probe specifier to capture histogram of (see examples below)
  -C specifier, --count specifier
                        probe specifier to capture count of (see examples below)
  -I header, --include header
                        additional header files to include in the BPF program as either full path, or relative to relative to current working directory, or relative to default kernel header search path

Probe specifier syntax:
        {p,r,t,u}:{[library],category}:function(signature)[:type[,type...]:expr[,expr...][:filter]][#label]
Where:
        p,r,t,u    -- probe at function entry, function exit, kernel
                      tracepoint, or USDT probe
                      in exit probes: can use $retval, $entry(param), $latency
        library    -- the library that contains the function
                      (leave empty for kernel functions)
        category   -- the category of the kernel tracepoint (e.g. net, sched)
        function   -- the function name to trace (or tracepoint name)
        signature  -- the function's parameters, as in the C header
        type       -- the type of the expression to collect (supports multiple)
        expr       -- the expression to collect (supports multiple)
        filter     -- the filter that is applied to collected values
        label      -- the label for this probe in the resulting output


# 以直方图的方式展示 read 调用请求读取的统计信息
$ argdist-bpfcc -H "t:syscalls:sys_enter_read():int:args->count"
[19:56:20]
     args->count         : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 0        |                                        |
        16 -> 31         : 0        |                                        |
        32 -> 63         : 0        |                                        |
        64 -> 127        : 0        |                                        |
       128 -> 255        : 0        |                                        |
       256 -> 511        : 0        |                                        |
       512 -> 1023       : 0        |                                        |
      1024 -> 2047       : 0        |                                        |
      2048 -> 4095       : 0        |                                        |
      4096 -> 8191       : 0        |                                        |
      8192 -> 16383      : 0        |                                        |
     16384 -> 32767      : 0        |                                        |
     32768 -> 65535      : 1        |****************************************|   # read 调用请求读取的数量集中在 32768 到 65535 字节的区间，一共调用了一次

# sys_enter_read，sys_exit_read enter 表示跟踪点的入口，exit 表示跟踪点的结束
# 以直方图的方式展示 read 调用实际读取的统计信息
$ argdist-bpfcc -H "t:syscalls:sys_exit_read():int:args->ret"
[20:02:06]
     args->ret           : count     distribution
         0 -> 1          : 0        |                                        |
         2 -> 3          : 0        |                                        |
         4 -> 7          : 0        |                                        |
         8 -> 15         : 1        |****************************************|

trace

$ trace-bpfcc --help
usage: trace-bpfcc [-h] [-b BUFFER_PAGES] [-p PID] [-L TID] [-v] [-Z STRING_SIZE] [-S] [-M MAX_EVENTS] [-t] [-u] [-T] [-C] [-c CGROUP_PATH] [-n NAME] [-f MSG_FILTER] [-B] [-s SYM_FILE_LIST] [-K] [-U] [-a] [-I header] probe [probe ...]

Attach to functions and print trace messages.

positional arguments:
  probe                 probe specifier (see examples)

options:
  -h, --help            show this help message and exit
  -b BUFFER_PAGES, --buffer-pages BUFFER_PAGES
                        number of pages to use for perf_events ring buffer (default: 64)
  -p PID, --pid PID     id of the process to trace (optional)
  -L TID, --tid TID     id of the thread to trace (optional)
  -v, --verbose         print resulting BPF program code before executing
  -Z STRING_SIZE, --string-size STRING_SIZE
                        maximum size to read from strings
  -S, --include-self    do not filter trace's own pid from the trace
  -M MAX_EVENTS, --max-events MAX_EVENTS
                        number of events to print before quitting
  -t, --timestamp       print timestamp column (offset from trace start)
  -u, --unix-timestamp  print UNIX timestamp instead of offset from trace start, requires -t
  -T, --time            print time column
  -C, --print_cpu       print CPU id
  -c CGROUP_PATH, --cgroup-path CGROUP_PATH
                        cgroup path
  -n NAME, --name NAME  only print process names containing this name
  -f MSG_FILTER, --msg-filter MSG_FILTER
                        only print the msg of event containing this string
  -B, --bin_cmp         allow to use STRCMP with binary values
  -s SYM_FILE_LIST, --sym_file_list SYM_FILE_LIST
                        coma separated list of symbol files to use for symbol resolution
  -K, --kernel-stack    output kernel stack trace
  -U, --user-stack      output user stack trace
  -a, --address         print virtual address in stacks
  -I header, --include header
                        additional header files to include in the BPF program as either full path, or relative to current working directory, or relative to default kernel header search path


# 使用方法和 argdist 类似，argdist 使用的是内核态中的统计计数，所有它可以用于哪些调用非常频繁的系统调用。trace 可以打印出每个事件，适合调查哪些调用不频繁的系统调用，可以显示出每个事件的时间戳和一些其它信息

argdist 和 trace 相对于 bpftrace 的简化版

funccount

# funccount 用于统计事件和函数调用频率
$ funccount-bpfcc --help
usage: funccount-bpfcc [-h] [-p PID] [-i INTERVAL] [-d DURATION] [-T] [-r] [-D] [-c CPU] pattern

Count functions, tracepoints, and USDT probes

positional arguments:
  pattern               search expression for events

options:
  -h, --help            show this help message and exit
  -p PID, --pid PID     trace this PID only
  -i INTERVAL, --interval INTERVAL
                        summary interval, seconds
  -d DURATION, --duration DURATION
                        total duration of trace, seconds
  -T, --timestamp       include timestamp on output
  -r, --regexp          use regular expressions. Default is "*" wildcards only.
  -D, --debug           print BPF program before starting (for debugging purposes)
  -c CPU, --cpu CPU     trace this CPU only

# 统计以 tcp_v4_ 开头的所有函数调用频率，每秒输出一次
$ funccount-bpfcc - i 1 'tcp_v4_*'
FUNC                                    COUNT
b'tcp_v4_send_check'                        4
b'tcp_v4_fill_cb'                           5
b'tcp_v4_do_rcv'                            5
b'tcp_v4_early_demux'                       6
b'tcp_v4_rcv'                               6
b'tcp_v4_inbound_md5_hash'                  6
Detaching...

softirqs

# softirqs 用于显示系统中软中断消耗的 CPU 时间
$ softirqs-bpfcc --help
usage: softirqs-bpfcc [-h] [-T] [-N] [-d] [interval] [count]

Summarize soft irq event time as histograms.

positional arguments:
  interval           output interval, in seconds
  count              number of outputs

options:
  -h, --help         show this help message and exit
  -T, --timestamp    include timestamp on output
  -N, --nanoseconds  output in nanoseconds
  -d, --dist         show distributions as histograms

# net_rx 软中断消耗了 1490 微秒
$ softirqs-bpfcc 10 1
SOFTIRQ          TOTAL_usecs
rcu                      112
sched                    639
timer                   1079
net_rx                  1490

hardirqs

# hardirqs 用于显示系统中硬中断消耗的 CPU 时间
$ hardirqs-bpfcc --help
usage: hardirqs-bpfcc [-h] [-T] [-N] [-C] [-d] [interval] [outputs]

Summarize hard irq event time as histograms

positional arguments:
  interval           output interval, in seconds
  outputs            number of outputs

options:
  -h, --help         show this help message and exit
  -T, --timestamp    include timestamp on output
  -N, --nanoseconds  output in nanoseconds
  -C, --count        show event counts instead of timing
  -d, --dist         show distributions as histograms

$ hardirqs-bpfcc 10 1
HARDIRQ                    TOTAL_usecs
eth0-Tx-Rx-4                     51483

llcstat

# llcstat 利用 PMC 来按进程输出最后一级缓存的命中率
$ llcstat-bpfcc --help
usage: llcstat-bpfcc [-h] [-c SAMPLE_PERIOD] [duration]

Summarize cache references and misses by PID

positional arguments:
  duration              Duration, in seconds, to run

options:
  -h, --help            show this help message and exit
  -c SAMPLE_PERIOD, --sample_period SAMPLE_PERIOD
                        Sample one in this many number of cache reference / miss events


$ llcstat-bpfcc
PID      NAME             CPU     REFERENCE         MISS    HIT%
0        swapper/15       3       1007300           1000    99.90%

CPU 火焰图

FlameGraph

# 下载 FlameGraph
# FlameGraph 是一个火焰图生成工具
# 这个工具是使用 perf 写的，在 linux 系统下通常可以直接运行
$ git clone https://github.com/brendangregg/FlameGraph.git
$ cd FlameGraph
$ ./flamegraph.pl --help
USAGE: ./flamegraph.pl [options] infile > outfile.svg

  --title TEXT     # change title text
  --subtitle TEXT  # second level title (optional)
  --width NUM      # width of image (default 1200)
  --height NUM     # height of each frame (default 16)
  --minwidth NUM   # omit smaller functions (default 0.1 pixels)
  --fonttype FONT  # font type (default "Verdana")
  --fontsize NUM   # font size (default 12)
  --countname TEXT # count type label (default "samples")
  --nametype TEXT  # name type label (default "Function:")
  --colors PALETTE # set color palette. choices are: hot (default), mem,
                   # io, wakeup, chain, java, js, perl, red, green, blue,
                   # aqua, yellow, purple, orange
  --bgcolors COLOR # set background colors. gradient choices are yellow
                   # (default), blue, green, grey; flat colors use "#rrggbb"
  --hash           # colors are keyed by function name hash
  --cp             # use consistent palette (palette.map)
  --reverse        # generate stack-reversed flame graph
  --inverted       # icicle graph
  --flamechart     # produce a flame chart (sort by time, do not merge stacks)
  --negate         # switch differential hues (blue<->red)
  --notes TEXT     # add notes comment in SVG (for debugging)
  --help           # this message

  eg,
  ./flamegraph.pl --title="Flame Graph: malloc()" trace.txt > graph.svg

# profile 命令使用 -f 参数进行折叠输出，以支持生成火焰图
# 折叠输出：整个调用栈以一整行输出，每个函数以分号分割
# -a 在输出中标记内核函数
$ profile-bpfcc -af 300 > out.stacks01

# 生成火焰图
$ ./flamegraph.pl < ./out.stacks01 > out.svg

阅读火焰图

Linux 性能优化和内核观察 - CPU 篇（一）

火焰图中的每一个方框是一个函数，方框的长度，代表了它的执行时间，所以越宽的函数，执行耗时越长
火焰图的楼层每高一层，就是更深一级的函数被调用，最顶层的函数，是叶子函数
火焰图是基于 stack 信息生成的 SVG 图片, 用来展示 CPU 的调用栈
y 轴表示调用栈，每一层都是一个函数，调用栈越深，火焰就越高，顶部就是正在执行的函数，下方都是它的父函数
x 轴表示抽样数，如果一个函数在 x 轴占据的宽度越宽，就表示它被抽到的次数多，即执行的时间长

注意：x 轴不代表时间，而是所有的调用栈合并后，按字母顺序排列的

火焰图是 SVG 图片，用浏览器打开可以与用户互动

Linux 性能优化和内核观察 - CPU 篇（一）

简介

CPU 基础知识

CPU 的运行模式

CPU 上下文切换

进程上下文切换

线程上下文切换

中断上下文切换

中断

平均负载

进程状态

CPU 性能分析和优化策略

性能分析策略

PMC

PMC 工作模式

PEBS

性能优化策略

传统工具

uptime

pidstat

mpstat

top

vmstat

工具输出单位解释

bpf 工具

execsnoop

exitsnoop

runqlat

runqlen

runqslower

cpudist

profile

offcputime

syscount

argdist

trace

funccount

softirqs

hardirqs

llcstat

CPU 火焰图

FlameGraph

阅读火焰图

相关文章