Linux hung detection mechanisms

The following messages appeared in dmesg:

[424948.577401] ixgbe ::00.0 eth4: Fake Tx hang detected with timeout of  seconds
[424949.535143] ixgbe ::00.1 eth5: Fake Tx hang detected with timeout of  seconds
[424955.536045] ixgbe :af:00.0 eth6: Fake Tx hang detected with timeout of  seconds
[424955.567988] ixgbe :af:00.1 eth7: Fake Tx hang detected with timeout of  seconds
[424957.579250] ixgbe ::00.1 eth1: Fake Tx hang detected with timeout of  seconds
[424957.579285] ixgbe :3b:00.1 eth3: Fake Tx hang detected with timeout of  seconds
[424958.568923] ixgbe ::00.0 eth4: Fake Tx hang detected with timeout of  seconds
[424959.526676] ixgbe ::00.1 eth5: Fake Tx hang detected with timeout of  seconds
[424975.489166] ixgbe :af:00.0 eth6: Fake Tx hang detected with timeout of  seconds
[424975.553019] ixgbe :af:00.1 eth7: Fake Tx hang detected with timeout of  seconds
[424977.532376] ixgbe ::00.1 eth1: Fake Tx hang detected with timeout of  seconds
[424977.532409] ixgbe :3b:00.1 eth3: Fake Tx hang detected with timeout of  seconds

The function that handles the timeout (the fm10k driver's version is shown here):

static void fm10k_tx_timeout(struct net_device *netdev)
{
    struct fm10k_intfc *interface = netdev_priv(netdev);
    bool real_tx_hang = false;
    int i;

#define TX_TIMEO_LIMIT 16000
    for (i = 0; i < interface->num_tx_queues; i++) {
        struct fm10k_ring *tx_ring = interface->tx_ring[i];

        if (check_for_tx_hang(tx_ring) && fm10k_check_tx_hang(tx_ring))
            real_tx_hang = true;
    }

    if (real_tx_hang) {
        fm10k_tx_timeout_reset(interface);
    } else {
        netif_info(interface, drv, netdev,
               "Fake Tx hang detected with timeout of %d seconds\n",
               netdev->watchdog_timeo / HZ);

        /* fake Tx hang - increase the kernel timeout */
        /* The timeout doubles on every fake hang until it reaches
         * TX_TIMEO_LIMIT (about 16 s); on the system in this article
         * it went 5 -> 10 -> 20 seconds.
         */
        if (netdev->watchdog_timeo < TX_TIMEO_LIMIT)
            netdev->watchdog_timeo *= 2;
    }
}

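The key function for deciding whether the NIC is hung is fm10k_tx_timeout. If the condition check_for_tx_hang(tx_ring) && fm10k_check_tx_hang(tx_ring) holds for any TX queue, the event is treated as a real hang; otherwise it is a fake hang.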
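The fake-hang branch is a simple doubling backoff on the watchdog timeout. A minimal user-space sketch of just that step follows; next_watchdog_timeo is a hypothetical helper written for illustration, and the values in the note assume HZ=1000 with a starting timeout of 5 seconds, which matches the 5, 10, 20 progression seen in this article.

```c
#define TX_TIMEO_LIMIT 16000  /* same constant as in the driver code, in jiffies */

/*
 * Hypothetical helper: the watchdog timeout (in jiffies) after one more
 * fake hang. It doubles only while it is still below the limit, so the
 * final value may overshoot the limit once and then stays put.
 */
static unsigned long next_watchdog_timeo(unsigned long timeo)
{
    if (timeo < TX_TIMEO_LIMIT)
        timeo *= 2;
    return timeo;
}
```

Under those assumptions, starting from 5000 jiffies the helper yields 10000, then 20000, and after that the value no longer changes.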

check_for_tx_hang(tx_ring) is normally satisfied, because the flag is usually set at probe time. The code of fm10k_check_tx_hang is as follows:

bool fm10k_check_tx_hang(struct fm10k_ring *tx_ring)
{
    u32 tx_done = fm10k_get_tx_completed(tx_ring);
    u32 tx_done_old = tx_ring->tx_stats.tx_done_old;
    u32 tx_pending = fm10k_get_tx_pending(tx_ring, true);

    clear_check_for_tx_hang(tx_ring);

    /* Check for a hung queue, but be thorough. This verifies
     * that a transmit has been completed since the previous
     * check AND there is at least one packet pending. By
     * requiring this to fail twice we avoid races with
     * clearing the ARMED bit and conditions where we
     * run the check_tx_hang logic with a transmit completion
     * pending but without time to complete it yet.
     */
    /* Not hung: nothing is pending, or the completed count has
     * changed since the previous check (the queue made progress).
     */
    if (!tx_pending || (tx_done_old != tx_done)) {
        /* update completed stats and continue */
        tx_ring->tx_stats.tx_done_old = tx_done;
        /* reset the countdown */
        clear_bit(__FM10K_HANG_CHECK_ARMED, &tx_ring->state);

        return false;
    }

    /* make sure it is true for two checks in a row */
    /* test_and_set_bit returns the previous ARMED value, so only the
     * second consecutive failing check returns true.
     */
    return test_and_set_bit(__FM10K_HANG_CHECK_ARMED, &tx_ring->state);
}
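The "two checks in a row" rule is easier to see in a small user-space model. The sketch below is not driver code: struct ring_model and ring_check_tx_hang are hypothetical names, the completed and pending counts are passed in as plain arguments, and a bool stands in for the __FM10K_HANG_CHECK_ARMED bit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-ring state the driver keeps. */
struct ring_model {
    uint32_t tx_done_old;  /* completed count seen at the previous check */
    bool armed;            /* models the __FM10K_HANG_CHECK_ARMED bit */
};

/*
 * tx_done:    packets completed so far
 * tx_pending: packets queued but not yet completed
 * Returns true only when the queue has pending work and made no
 * progress on two consecutive checks.
 */
static bool ring_check_tx_hang(struct ring_model *r,
                               uint32_t tx_done, uint32_t tx_pending)
{
    if (!tx_pending || r->tx_done_old != tx_done) {
        /* Idle or making progress: remember the count, disarm. */
        r->tx_done_old = tx_done;
        r->armed = false;
        return false;
    }

    /* Same effect as test_and_set_bit: report the old value, then arm. */
    bool was_armed = r->armed;
    r->armed = true;
    return was_armed;
}
```

In this model the first stalled check only arms the detector, the second one reports a hang, and any completed packet in between resets the countdown.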

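NIC hang messages are usually accompanied by CPU soft lockup messages. If a CPU is soft locked and the TX queues are bound to CPUs, the TX queue that belongs to that CPU ends up with no pending packets, which is what triggers the hang report. If the queues are not bound, a TX queue may be used by several CPUs; if a hang still shows up in that case, check whether the lock of the corresponding TX queue was taken and never released.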

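An interim summary:

The kernel detects hangs for different objects and at different levels.

1. The NIC hang discussed in this article targets a particular device, at the granularity of a NIC queue. The principle is to check whether pending TX packets have gone unprocessed past a timeout. It relies on the NIC device itself working normally.

2. There is also a mechanism that detects a hung task: the khungtaskd kernel thread in hung_task.c. It walks the task list and, for each task in the uninterruptible (D) state, checks whether the task's context switch count has changed since the previous scan. If a task has stayed in that state longer than a threshold without being scheduled, it is considered hung. This works at the level of a single task and relies on the scheduler.

3. Soft lockup detection checks whether task scheduling on a given CPU is working. It is done by the watchdog kernel thread, which is a real-time thread: if it did not get scheduled between two consecutive checks, something is wrong with scheduling on that CPU. The checks are driven by an hrtimer, whose hard interrupt handler wakes the thread up and does the comparison. The object is a single CPU (down to the hyper-thread). It relies on hard interrupts; a soft lockup is reported when code runs with preemption disabled for too long without yielding the CPU.

4. Hard lockup detection relies on NMIs. It reuses the hrtimer from item 3: every time that hrtimer fires, it increments hrtimer_interrupts for the current CPU. If two consecutive NMI callbacks see that this counter has not increased, the CPU is considered hard locked; in other words, a hard lockup is reported when interrupts stay disabled for too long.

The details follow: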

[root@centos7 WakeTest]# ps -ef |grep -i khungtaskd |grep -v grep
root                Sep04 ?       :: [khungtaskd]

This thread checks whether tasks in the D state have gone unscheduled for a long time.

The thread is named khungtaskd; take care not to confuse it with watchdog:

static int __init hung_task_init(void)
{
    atomic_notifier_chain_register(&panic_notifier_list, &panic_block);
    /* The thread function is called watchdog, but the thread is named khungtaskd */
    watchdog_task = kthread_run(watchdog, NULL, "khungtaskd");

    return 0;
}
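The per-task check that khungtaskd performs can be modelled in a few lines. This is a simplified sketch rather than the kernel's code: struct task_model and task_looks_hung are hypothetical names, the scan is assumed to run once per timeout period, and special cases such as tasks that have never been scheduled are left out.

```c
#include <stdbool.h>

/* Hypothetical, simplified view of the fields the scan looks at. */
struct task_model {
    bool uninterruptible;             /* task is in the D state */
    unsigned long switch_count;       /* voluntary + involuntary context switches */
    unsigned long last_switch_count;  /* value recorded by the previous scan */
};

/*
 * One scan of one task, assumed to run once per timeout period.
 * A D-state task whose context switch count has not moved since the
 * previous scan has not been scheduled for a whole period.
 */
static bool task_looks_hung(struct task_model *t)
{
    if (!t->uninterruptible)
        return false;                 /* only D-state tasks are examined */

    if (t->switch_count != t->last_switch_count) {
        /* The task ran since the last scan: record and move on. */
        t->last_switch_count = t->switch_count;
        return false;
    }

    return true;
}
```

The design relies on the scheduler keeping the switch counters up to date, which is why this detector says nothing useful when scheduling itself is broken on a CPU.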

The other kernel thread is the one actually named watchdog:

ps |grep -i watchdog
                 ffff880c11980080  IN   0.0               [watchdog/]
                ffff880c11a2b580  IN   0.0               [watchdog/]
                ffff880c11a56a80  IN   0.0               [watchdog/]
                ffff880c11a62080  IN   0.0               [watchdog/]
                ffff880c11a9f580  IN   0.0               [watchdog/]
                ffff880c11aa8a80  IN   0.0               [watchdog/]
                ffff880c11ab4080  IN   0.0               [watchdog/]
                ffff880c11acd580  IN   0.0               [watchdog/]
                ffff880c11ad6a80  IN   0.0               [watchdog/]
                ffff880c11b04080  IN   0.0               [watchdog/]
               ffff880c11b45580  IN   0.0               [watchdog/]
               ffff880c11b4ea80  IN   0.0               [watchdog/]
               ffff880c11b5e080  IN   0.0               [watchdog/]
               ffff880c11b77580  IN   0.0               [watchdog/]
               ffff880c11b80a80  IN   0.0               [watchdog/]
               ffff880c11baa080  IN   0.0               [watchdog/]

These threads are created in watchdog.c, one per CPU:

static struct smp_hotplug_thread watchdog_threads = {
    .store            = &softlockup_watchdog,
    .thread_should_run    = watchdog_should_run,
    .thread_fn        = watchdog,
    .thread_comm        = "watchdog/%u",
    .setup            = watchdog_enable,
    .cleanup        = watchdog_cleanup,
    .park            = watchdog_disable,
    .unpark            = watchdog_enable,
};

Some of the functions and callbacks involved in enabling it:

/*
 * common function for watchdog, nmi_watchdog and soft_watchdog parameter
 *
 * caller             | table->data points to | 'which' contains the flag(s)
 * -------------------|-----------------------|-----------------------------
 * proc_watchdog      | watchdog_user_enabled | NMI_WATCHDOG_ENABLED or'ed
 *                    |                       | with SOFT_WATCHDOG_ENABLED
 * -------------------|-----------------------|-----------------------------
 * proc_nmi_watchdog  | nmi_watchdog_enabled  | NMI_WATCHDOG_ENABLED
 * -------------------|-----------------------|-----------------------------
 * proc_soft_watchdog | soft_watchdog_enabled | SOFT_WATCHDOG_ENABLED
 */
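The table says that each of the three proc handlers works on its own subset of flag bits. A minimal model of that idea is sketched below. The bit positions and the helper names watchdog_write and watchdog_read are assumptions chosen for illustration; only the flag names come from the kernel comment.

```c
#include <stdbool.h>

/* Bit positions are assumed for this sketch. */
#define NMI_WATCHDOG_ENABLED   (1u << 0)
#define SOFT_WATCHDOG_ENABLED  (1u << 1)

/*
 * Model of a write to one knob: set or clear exactly the bits in
 * 'which' and leave the other detector's bit untouched.
 */
static unsigned int watchdog_write(unsigned int enabled,
                                   unsigned int which, bool on)
{
    return on ? (enabled | which) : (enabled & ~which);
}

/* Model of a read: the knob reads as 1 if any of its bits is set. */
static int watchdog_read(unsigned int enabled, unsigned int which)
{
    return (enabled & which) != 0;
}
```

With this model, clearing SOFT_WATCHDOG_ENABLED alone leaves the NMI detector on, while a write through the combined mask switches both detectors together.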

To stop these kernel threads, and to start them again, write to /proc/sys/kernel/watchdog:

[root@centos7 WakeTest]# echo 0 > /proc/sys/kernel/watchdog
[root@centos7 WakeTest]# ps -ef |grep -w watchdog |grep -v grep
[root@centos7 WakeTest]#
[root@centos7 WakeTest]#
[root@centos7 WakeTest]# echo 1 > /proc/sys/kernel/watchdog
[root@centos7 WakeTest]# ps -ef |grep -w watchdog |grep -v grep
root             : ?        :: [watchdog/]
root             : ?        :: [watchdog/]
root             : ?        :: [watchdog/]
root             : ?        :: [watchdog/]
root             : ?        :: [watchdog/]
root             : ?        :: [watchdog/]
root             : ?        :: [watchdog/]
root             : ?        :: [watchdog/]
root             : ?        :: [watchdog/]
root             : ?        :: [watchdog/]
root             : ?        :: [watchdog/]
root             : ?        :: [watchdog/]
root             : ?        :: [watchdog/]
root             : ?        :: [watchdog/]
root             : ?        :: [watchdog/]
root             : ?        :: [watchdog/]

They are all real-time threads:

top - :: up :,  users,  load average: 41.97, 45.49, 48.37
Tasks:    total,    running,    sleeping,    stopped,    zombie
%Cpu(s):  7.1 us, 14.7 sy,  0.0 ni, 54.7 id,  4.2 wa,  2.5 hi, 16.8 si,  0.0 st, 57.3 id_exact,  2.9 hi_exact, 20.0 irq_exact
KiB Mem : +total,  free, +used, +buff/cache
KiB Swap:         total,         free,         used. +avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    root      rt                       S   0.0  0.0   :00.10 watchdog/3

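The watchdog detection works as follows.

The watchdog function stores the current timestamp, at one-second granularity, in its per-CPU variable watchdog_touch_ts. A separate hrtimer compares the current time with watchdog_touch_ts and treats the CPU as abnormal if the difference exceeds the soft lockup threshold. The same hrtimer is also responsible for waking up the watchdog thread.

The hrtimer callback uses is_softlockup to decide whether a soft lockup has occurred. After being woken, the watchdog thread should get scheduled and refresh the timestamp. If the timestamp was not refreshed, the thread did not get to run. Because the watchdog kernel thread is a real-time thread bound to one CPU, a real-time thread that cannot be scheduled means that CPU is soft locked.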

/* touch_ts is the timestamp last written by the watchdog thread */
static int is_softlockup(unsigned long touch_ts)
{
    unsigned long now = get_timestamp();

    if ((watchdog_enabled & SOFT_WATCHDOG_ENABLED) && watchdog_thresh){
        /* Warn about unreasonable delays. */
        if (time_after(now, touch_ts + get_softlockup_thresh()))
            return now - touch_ts;
    }
    return 0;
}
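The interplay between the watchdog thread and the timer can be modelled in user space. In this sketch, struct cpu_model, watchdog_thread_ran and timer_check are hypothetical names, time is a plain counter of seconds, and the threshold is passed in by the caller; counter wraparound, which time_after handles in the kernel, is ignored.

```c
/* Hypothetical per-CPU state: the last time the watchdog thread ran. */
struct cpu_model {
    unsigned long touch_ts;  /* seconds */
};

/* What the watchdog thread does each time it gets the CPU. */
static void watchdog_thread_ran(struct cpu_model *c, unsigned long now)
{
    c->touch_ts = now;
}

/*
 * What the timer callback checks: returns how long the thread has
 * been starved (in seconds) once that exceeds thresh, otherwise 0.
 */
static unsigned long timer_check(const struct cpu_model *c,
                                 unsigned long now, unsigned long thresh)
{
    if (now > c->touch_ts + thresh)
        return now - c->touch_ts;
    return 0;
}
```

As long as the thread keeps running between timer ticks, timer_check returns 0. Once the thread is starved past the threshold, the return value is the stall duration, which corresponds to the number of seconds a soft lockup report would mention.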

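As you can see, this mechanism clearly depends on hard interrupts arriving. If a CPU keeps hard interrupts disabled for a long time, the hrtimer never fires and the watchdog check cannot run, so that case needs its own detection. This is where the hard lockup detector comes in.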

static bool is_hardlockup(void)
{
    unsigned long hrint = __this_cpu_read(hrtimer_interrupts);

    if (__this_cpu_read(hrtimer_interrupts_saved) == hrint)
        return true;

    __this_cpu_write(hrtimer_interrupts_saved, hrint);
    return false;
}
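The same counter comparison can be exercised outside the kernel. The sketch below is a single-CPU model with hypothetical names (struct hl_model, timer_fired, nmi_check); ordinary struct fields stand in for the per-CPU variables, and the two functions are called by hand instead of by a timer and an NMI.

```c
#include <stdbool.h>

/* Hypothetical single-CPU stand-in for the two per-CPU counters. */
struct hl_model {
    unsigned long hrtimer_interrupts;        /* bumped by the timer callback */
    unsigned long hrtimer_interrupts_saved;  /* value seen by the previous NMI check */
};

/* Models the hrtimer callback, which needs interrupts enabled to run. */
static void timer_fired(struct hl_model *m)
{
    m->hrtimer_interrupts++;
}

/* Models the NMI callback, which runs even with interrupts disabled. */
static bool nmi_check(struct hl_model *m)
{
    if (m->hrtimer_interrupts_saved == m->hrtimer_interrupts)
        return true;   /* no timer interrupt since the last NMI check */

    m->hrtimer_interrupts_saved = m->hrtimer_interrupts;
    return false;
}
```

The point of the design is that the checker runs from a context that disabled interrupts cannot block, while the counter it inspects can only advance when interrupts are being serviced.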
