生产问题（十四）K8S抢占CPU导致数据库链接池打爆

一、引言

线上一天出现了两次数据库连接失败的大量报错，一开始以为是数据库的问题，但是想了想如果是数据库的问题，应该会有大量的应用问题

具体分析之后，发现其实是容器cpu出现了Throttled，导致大量线程阻塞

二、分析

1、堆栈

既然出现了报错，又没有发布，先看看堆栈里面报错的地方

追踪到最底层的堆栈，显示的是数据库链接池耗尽[size:100; busy:1; idle:0; lastwait:44]，100个链接只有4个在处理，没有一个空闲的，那么很明显其他的链接所在的线程都被阻塞了

2、源码

虽然说不太可能，但是还是要看看源码。大多数同学都觉得框架里面是不可能有错的，所以也不会去看源码，基本上都是去其他方向分析，这个在99%的情况下没有问题，但是作者不止一次遇到框架本身是有问题的，只是遇到了一些极端情况。比如：

Mybatis拼接sql出错及源码解析_-****博客

生产问题（九）Mysql8.0 ddl问题_mysql8 default_authentication_plugin-****博客

生产问题（十一）日志JavaAgent-NoClassDefFoundError_-javaagent 找不到报错-****博客

生产问题（十三）谷歌Protobuf误修改系统全局时区-****博客

这些问题，如果不往框架的方向上走是找不到问题原因的，那么言归正传，我们看源码，这是tomcat进行线程链接的地方，超时的话就会报

throw new PoolExhaustedException("[" + Thread.currentThread().getName()+"] " +
    "Timeout: Pool empty. Unable to fetch a connection in " + (maxWait / 1000) +
    " seconds, none available[size:"+size.get() +"; busy:"+busy.size()+"; idle:"+idle.size()+"; lastwait:"+timetowait+"].");

和lastwait:44的现象也对的上，给我们进一步指明了方向，有很多占据数据库链接的线程不做事，其他很多线程在等

private PooledConnection borrowConnection(int wait, String username, String password) throws SQLException {

        if (isClosed()) {
            throw new SQLException("Connection pool closed.");
        } //end if

        //get the current time stamp
        long now = System.currentTimeMillis();
        //see if there is one available immediately
        PooledConnection con = idle.poll();

        while (true) {
            if (con!=null) {
                //configure the connection and return it
                PooledConnection result = borrowConnection(now, con, username, password);
                if (result!=null) return result;
            }

            //if we get here, see if we need to create one
            //this is not 100% accurate since it doesn't use a shared
            //atomic variable - a connection can become idle while we are creating
            //a new connection
            if (size.get() < getPoolProperties().getMaxActive()) {
                //atomic duplicate check
                if (size.addAndGet(1) > getPoolProperties().getMaxActive()) {
                    //if we got here, two threads passed through the first if
                    size.decrementAndGet();
                } else {
                    //create a connection, we're below the limit
                    return createConnection(now, con, username, password);
                }
            } //end if

            //calculate wait time for this iteration
            long maxWait = wait;
            //if the passed in wait time is -1, means we should use the pool property value
            if (wait==-1) {
                maxWait = (getPoolProperties().getMaxWait()<=0)?Long.MAX_VALUE:getPoolProperties().getMaxWait();
            }

            long timetowait = Math.max(0, maxWait - (System.currentTimeMillis() - now));
            waitcount.incrementAndGet();
            try {
                //retrieve an existing connection
                con = idle.poll(timetowait, TimeUnit.MILLISECONDS);
            } catch (InterruptedException ex) {
                if (getPoolProperties().getPropagateInterruptState()) {
                    Thread.currentThread().interrupt();
                }
                SQLException sx = new SQLException("Pool wait interrupted.");
                sx.initCause(ex);
                throw sx;
            } finally {
                waitcount.decrementAndGet();
            }
            if (maxWait==0 && con == null) { //no wait, return one if we have one
                if (jmxPool!=null) {
                    jmxPool.notify(org.apache.tomcat.jdbc.pool.jmx.ConnectionPool.POOL_EMPTY, "Pool empty - no wait.");
                }
                throw new PoolExhaustedException("[" + Thread.currentThread().getName()+"] " +
                        "NoWait: Pool empty. Unable to fetch a connection, none available["+busy.size()+" in use].");
            }
            //we didn't get a connection, lets see if we timed out
            if (con == null) {
                if ((System.currentTimeMillis() - now) >= maxWait) {
                    if (jmxPool!=null) {
                        jmxPool.notify(org.apache.tomcat.jdbc.pool.jmx.ConnectionPool.POOL_EMPTY, "Pool empty - timeout.");
                    }
                    throw new PoolExhaustedException("[" + Thread.currentThread().getName()+"] " +
                        "Timeout: Pool empty. Unable to fetch a connection in " + (maxWait / 1000) +
                        " seconds, none available[size:"+size.get() +"; busy:"+busy.size()+"; idle:"+idle.size()+"; lastwait:"+timetowait+"].");
                } else {
                    //no timeout, lets try again
                    continue;
                }
            }
        } //while
    }

3、监控

既然分析出有很多占据数据库链接的线程不做事，其他线程在等，那么在cpu、线程方面就应该有所体现，接下来看监控。

监控显示对应问题的时间点发生了CPU Throttled，他会限制应用程序或进程的CPU使用率，一度达到十几秒，等待io的task线程激增，很明显这就是具体的原因，接下来需要找寻的是，为什么会产生Throttled。

jvm的内存和线程激增也侧面验证了cpu阻塞，大量的线程处理不了任务

三、cpu抢占

通过上述分析，接下来需要找寻为什么会产生Throttled？

一般来说有两种原因：

1、瞬时流量激增

这种一般就是入口流量突然增加，已经分配的cpu处理不过来，需要按照K8S的动态分配继续申请cpu，但是来不及

看看请求数量，整体变化不大，但是监控采集是一分钟一次，所以这个方向不能被排除，但是没有其他方面的证据协助分析

2、宿主机cpu被抢占

这段监控是下午出问题时候的，上午找不到是哪个宿主机了

可以看到这个宿主机不太对，他这个0.7是每个cpu维度的，对应时间从0.3到0.7，相当于原来用三个，突然用到了7个，所以宿主机争抢的概率还是大一些

再看看具体每个容器，明显黄色容器的cpu有一段飙升，倒数第三个是出问题的机器，上面的几个容器都有增加，一方面也代表了cpu被其他容器抢占的可能

四、解决

由于瞬时流量激增、宿主机cpu被抢占两个方向的解决方案是完全不同的，所以我们需要先快速处理。

1、升配、增加机器数量

这样瞬时流量就会大幅减小，由于流量小了，cpu被抢占也就不会有太大影响

2、设置cpu set模式

K8S给予linux的cgroups默认的一般都是share模式，就是给你分这么多cpu但是后续可以被抢占。

设置set模式，就是不给别人抢占，当然你忙的时候也抢不了别人

作者认为对于核心应用来说隔离是必须的，如果他隔离的情况下处理不了就应该升配或者扩容，而不是抢别人的

3、增加秒级监控

一分钟一次的监控属实是坑，基本上辅助不了确定的方向，只能是结合监控数据去猜一下，之前出现过很多问题也因为监控确定不了具体原因，只是锁定几个方向

但是怎么增加，增加有没有影响都是件麻烦事，因为框架组那边很直白的说了一分钟一次对于目前的监控都有压力，秒级的要自己暴露出prometheus规范的指标，通过加env的方式暴露出来，框架的监控agent会去采集。

那他为什么不去全部都加一下呢，还是内部有一些资源或者风险存在的，这就很顶了。

五、总结

这一次的问题被锁定在了瞬时流量激增、宿主机cpu被抢占两个方向，宿主机cpu被抢占概率大一些，监控分钟级别的采集不能确定唯一方向，只能沿着两个方向一起解决。

这也是个好事，起码知道了监控的机制，后续需要改进的方向。

秒客网