利用tcpcopy,发现nginx一重大bug

时间:2021-10-07 03:10:12

上周运维组利用tcpcopy压力测试nginx hmux模块,压力达到2000多请求每秒的时候,发现nginx 每个进程 cpu100%。

利用gcore产生core文件,发现nginx处于如下死循环中:


(gdb) bt
#0  0x00000000004690d8 in ngx_http_upstream_get_round_robin_peer (pc=0x8a51bd8, data=0x899fc88)
    at src/http/ngx_http_upstream_round_robin.c:482
#1  0x00000000004a45a0 in ngx_http_upstream_get_keepalive_peer (pc=0x8a51bd8, data=0x899fc58)
    at ../ngx_http_upstream_keepalive-c6396fef9295/ngx_http_upstream_keepalive_module.c:227
#2  0x0000000000428c53 in ngx_event_connect_peer (pc=0x8a51bd8) at src/event/ngx_event_connect.c:24
#3  0x0000000000460465 in ngx_http_upstream_connect (r=0x8a4b770, u=0x8a51bc8) at src/http/ngx_http_upstream.c:1081
#4  0x000000000046428a in ngx_http_upstream_next (r=0x8a4b770, u=0x8a51bc8, ft_type=2) at src/http/ngx_http_upstream.c:2892
#5  0x0000000000461483 in ngx_http_upstream_process_header (r=0x8a4b770, u=0x8a51bc8) at src/http/ngx_http_upstream.c:1525
#6  0x000000000046104f in ngx_http_upstream_send_request (r=0x8a4b770, u=0x8a51bc8) at src/http/ngx_http_upstream.c:1392
#7  0x00000000004608a7 in ngx_http_upstream_connect (r=0x8a4b770, u=0x8a51bc8) at src/http/ngx_http_upstream.c:1182
#8  0x000000000045f634 in ngx_http_upstream_init_request (r=0x8a4b770) at src/http/ngx_http_upstream.c:620
#9  0x000000000045ee79 in ngx_http_upstream_init (r=0x8a4b770) at src/http/ngx_http_upstream.c:429
#10 0x0000000000455dd4 in ngx_http_read_client_request_body (r=0x8a4b770, post_handler=0x45ed95 <ngx_http_upstream_init>)
    at src/http/ngx_http_request_body.c:153
#11 0x000000000049e3bd in ngx_hmux_handler (r=0x8a4b770) at ../hmux/ngx_hmux_module.c:673
#12 0x000000000043eb06 in ngx_http_core_content_phase (r=0x8a4b770, ph=0x89bc6c0) at src/http/ngx_http_core_module.c:1350
#13 0x000000000043d75c in ngx_http_core_run_phases (r=0x8a4b770) at src/http/ngx_http_core_module.c:852
#14 0x000000000043d6d8 in ngx_http_handler (r=0x8a4b770) at src/http/ngx_http_core_module.c:835
#15 0x000000000044af21 in ngx_http_process_request (r=0x8a4b770) at src/http/ngx_http_request.c:1641
#16 0x00000000004498a5 in ngx_http_process_request_headers (rev=0x2af3569ee180) at src/http/ngx_http_request.c:1084
#17 0x0000000000449092 in ngx_http_process_request_line (rev=0x2af3569ee180) at src/http/ngx_http_request.c:889
#18 0x00000000004483bd in ngx_http_init_request (rev=0x2af3569ee180) at src/http/ngx_http_request.c:514
#19 0x0000000000433a6f in ngx_epoll_process_events (cycle=0x8995420, timer=79, flags=1)
    at src/event/modules/ngx_epoll_module.c:642
#20 0x0000000000425079 in ngx_process_events_and_timers (cycle=0x8995420) at src/event/ngx_event.c:245
#21 0x00000000004318ae in ngx_worker_process_cycle (cycle=0x8995420, data=0x0) at src/os/unix/ngx_process_cycle.c:795
---Type <return> to continue, or q <return> to quit---
#22 0x000000000042e621 in ngx_spawn_process (cycle=0x8995420, proc=0x4316df <ngx_worker_process_cycle>, data=0x0,
    name=0x571ef6 "worker process", respawn=-3) at src/os/unix/ngx_process.c:196
#23 0x0000000000430791 in ngx_start_worker_processes (cycle=0x8995420, n=8, type=-3) at src/os/unix/ngx_process_cycle.c:355
#24 0x000000000042fe9a in ngx_master_process_cycle (cycle=0x8995420) at src/os/unix/ngx_process_cycle.c:136
#25 0x0000000000403bf3 in main (argc=1, argv=0x7fff5446ef18) at src/core/nginx.c:401



这个bug据运维人员一查,很失望,在nginx 1.0.7解决了此问题,我们利用的是0.8系列。


Fixing cpu hog with all upstream servers marked "down".

The following configuration causes nginx to hog cpu due to infinite loop
in ngx_http_upstream_get_peer():

    upstream backend {
        server 127.0.0.1:8080 down;
        server 127.0.0.1:8080 down;
    }

    server {
       ...
       location / {
           proxy_pass http://backend;
       }
    }

Make sure we don't loop infinitely in ngx_http_upstream_get_peer() but stop
after resetting peer weights once.

Return 0 if we are stuck.  This is guaranteed to work as peer 0 always exists,
and eventually ngx_http_upstream_get_round_robin_peer() will do the right
thing falling back to backup servers or returning NGX_BUSY.



然后我们利用nginx 1.0.10,来测试,目前还没有重现上周发现的bug。


要是我们早点测试这个模块,也许我们就能够首先发现这个bug。

不过机会是有的,因为我们有tcpcopy,什么样的在线流量都能够制造,不怕找不到流行开源软件的bug



上午那台机器没有重现此bug,但在新的一台机器出现了。

core文件信息如下:



#0  0x000000000046ae0b in ngx_http_upstream_get_round_robin_peer (pc=0x1b822678, data=0x1b826e10)
    at src/http/ngx_http_upstream_round_robin.c:482
#1  0x00000000004a6a60 in ngx_http_upstream_get_keepalive_peer (pc=0x1b822678, data=0x1b826de0)
    at ../ngx_http_upstream_keepalive-c6396fef9295/ngx_http_upstream_keepalive_module.c:227
#2  0x0000000000429fab in ngx_event_connect_peer (pc=0x1b822678) at src/event/ngx_event_connect.c:24
#3  0x000000000046219c in ngx_http_upstream_connect (r=0x1b821230, u=0x1b822668) at src/http/ngx_http_upstream.c:1103
#4  0x0000000000465f61 in ngx_http_upstream_next (r=0x1b821230, u=0x1b822668, ft_type=2)
    at src/http/ngx_http_upstream.c:2900
#5  0x00000000004631c0 in ngx_http_upstream_process_header (r=0x1b821230, u=0x1b822668) at src/http/ngx_http_upstream.c:1547
#6  0x0000000000462d8c in ngx_http_upstream_send_request (r=0x1b821230, u=0x1b822668) at src/http/ngx_http_upstream.c:1414
#7  0x00000000004625de in ngx_http_upstream_connect (r=0x1b821230, u=0x1b822668) at src/http/ngx_http_upstream.c:1204
#8  0x000000000046129d in ngx_http_upstream_init_request (r=0x1b821230) at src/http/ngx_http_upstream.c:631
#9  0x0000000000460a6d in ngx_http_upstream_init (r=0x1b821230) at src/http/ngx_http_upstream.c:432
#10 0x000000000045782b in ngx_http_read_client_request_body (r=0x1b821230, post_handler=0x460989 <ngx_http_upstream_init>)
    at src/http/ngx_http_request_body.c:154
#11 0x00000000004a0875 in ngx_hmux_handler (r=0x1b821230) at ../hmux/ngx_hmux_module.c:673
#12 0x000000000043ffd1 in ngx_http_core_content_phase (r=0x1b821230, ph=0x1b5d4f88) at src/http/ngx_http_core_module.c:1365
#13 0x000000000043ebf5 in ngx_http_core_run_phases (r=0x1b821230) at src/http/ngx_http_core_module.c:861
#14 0x000000000043eb71 in ngx_http_handler (r=0x1b821230) at src/http/ngx_http_core_module.c:844
#15 0x000000000044c9a1 in ngx_http_process_request (r=0x1b821230) at src/http/ngx_http_request.c:1665
#16 0x000000000044b34d in ngx_http_process_request_headers (rev=0x2b3f10a3f4f0) at src/http/ngx_http_request.c:1111
#17 0x000000000044ab26 in ngx_http_process_request_line (rev=0x2b3f10a3f4f0) at src/http/ngx_http_request.c:911
#18 0x0000000000449d7d in ngx_http_init_request (rev=0x2b3f10a3f4f0) at src/http/ngx_http_request.c:518
#19 0x0000000000434ea7 in ngx_epoll_process_events (cycle=0x1b5ae090, timer=1352, flags=1)
    at src/event/modules/ngx_epoll_module.c:678
#20 0x0000000000426480 in ngx_process_events_and_timers (cycle=0x1b5ae090) at src/event/ngx_event.c:245
---Type <return> to continue, or q <return> to quit---
#21 0x0000000000432d0c in ngx_worker_process_cycle (cycle=0x1b5ae090, data=0x0) at src/os/unix/ngx_process_cycle.c:801
#22 0x000000000042faad in ngx_spawn_process (cycle=0x1b5ae090, proc=0x432b60 <ngx_worker_process_cycle>, data=0x0,
    name=0x57450e "worker process", respawn=-3) at src/os/unix/ngx_process.c:196
#23 0x0000000000431c12 in ngx_start_worker_processes (cycle=0x1b5ae090, n=8, type=-3) at src/os/unix/ngx_process_cycle.c:360
#24 0x00000000004312fe in ngx_master_process_cycle (cycle=0x1b5ae090) at src/os/unix/ngx_process_cycle.c:136
#25 0x0000000000403c8c in main (argc=1, argv=0x7fffba823048) at src/core/nginx.c:405
(gdb) info lo
rrp = 0x1b826e10
now = 1321860194
m = 8
rc = 461502464
i = 18446744061305903636  ----注意这儿i的值

n = 0
c = 0x40
peer = 0x1b5cf2c0
peers = 0x1

(gdb) p *rrp
$5 = {peers = 0x1b5cf230, current = 3, tried = 0x1b826e28, data = 4}
(gdb) p rrp->peers->peer[0]
$6 = {sockaddr = 0x1b5c63b8, socklen = 16, name = {len = 19, data = 0x1b5c63c8 "xxx.xxx.xxx.xxx:36901server"},
  current_weight = 0, weight = 1, fails = 1, accessed = 1321860193, max_fails = 1, fail_timeout = 10, down = 0,
  ssl_session = 0x0}
(gdb) p rrp->peers->peer[1]
$7 = {sockaddr = 0x1b5c6418, socklen = 16, name = {len = 19, data = 0x1b5c6428 "xxx.xxx.xxx.xxx:36902server"},
  current_weight = 0, weight = 1, fails = 1, accessed = 1321860194, max_fails = 1, fail_timeout = 10, down = 0,
  ssl_session = 0x0}
(gdb) p rrp->peers->peer[2]
$8 = {sockaddr = 0x1b5c6478, socklen = 16, name = {len = 19, data = 0x1b5c6488 "xxx.xxx.xxx.xxx:36903server"},
  current_weight = 0, weight = 1, fails = 1, accessed = 1321860194, max_fails = 1, fail_timeout = 10, down = 0,
  ssl_session = 0x0}
(gdb) p rrp->peers->peer[3]
$9 = {sockaddr = 0x1b5c64d8, socklen = 16, name = {len = 19, data = 0x1b5c64e8 "xxx.xxx.xxx.xxx:36904keepalive"},
  current_weight = 0, weight = 1, fails = 1, accessed = 1321860193, max_fails = 1, fail_timeout = 10, down = 0,
  ssl_session = 0x0}


(gdb) up 1
#1  0x00000000004a6a60 in ngx_http_upstream_get_keepalive_peer (pc=0x1b822678, data=0x1b826de0)
    at ../ngx_http_upstream_keepalive-c6396fef9295/ngx_http_upstream_keepalive_module.c:227
227         rc = kp->original_get_peer(pc, kp->data);
(gdb) info lo
kp = 0x1b826de0
item = 0x6300000000
rc = 0
q = 0x1b826d51
cache = 0x0
c = 0x1b821c50


nginx版本和编译选项:

[root@208 sbin]# /home/nginx/sbin/nginx -V
nginx: nginx version: nginx/1.0.10
nginx: built by gcc 4.1.2 20080704 (Red Hat 4.1.2-48)
nginx: TLS SNI support enabled
nginx: configure arguments: --prefix=/home/nginx --user=nginx --group=nginx --without-select_module --without-poll_module --with-http_ssl_module --with-http_realip_module --with-http_gzip_static_module --with-http_stub_status_module --without-http_fastcgi_module --without-http_memcached_module --without-http_empty_gif_module --with-pcre=../pcre-7.9 --with-openssl=../openssl-0.9.8r --add-module=../hmux --add-module=../ngx_http_upstream_keepalive-c6396fef9295 --with-debug

nginx部分配置如下:

events {
    worker_connections  8192;
    use epoll;
    epoll_events 4096;
    accept_mutex off;
}

...
    upstream yyy  {
        server xxx.xxx.xxx.xxx:36901;
        server xxx.xxx.xxx.xxx:36902;
        server xxx.xxx.xxx.xxx:36903;
        server xxx.xxx.xxx.xxx:36904;

        keepalive 1024;
    }

...


各位nginx牛人,帮我们确认一下是否就是nginx的bug。在此谢谢了

联系方式:

wangbin5790@hotmail.com