OpenVPN's Mysterious Disconnections: the Problem and a Fix

Date: 2023-03-09 13:33:30

1. The Problem

Frankly, this is a genuine OpenVPN problem, one that almost everyone who runs OpenVPN has hit. Plenty of people have asked about it online, yet nobody has offered a solution; many threads even end with people abandoning OpenVPN over it. I have lived with this problem for more than two years. It has nagged at me since the day I first touched OpenVPN, and trips through the major forums at home and abroad turned up nothing satisfying. These past few days I finally had some spare time and decided to dig into it myself (my thanks to my employer for providing a test environment!). In the end I made a real breakthrough, and as always I am posting the result so that the next person facing this problem has one more answer to choose from.

As an aside, I cannot claim nobody has ever solved this, because I can only find and read posts in Chinese or English. My wife helped me translate a few Japanese ones, but there is plenty written by native speakers of German, Italian, Korean and other languages that I could neither locate nor understand. For the sake of reach I really should have written this article in English, but my English is poor enough that even Chinese readers might then fail to understand it...

The problem is this: when an OpenVPN tunnel crosses the public Internet, it drops inexplicably from time to time; not constantly, and not deterministically. Since most people use the Windows build as the OpenVPN client, I long assumed Windows itself was to blame, but a Linux client behaves exactly the same way. So Windows was largely wrongly accused (not entirely, since Linux at least has no DHCP-lease issue). Now that I had an environment, it was time to dig in, and so began another nerve-racking 48 hours.

Here is the server-side log at the moment a client disconnects (frequent disconnects produce a flood of these; this is just one excerpt):

2013-07-24 16:53:15  MULTI: REAP range 208 -> 224
...
2013-07-24 16:53:16  Test证书/218.242.253.131:18014 ACK output sequence broken: [5] 1 2 3 4
2013-07-24 16:53:16  GET INST BY REAL: 218.242.253.131:18014 [succeeded]
2013-07-24 16:53:16  Test证书/218.242.253.131:18014 UDPv4 READ [22] from 218.242.253.131:18014: P_ACK_V1 kid=0 [ 1 ]
2013-07-24 16:53:16  Test证书/218.242.253.131:18014 UDPv4 WRITE [114] to 218.242.253.131:18014: P_CONTROL_V1 kid=0 [ ] pid=5 DATA len=100
2013-07-24 16:53:16  Test证书/218.242.253.131:18014 ACK output sequence broken: [6] 5 2 3 4
2013-07-24 16:53:16  GET INST BY REAL: 218.242.253.131:18014 [succeeded]
2013-07-24 16:53:16  Test证书/218.242.253.131:18014 UDPv4 READ [22] from 218.242.253.131:18014: P_ACK_V1 kid=0 [ 2 ]
2013-07-24 16:53:16  Test证书/218.242.253.131:18014 UDPv4 WRITE [114] to 218.242.253.131:18014: P_CONTROL_V1 kid=0 [ ] pid=6 DATA len=100
2013-07-24 16:53:16  Test证书/218.242.253.131:18014 ACK output sequence broken: [7] 5 6 3 4
2013-07-24 16:53:16  GET INST BY REAL: 218.242.253.131:18014 [succeeded]
2013-07-24 16:53:16  Test证书/218.242.253.131:18014 UDPv4 READ [22] from 218.242.253.131:18014: P_ACK_V1 kid=0 [ 3 ]
2013-07-24 16:53:16  Test证书/218.242.253.131:18014 UDPv4 WRITE [114] to 218.242.253.131:18014: P_CONTROL_V1 kid=0 [ ] pid=7 DATA len=100
2013-07-24 16:53:16  Test证书/218.242.253.131:18014 ACK output sequence broken: [8] 5 6 7 4
2013-07-24 16:53:16  GET INST BY REAL: 218.242.253.131:18014 [succeeded]
2013-07-24 16:53:16  Test证书/218.242.253.131:18014 UDPv4 READ [22] from 218.242.253.131:18014: P_ACK_V1 kid=0 [ 4 ]
2013-07-24 16:53:16  Test证书/218.242.253.131:18014 UDPv4 WRITE [114] to 218.242.253.131:18014: P_CONTROL_V1 kid=0 [ ] pid=8 DATA len=100
2013-07-24 16:53:16  Test证书/218.242.253.131:18014 ACK output sequence broken: [9] 5 6 7 8
2013-07-24 16:53:16  GET INST BY REAL: 218.242.253.131:18014 [succeeded]
2013-07-24 16:53:16  Test证书/218.242.253.131:18014 UDPv4 READ [22] from 218.242.253.131:18014: P_ACK_V1 kid=0 [ 5 ]
2013-07-24 16:53:16  Test证书/218.242.253.131:18014 UDPv4 WRITE [114] to 218.242.253.131:18014: P_CONTROL_V1 kid=0 [ ] pid=9 DATA len=100
2013-07-24 16:53:16  Test证书/218.242.253.131:18014 ACK output sequence broken: [10] 9 6 7 8
2013-07-24 16:53:16  GET INST BY REAL: 218.242.253.131:18014 [succeeded]
2013-07-24 16:53:16  Test证书/218.242.253.131:18014 UDPv4 READ [22] from 218.242.253.131:18014: P_ACK_V1 kid=0 [ 6 ]
2013-07-24 16:53:16  Test证书/218.242.253.131:18014 UDPv4 WRITE [114] to 218.242.253.131:18014: P_CONTROL_V1 kid=0 [ ] pid=10 DATA len=100
2013-07-24 16:53:16  Test证书/218.242.253.131:18014 ACK output sequence broken: [11] 9 10 7 8
2013-07-24 16:53:16  GET INST BY REAL: 218.242.253.131:18014 [succeeded]
2013-07-24 16:53:16  Test证书/218.242.253.131:18014 UDPv4 READ [22] from 218.242.253.131:18014: P_ACK_V1 kid=0 [ 7 ]
2013-07-24 16:53:16  Test证书/218.242.253.131:18014 UDPv4 WRITE [114] to 218.242.253.131:18014: P_CONTROL_V1 kid=0 [ ] pid=11 DATA len=100
2013-07-24 16:53:16  Test证书/218.242.253.131:18014 ACK output sequence broken: [12] 9 10 11 8
2013-07-24 16:53:16  GET INST BY REAL: 218.242.253.131:18014 [succeeded]
2013-07-24 16:53:16  Test证书/218.242.253.131:18014 UDPv4 READ [22] from 218.242.253.131:18014: P_ACK_V1 kid=0 [ 9 ]
2013-07-24 16:53:16  Test证书/218.242.253.131:18014 ACK output sequence broken: [12] 10 11 8
2013-07-24 16:53:17  MULTI: REAP range 240 -> 256
2013-07-24 16:53:17  GET INST BY REAL: 218.242.253.131:18014 [succeeded]
2013-07-24 16:53:17  Test证书/218.242.253.131:18014 UDPv4 READ [22] from 218.242.253.131:18014: P_ACK_V1 kid=0 [ 10 ]
2013-07-24 16:53:17  Test证书/218.242.253.131:18014 ACK output sequence broken: [12] 11 8
2013-07-24 16:53:17  GET INST BY REAL: 218.242.253.131:18014 [succeeded]
2013-07-24 16:53:17  Test证书/218.242.253.131:18014 UDPv4 READ [22] from 218.242.253.131:18014: P_ACK_V1 kid=0 [ 11 ]
2013-07-24 16:53:17  Test证书/218.242.253.131:18014 ACK output sequence broken: [12] 8
2013-07-24 16:53:18  MULTI: REAP range 0 -> 16
2013-07-24 16:53:18  Test证书/218.242.253.131:18014 TLS: tls_pre_encrypt: key_id=0
2013-07-24 16:53:18  Test证书/218.242.253.131:18014 SENT PING
2013-07-24 16:53:18  Test证书/218.242.253.131:18014 ACK output sequence broken: [12] 8
2013-07-24 16:53:18  Test证书/218.242.253.131:18014 UDPv4 WRITE [53] to 218.242.253.131:18014: P_DATA_V1 kid=0 DATA len=52
2013-07-24 16:53:18  Test证书/218.242.253.131:18014 ACK output sequence broken: [12] 8
....no ACK for packet ID 8 arrived for a full 60 seconds, so the log kept printing ACK output sequence broken: [12] 8
2013-07-24 16:54:15  Test证书/218.242.253.131:18014 TLS Error: TLS key negotiation failed to occur within 60 seconds (check your network connectivity)
2013-07-24 16:54:15  Test证书/218.242.253.131:18014 TLS Error: TLS handshake failed with peer 0.0.0.0


Every so often the connection drops, and reconnection does not always succeed! So there are two distinct problems here:

a. The tunnel drops while it is established (the ping-restart case, not shown in the log above)
b. Reconnection fails (what the log above shows)

2. Analysis

OpenVPN over UDP is fussy, but to avoid retransmission stacking (retransmit-on-retransmit), UDP really is the right choice on a hostile link. The reliable layer OpenVPN implements over UDP, however, is a drastically simplified "in-order acknowledged connection" layer: it only guarantees in-order delivery of data and provides an acknowledgment mechanism, nothing like TCP. Then again, if you look at TCP's original design, you will find that TCP's essence is exactly OpenVPN's reliable layer; all the later complexity is optimization for particular situations!

As in TCP, ACKs are not themselves ACKed, which obliges the sender to retransmit the unacknowledged packets in its sliding window, because a pure ACK can be lost (I leave piggybacked ACKs aside here). Once an ACK is lost, the sender must retransmit the un-ACKed packet; the key question is when. A protocol usually keeps one or more timers and retransmits on expiry, and in my view this timer must live inside the protocol itself rather than be delegated to the upper layer, since retransmission is supposed to be invisible to the upper layer!
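
To make the timer idea concrete, here is a minimal sketch (my own illustration, not OpenVPN code) of a reliable layer that owns its retransmission deadline: every in-flight packet carries its own next_try time, and the protocol's event loop, not the upper layer, decides when it is due.

#include <time.h>
#include <stdbool.h>

/* Hypothetical in-flight entry; the field names deliberately mirror
 * the next_try/timeout fields of OpenVPN's reliable_entry shown later. */
struct rexmit_entry {
    bool          active;     /* holds an unacknowledged packet        */
    unsigned int  packet_id;  /* sequence number                       */
    time_t        next_try;   /* deadline for the next retransmission  */
    int           timeout;    /* current backoff interval, in seconds  */
};

/* Called from the protocol's own event loop: an entry whose deadline
 * has passed is due for retransmission, invisibly to the upper layer. */
static bool
entry_due (const struct rexmit_entry *e, time_t now)
{
    return e->active && now >= e->next_try;
}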

However, OpenVPN's reliable layer implements nothing whatsoever to cope with a lost ACK. The log above makes this visible; the endlessly repeated

Test证书/218.242.253.131:18014 ACK output sequence broken: [12] 8

shows that the packet with ID 8 never gets retransmitted. Furthermore, from:

2013-07-24 16:53:16  Test证书/218.242.253.131:18014 UDPv4 READ [22] from 218.242.253.131:18014: P_ACK_V1 kid=0 [ 6 ]
2013-07-24 16:53:16  Test证书/218.242.253.131:18014 UDPv4 WRITE [114] to 218.242.253.131:18014: P_CONTROL_V1 kid=0 [ ] pid=10 DATA len=100
2013-07-24 16:53:16  Test证书/218.242.253.131:18014 ACK output sequence broken: [11] 9 10 7 8
2013-07-24 16:53:16  GET INST BY REAL: 218.242.253.131:18014 [succeeded]
2013-07-24 16:53:16  Test证书/218.242.253.131:18014 UDPv4 READ [22] from 218.242.253.131:18014: P_ACK_V1 kid=0 [ 7 ]
2013-07-24 16:53:16  Test证书/218.242.253.131:18014 UDPv4 WRITE [114] to 218.242.253.131:18014: P_CONTROL_V1 kid=0 [ ] pid=11 DATA len=100
2013-07-24 16:53:16  Test证书/218.242.253.131:18014 ACK output sequence broken: [12] 9 10 11 8
2013-07-24 16:53:16  GET INST BY REAL: 218.242.253.131:18014 [succeeded]
2013-07-24 16:53:16  Test证书/218.242.253.131:18014 UDPv4 READ [22] from 218.242.253.131:18014: P_ACK_V1 kid=0 [ 9 ]
2013-07-24 16:53:16  Test证书/218.242.253.131:18014 ACK output sequence broken: [12] 10 11 8
2013-07-24 16:53:17  MULTI: REAP range 240 -> 256
2013-07-24 16:53:17  GET INST BY REAL: 218.242.253.131:18014 [succeeded]
2013-07-24 16:53:17  Test证书/218.242.253.131:18014 UDPv4 READ [22] from 218.242.253.131:18014: P_ACK_V1 kid=0 [ 10 ]


these lines show that the ACK for packet ID 8 was indeed never received, i.e. it was lost. The packets sent afterwards keep filling the send window until it is full; packet 8 has still not been retransmitted, nor has the peer's ACK for it arrived, hence the ACK output sequence broken. Checking the code: 12 - 8 = 4, and 4 is exactly the send-window length. After a long stretch of ACK output sequence broken there is still no retransmission, until either:

a. after the tunnel is up: the ping-restart timer expires
b. during tunnel setup: the TLS handshake fails


In fact, the correct behavior is to retransmit the moment the window is detected to be full. TCP learns of loss through three duplicate ACKs; OpenVPN's reliable layer learns of a lost ACK through ACK output sequence broken. That is a signal, and once you have the signal you should act on it!
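
The arithmetic behind that signal fits in a tiny self-contained illustration (mine, plugging in the numbers from the log; the real test appears in reliable_get_buf_output_sequenced in the next section):

#include <stdio.h>

#define TLS_RELIABLE_N_SEND_BUFFERS 4   /* OpenVPN's send-window length */

int main (void)
{
    unsigned int packet_id = 12;  /* next ID to be handed out (from the log) */
    unsigned int min_id    = 8;   /* oldest still-unacknowledged ID          */

    if ((int)(packet_id - min_id) < TLS_RELIABLE_N_SEND_BUFFERS)
        printf ("window has room: a new output buffer can be granted\n");
    else
        printf ("ACK output sequence broken: 12 - 8 = 4, window is full\n");
    return 0;
}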

3. The Original Plan

The plan is simple: inside the block that prints ACK output sequence broken, retransmit all unacknowledged packets. As an optimization, it suffices to retransmit only the packet with the smallest ID. We reach ACK output sequence broken because the window is full, and the window is full because the smallest-ID packet is unacknowledged: it pins down a large slice of the window, including the space behind it that may in fact already be acknowledged. Once the smallest-ID packet is acknowledged the window opens up again, so retransmitting just that one packet, in the hope that the peer acknowledges it once more, is enough.

Simple as the plan is, it is worth nothing until it lands in code. Here are my attempts.

4. First Attempt: Failure

After work on July 25th I could not sleep again, so I hid in my daughter's little room and started coding. First I had to confirm that the peer ACKs out-of-order or replayed packets too; if not, a much bigger rewrite would be needed. I found the code in ssl.c, in tls_pre_decrypt:

if (op != P_ACK_V1 && reliable_can_get (ks->rec_reliable)) {
    packet_id_type id;

    /* Extract the packet ID from the packet */
    if (reliable_ack_read_packet_id (buf, &id)) {
        /* Avoid deadlock by rejecting packet that would de-sequentialize receive buffer */
        if (reliable_wont_break_sequentiality (ks->rec_reliable, id)) {
            if (reliable_not_replay (ks->rec_reliable, id)) {
                /* Save incoming ciphertext packet to reliable buffer */
                struct buffer *in = reliable_get_buf (ks->rec_reliable);
                ASSERT (in);
                ASSERT (buf_copy (in, buf));
                reliable_mark_active_incoming (ks->rec_reliable, in, id, op);
            }
            //Note this comment: even a replay packet gets ACKed! My plan for coping with a lost
            //ACK is precisely to replay the packet whose ACK never arrives, expecting the peer to
            //ACK it again; with random loss, the ACK for that one packet cannot keep getting lost forever!
            /* Process outgoing acknowledgment for packet just received, even if it's a replay */
            reliable_ack_acknowledge_packet_id (ks->rec_ack, id);
        }
    }
}

With that established, I at least knew the change to OpenVPN's reliable layer would be small! Next was finding where to change, which is of course wherever the problem shows itself. We are stuck precisely because of "ACK output sequence broken", so I located where it is printed, in the reliable_get_buf_output_sequenced function:

struct buffer *
reliable_get_buf_output_sequenced (struct reliable *rel)
{
    struct gc_arena gc = gc_new ();
    int i;
    packet_id_type min_id = 0;
    bool min_id_defined = false;
    struct buffer *ret = NULL;

    /* find minimum active packet_id */
    for (i = 0; i < rel->size; ++i) {
        const struct reliable_entry *e = &rel->array[i];
        if (e->active) {
            if (!min_id_defined || e->packet_id < min_id) {
                min_id_defined = true;
                min_id = e->packet_id;
            }
        }
    }
    //The log above already told us why the test below fails:
    // ... ACK output sequence broken: [12] 8
    //12 - 8 = 4, and #define TLS_RELIABLE_N_SEND_BUFFERS  4
    if (!min_id_defined || (int)(rel->packet_id - min_id) < rel->size) {
        ret = reliable_get_buf (rel);
    } else {
        dmsg (D_REL_LOW, "ACK output sequence broken: %s", reliable_print_ids (rel, &gc));
    }
    gc_free (&gc);
    return ret;
}

So all that is needed is to retransmit the buf whose packet_id equals min_id, right where the broken message is printed!

#ifdef RETRY
struct buffer *
reliable_get_buf_output_sequenced (struct reliable *rel, int *flag)
#else
struct buffer *
reliable_get_buf_output_sequenced (struct reliable *rel)
#endif
{
    struct gc_arena gc = gc_new ();
    int i;
    packet_id_type min_id = 0;
    bool min_id_defined = false;
    struct buffer *ret = NULL;
#ifdef RETRY
    struct buffer *retry_buff = NULL; //not named replay_buffer!
    *flag = 0;
#endif
    /* find minimum active packet_id */
    for (i = 0; i < rel->size; ++i) {
        struct reliable_entry *e = &rel->array[i];  /* non-const: we may hand back e->buf */
        if (e->active) {
            if (!min_id_defined || e->packet_id < min_id) {
                min_id_defined = true;
                min_id = e->packet_id;
#ifdef RETRY
                //retry_buff = e->buf;
                ret = &e->buf;
#endif
            }
        }
    }
    //The log above already told us why the test below fails:
    // ... ACK output sequence broken: [12] 8
    //12 - 8 = 4, and #define TLS_RELIABLE_N_SEND_BUFFERS  4
    if (!min_id_defined || (int)(rel->packet_id - min_id) < rel->size) {
        ret = reliable_get_buf (rel);
    } else {
#ifdef RETRY
        *flag = 1;
#endif
        dmsg (D_REL_LOW, "ACK output sequence broken: %s", reliable_print_ids (rel, &gc));
    }
    gc_free (&gc);
    return ret;
}

Correspondingly, the caller of this function had to change, i.e. tls_process in ssl.c (I omit the ks->state == S_INITIAL initial case here):

if (ks->state >= S_START) {
    int status = -1;  /* declared unconditionally: used in both builds */
#ifdef RETRY
    int retry = 0;
    buf = reliable_get_buf_output_sequenced (ks->send_reliable, &retry);
#else
    buf = reliable_get_buf_output_sequenced (ks->send_reliable);
#endif
    if (buf) {
#ifdef RETRY
        if (!retry) {
#endif
            status = key_state_read_ciphertext (multi, ks, buf, PAYLOAD_SIZE_DYNAMIC (&multi->opt.frame));
            if (status == -1) {
                msg (D_TLS_ERRORS,
                       "TLS Error: Ciphertext -> reliable TCP/UDP transport read error");
                goto error;
            }
#ifdef RETRY
        } else {
            status = 1;
        }
#endif
        if (status == 1) {
            reliable_mark_active_outgoing (ks->send_reliable, buf, P_CONTROL_V1);
            INCR_GENERATED;
            state_change = true;
            dmsg (D_TLS_DEBUG, "Outgoing Ciphertext -> Reliable");
        }
    }
}

Reams of tidy code quite unlike my usual style, COOL! But when I ran it, an ASSERT failed: even though I was retransmitting the smallest-ID packet, execution died in write_control_auth at:

ASSERT (session_id_write_prepend (&session->session_id, buf));

This one line exited in grand style! And buf turned out not to be the buffer I wanted to retransmit at all! In single-threaded, single-process OpenVPN, nothing else could possibly have touched that buf!

5. Second Attempt: Success

Failure! The night fell silent, and who was there to hear my thoughts?

Yet the problem was not that complicated, and cracking the case was simple: read the code. I finally found the reliable_schedule_now function, and the key is its comment:

/* schedule all pending packets for immediate retransmit */

Retransmit! Yes, retransmission! Since OpenVPN already has retransmission built in, my hand-rolled retransmit was redundant! So do it by the book and simply call this interface. Always use existing interfaces; never re-implement existing logic! The patch thus became even simpler: only the reliable_get_buf_output_sequenced function needs to change:

struct buffer *
reliable_get_buf_output_sequenced (struct reliable *rel)
{
    struct gc_arena gc = gc_new ();
    int i;
    packet_id_type min_id = 0;
    bool min_id_defined = false;
    struct buffer *ret = NULL;

    /* find minimum active packet_id */
    for (i = 0; i < rel->size; ++i) {
        const struct reliable_entry *e = &rel->array[i];
        if (e->active) {
            if (!min_id_defined || e->packet_id < min_id) {
                min_id_defined = true;
                min_id = e->packet_id;
            }
        }
    }
    if (!min_id_defined || (int)(rel->packet_id - min_id) < rel->size) {
        ret = reliable_get_buf (rel);
    } else {
#ifdef RETRY
        reliable_schedule_now (rel);
        //update the log message while we are at it
        dmsg (D_REL_LOW, "ACK output sequence broken: %s, retransmit immediately", reliable_print_ids (rel, &gc));
#else
        dmsg (D_REL_LOW, "ACK output sequence broken: %s", reliable_print_ids (rel, &gc));
#endif
    }
    gc_free (&gc);
    return ret;
}


Nothing else needs to change!
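
Why is resetting next_try enough? Because the existing send path already scans the window for entries whose deadline has passed and hands them back to the transport. The sketch below is my simplified paraphrase of that pickup logic, reusing the struct reliable/reliable_entry fields visible in the code above; the doubling backoff is an assumption, not verbatim OpenVPN:

/* Simplified paraphrase (not verbatim OpenVPN) of how the send path
 * picks up whatever reliable_schedule_now has scheduled. */
static struct buffer *
pick_due_entry (struct reliable *rel, time_t now)
{
    int i;
    for (i = 0; i < rel->size; ++i)
    {
        struct reliable_entry *e = &rel->array[i];
        if (e->active && now >= e->next_try)
        {
            e->next_try = now + e->timeout;  /* assumed backoff policy */
            e->timeout *= 2;
            return &e->buf;  /* retransmit this buffer */
        }
    }
    return NULL;  /* nothing due yet */
}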

6. Optimization (Removing a Side Effect)

Simple as the change above is, it has a side effect: one broken event retransmits every packet in the reliable window, when all we need to retransmit is the one with the smallest ID, the actual culprit! If the ACK loss was caused by network conditions, retransmitting everything may only cause more loss; TCP illustrates this most vividly. Retransmitting only the smallest-ID packet keeps the number of packets in the pipe conserved: its lost ACK caused the broken state, so send it exactly once more. Besides, retransmitting indiscriminately misjudges many ACKs as lost, when some higher-ID ACKs were not lost at all, merely reordered!

There is no free lunch, so the unavoidable part is changing the reliable_schedule_now logic:

struct buffer *
reliable_get_buf_output_sequenced (struct reliable *rel)
{
 ...
#ifdef RETRY
        reliable_schedule_id (rel, 0, min_id);
        //update the log message while we are at it
        dmsg (D_REL_LOW, "ACK output sequence broken: %s, retransmit immediately", reliable_print_ids (rel, &gc));
#else
        dmsg (D_REL_LOW, "ACK output sequence broken: %s", reliable_print_ids (rel, &gc));
#endif
...
}


The real change is the scheduling logic:

#ifdef RETRY
void
reliable_schedule_id (struct reliable *rel, time_t timeout, packet_id_type id)
{
  int i;
  dmsg (D_REL_DEBUG, "ACK reliable_schedule_id");
  rel->hold = false;
  for (i = 0; i < rel->size; ++i)
    {
      struct reliable_entry *e = &rel->array[i];
      if (e->active)
        {
          /* id == 0 means "reschedule every active entry"; otherwise
           * touch only the entry carrying the requested packet_id */
          if (id == 0 || e->packet_id == id)
            {
              e->next_try = now + timeout;
              e->timeout = rel->initial_timeout;
              if (id != 0)
                break;
            }
        }
    }
}
#endif
/* schedule all pending packets for immediate retransmit */
void
reliable_schedule_now (struct reliable *rel)
{
#ifdef RETRY
  /* delegate to the new interface: id == 0 schedules everything */
  reliable_schedule_id (rel, 0, 0);
#else
  int i;
  dmsg (D_REL_DEBUG, "ACK reliable_schedule_now");
  rel->hold = false;
  for (i = 0; i < rel->size; ++i)
    {
      struct reliable_entry *e = &rel->array[i];
      if (e->active)
        {
          e->next_try = now;
          e->timeout = rel->initial_timeout;
        }
    }
#endif
}

The new reliable_schedule_id takes three parameters: besides rel, a timeout parameter added for future fine-tuning, and an id parameter that names the packet ID to retransmit.
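
To summarize the two intended uses side by side, here are hypothetical wrappers (mine, not part of the patch itself):

#ifdef RETRY
/* retransmit only the stalled smallest-ID packet, immediately */
static void
retransmit_stalled (struct reliable *rel, packet_id_type min_id)
{
    reliable_schedule_id (rel, 0, min_id);
}

/* id == 0 selects every active entry, i.e. the old
 * reliable_schedule_now behavior */
static void
retransmit_all (struct reliable *rel)
{
    reliable_schedule_id (rel, 0, 0);
}
#endif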

7. Notes

The change I am presenting only fixes the lost-ACK case; problems caused by reordering or anything else still have to be detected by the SSL layer above, which must then reset. If the link bandwidth is inherently low and ping-restart is short, the remedy is simply a longer ping-restart. Put bluntly, this fix does exactly one thing: when a lost ACK has jammed the send window full, retransmit as early as possible so the window becomes usable again as early as possible.

8. Testing

Finishing the code only means the change is theoretically complete; whether it actually helps is for test data to decide. Today the company's Shanghai-to-Beijing public-network test bed was not cooperating, and I could not pin all my hopes on it anyway, so I set out to simulate the packet loss of a hostile network myself.

Given my (dare I say unmatched) familiarity with iptables, I reached for iptables once more, this time its statistic match. The test topology is as follows:

[Figure: test topology — OpenVPN client and server with a Linux box in the middle acting as the router]

The Linux box in the middle serves as the router and doubles as the hostile network; the following rules simulate random loss at a fixed percentage:

# Simulate roughly 30% packet loss; this loss rate is already very high!

iptables -t mangle -A PREROUTING -m statistic --mode random --probability 0.3 -j MARK --set-mark 100
iptables -t filter -A FORWARD -p udp -m mark --mark 100 -j DROP


Of course this can also be simulated with just two directly connected machines, or even a single machine. In that case the iptables rules above must be applied on the receiving machine, and with a single machine they can only go on the loopback interface, in PREROUTING and INPUT. UDP loss must not be simulated in OUTPUT, because the drop is reported back to the application layer as "Operation not permitted", and then it is no longer the reliable layer at work: the application sees the error and resends the UDP packet itself. The single-machine rules over the lo interface are:

iptables -t mangle -A PREROUTING -m statistic --mode random --probability 0.3 -j MARK --set-mark 100
iptables -t filter -A INPUT -p udp -m mark --mark 100 -j DROP


First I tried a stock OpenVPN connection: masses of ACK output sequence broken messages. Sometimes a broken state recovers thanks to a retransmit, but in most cases, if it does not recover within the ping-restart window, the connection drops; even the initial connection can take a long time. Then I tested my modified burst-retry build: broken messages still appear, but they are not continuous and recover almost instantly, and the initial connection comes up quickly! (After connecting, it can still drop.) We need not be so extreme: 30% loss means the network is barely usable, so I brought the loss rate into the ordinary sub-10% range by changing the 0.3 above to 0.08. With that, the retry build almost never drops, while the stock build still hits ping-restart from time to time! The ping rate here was one ping every 2 seconds, with ping-restart at 10 seconds. At this 8% loss rate, setting ping-restart to 5 seconds would be dangerous: even the retry build would drop, and legitimately so, since losing a single PING could drop the link; the ping-restart would simply be too short!

But once I pushed the loss rate to 50%, even the retry build would sometimes drop... then again, stock OpenVPN cannot connect at all under those conditions!

Addendum: is the optimization of Section 6 actually worthwhile? That is a question! In a truly hostile network (low bandwidth, high loss rate; remember that UDP grabs bandwidth without restraint, having neither congestion control nor flow control, and even this reliable layer implements neither), the version without the optimization, i.e. the one from Section 5, works better, because more retransmissions always have a better chance of dodging the loss, at the cost of grabbing bandwidth. From the standpoint of global fairness, use the optimized retry build; from a selfish standpoint, use the original retry build!

9. If I Were an ISP

If I were an ISP, I would install a policy in the core that randomly drops non-core users' UDP at a weighted percentage, to protect TCP and core applications, for these reasons:

a. UDP has no flow or congestion control; left unpoliced, it eventually eats all the bandwidth, making congestion-controlled protocols like TCP believe the network is congested and slow themselves down.
b. UDP is stateless, so it cannot be controlled well through five-tuple tracking.
c. UDP and multicast are natural partners and together may get up to things best kept out of sight.
d. Policed TCP likes to masquerade inside UDP and parade around in the open.


So policing UDP is necessary! Is this merely my conjecture? NO! It turns out I genuinely arrived at the same idea as the operators (which suggests my understanding of networks is passable): many ISPs really do exactly this. That raises a problem: OpenVPN's default pushed ping interval is 10 seconds, with ping-restart at 120 seconds. Under 10% random UDP loss, losing 12 packets in a row is infrequent but not impossible, so the only remedy is to send more often and smooth the loss rate out; I therefore set the ping interval to 4 seconds and left ping-restart unchanged.
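
A back-of-the-envelope check of why sending more often smooths loss out (my own simplification: it assumes independent random loss and counts only keepalives, whereas ping-restart actually resets on any received packet):

#include <stdio.h>
#include <math.h>

int main (void)
{
    double loss = 0.10;  /* assumed per-packet loss rate */

    /* default: ping every 10 s, ping-restart 120 s -> 12 pings per window */
    printf ("10 s pings: P(all 12 lost) = %g\n", pow (loss, 12.0));

    /* with 4 s pings the same 120 s window holds 30 pings */
    printf (" 4 s pings: P(all 30 lost) = %g\n", pow (loss, 30.0));
    return 0;
}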

If that is not enough, I will set the ping interval to 1 second. My apologies to the operators: a UDP flood is far nastier than a TCP synflood, since the latter only hurts end systems while a UDP flood can hurt the entire network.

10. Remaining Problems

The retransmission flaw tracked down above is not the only thing that disconnects OpenVPN; there is also the ping-restart interval itself. On a perfectly normal low-speed link, say a slow DDN leased line or ISDN, a 5-second ping-restart is far too short and times out easily: the link is limited to begin with and carries plenty of other traffic on top. So "what disconnects OpenVPN" is not a question with a single answer; many factors can cause it!