DRBD Split-Brain Recovery Notes

Date: 2022-11-19 14:45:50

Environment:

DRBD resource: jcluster
Primary node: primary
Secondary node: secondary
Mount point: /data

Commands used:

service drbd start
service drbd stop
service drbd status
service mysqld stop
List the processes using the mount point:
fuser -m -v /data/
If umount fails, kill the processes holding the mount:
fuser -m -v -k /data/
umount /data/
drbdadm connect jcluster //connect to the DRBD resource
drbdadm disconnect jcluster //disconnect from the resource
drbdadm connect --discard-my-data jcluster //on the secondary: discard the local data and resync from the primary
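The commands above can be collected into a small recovery script. This is only a sketch: the resource name jcluster and mount point /data are taken from this environment, and by default it just prints each command (DRY_RUN=1) so the sequence can be reviewed before running it for real.

```shell
#!/bin/sh
# Sketch of the split-brain recovery sequence for this environment.
# DRY_RUN=1 (the default) prints the commands instead of executing them.
RESOURCE=jcluster
MOUNTPOINT=/data
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# On the node whose data will be discarded (the secondary here):
run service mysqld stop                           # stop services using the mount
run fuser -m -v -k "$MOUNTPOINT"                  # kill remaining users of /data
run umount "$MOUNTPOINT"                          # release the filesystem
run drbdadm disconnect "$RESOURCE"                # drop any half-open connection
run drbdadm connect --discard-my-data "$RESOURCE" # resync from the peer

# On the surviving node (the primary here), run instead:
# run drbdadm connect "$RESOURCE"
```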

The business server was recently rebooted. It runs a Pacemaker + DRBD + MySQL high-availability stack, and on-site staff brought the nodes up without the heartbeat link connected, which left DRBD in a split-brain state. The symptoms as seen with drbd-overview and crm status:
Primary node:
[root@primary ~]# drbd-overview
  0:jcluster/0  StandAlone Primary/Unknown UpToDate/DUnknown r----- /data ext4 2.7T 2.0G 2.6T 1%

[root@primary ~]# crm status
Last updated: Thu Mar 30 10:03:18 2017
Last change: Tue Mar 28 10:25:46 2017
Stack: classic openais (with plugin)
Current DC: primary - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes
6 Resources configured

Online: [ primary secondary ]

 Master/Slave Set: ms_drbd_just [drbd_just]
     Masters: [ primary ]
     Slaves: [ secondary ]
 Resource Group: justcall
     fs_just    (ocf::heartbeat:Filesystem):    Started primary
     ip_just    (ocf::heartbeat:IPaddr2):       Started primary
     crond_just (lsb:crond):    Started primary
     apache_just        (ocf::heartbeat:apache):        Started primary

Failed actions:
    drbd_just_monitor_30000 on secondary 'not running' (7): call=25, status=complete, last-rc-change='Thu Mar 30 10:01:36 2017', queued=0ms, exec=0ms


Secondary node:
[root@secondary ~]# drbd-overview
  0:jcluster/0  StandAlone Secondary/Unknown UpToDate/DUnknown r-----

[root@secondary ~]# crm status
Last updated: Thu Mar 30 10:02:58 2017
Last change: Tue Mar 28 10:25:46 2017
Stack: classic openais (with plugin)
Current DC: primary - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes
6 Resources configured

Online: [ primary secondary ]

 Master/Slave Set: ms_drbd_just [drbd_just]
     Masters: [ primary ]
     Slaves: [ secondary ]
 Resource Group: justcall
     fs_just    (ocf::heartbeat:Filesystem):    Started primary
     ip_just    (ocf::heartbeat:IPaddr2):       Started primary
     crond_just (lsb:crond):    Started primary
     apache_just        (ocf::heartbeat:apache):        Started primary

Failed actions:
    drbd_just_monitor_30000 on secondary 'not running' (7): call=25, status=complete, last-rc-change='Thu Mar 30 10:01:36 2017', queued=0ms, exec=0ms

Both nodes are in StandAlone state: the primary shows StandAlone Primary/Unknown UpToDate/DUnknown, and the secondary shows StandAlone Secondary/Unknown UpToDate/DUnknown.
This case is straightforward to fix. First confirm that the heartbeat link is properly connected and that the nodes can ping each other.
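The StandAlone/Unknown signature can be picked out of drbd-overview output with a quick grep. A minimal sketch, assuming the output format shown above; in practice you would pipe real output into it, e.g. drbd-overview | check_split_brain:

```shell
# Report whether a drbd-overview line shows the split-brain signature:
# the resource StandAlone with the peer's role/disk state Unknown.
check_split_brain() {
    if grep -q 'StandAlone .*/Unknown'; then
        echo "possible split-brain: resource is StandAlone and peer state is Unknown"
        return 0
    fi
    echo "no StandAlone/Unknown state seen"
    return 1
}

# Demo with the primary's output from above:
echo '  0:jcluster/0  StandAlone Primary/Unknown UpToDate/DUnknown r-----' | check_split_brain
# prints: possible split-brain: resource is StandAlone and peer state is Unknown
```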

On the primary, reconnect the resource:
[root@primary ~]# drbdadm connect jcluster

After roughly 5-10 minutes the state becomes:
[root@primary ~]# drbd-overview
  0:jcluster/0  WFConnection Primary/Unknown UpToDate/DUnknown C r----- /data ext4 2.7T 2.0G 2.6T 1%

The secondary is still in StandAlone state:
[root@secondary ~]# drbd-overview
  0:jcluster/0  StandAlone Secondary/Unknown UpToDate/DUnknown r-----

On the secondary, run drbdadm connect --discard-my-data jcluster to discard the local data and resync from the primary:
[root@secondary ~]# drbdadm connect --discard-my-data jcluster

Resynchronization starts automatically; with a small amount of data it completes in about 5 minutes:
[root@secondary ~]# drbd-overview
  0:jcluster/0  SyncTarget Secondary/Primary Inconsistent/UpToDate C r-----
        [>....................] sync'ed:  4.1% (4332/4512)M
[root@secondary ~]# drbd-overview
  0:jcluster/0  SyncTarget Secondary/Primary Inconsistent/UpToDate C r-----
        [=>..................] sync'ed: 10.9% (4024/4512)M
[root@secondary ~]# drbd-overview
  0:jcluster/0  SyncTarget Secondary/Primary Inconsistent/UpToDate C r-----
        [=>..................] sync'ed: 12.5% (3956/4512)M
[root@secondary ~]# drbd-overview
  0:jcluster/0  SyncTarget Secondary/Primary Inconsistent/UpToDate C r-----
        [=>..................] sync'ed: 13.4% (3912/4512)M
[root@secondary ~]# drbd-overview
  0:jcluster/0  SyncTarget Secondary/Primary Inconsistent/UpToDate C r-----
        [=>..................] sync'ed: 14.8% (3848/4512)M
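For scripted monitoring, the percentage can be extracted from these progress lines. A small sketch, assuming the "sync'ed: N%" format shown above:

```shell
# Extract the "sync'ed" percentage from a drbd-overview progress line.
sync_percent() {
    sed -n "s/.*sync'ed:[[:space:]]*\([0-9.]*\)%.*/\1/p"
}

# Example with a progress line from the log above:
echo "        [=>..................] sync'ed: 14.8% (3848/4512)M" | sync_percent
# prints: 14.8
```

A polling loop such as `while drbd-overview | grep -q SyncTarget; do sleep 30; done` could then be used to wait for the resync to finish.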

State after resynchronization:
[root@primary ~]# drbd-overview
  0:jcluster/0  Connected Primary/Secondary UpToDate/UpToDate C r----- /data ext4 2.7T 2.0G 2.6T 1%

[root@secondary ~]# drbd-overview
  0:jcluster/0  Connected Secondary/Primary UpToDate/UpToDate C r-----

All services are back to normal. The Failed actions entries in crm status are just records of the earlier failure.
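If the stale entry is distracting, it can usually be cleared instead of being left to age out. A sketch using crmsh, with the resource and node names taken from the crm status output above (verify against your own configuration):

```shell
# Clear the recorded monitor failure so it no longer appears in "crm status".
crm resource cleanup drbd_just secondary
```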

[root@primary ~]# crm status
Last updated: Thu Mar 30 10:19:56 2017
Last change: Tue Mar 28 10:25:46 2017
Stack: classic openais (with plugin)
Current DC: primary - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes
6 Resources configured

Online: [ primary secondary ]

 Master/Slave Set: ms_drbd_just [drbd_just]
     Masters: [ primary ]
     Slaves: [ secondary ]
 Resource Group: justcall
     fs_just    (ocf::heartbeat:Filesystem):    Started primary
     ip_just    (ocf::heartbeat:IPaddr2):       Started primary
     crond_just (lsb:crond):    Started primary
     apache_just        (ocf::heartbeat:apache):        Started primary

Failed actions:
    drbd_just_monitor_30000 on secondary 'not running' (7): call=25, status=complete, last-rc-change='Thu Mar 30 10:08:36 2017', queued=0ms, exec=0ms

[root@secondary ~]# crm status
Last updated: Thu Mar 30 10:20:31 2017
Last change: Tue Mar 28 10:25:46 2017
Stack: classic openais (with plugin)
Current DC: primary - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes
6 Resources configured

Online: [ primary secondary ]

 Master/Slave Set: ms_drbd_just [drbd_just]
     Masters: [ primary ]
     Slaves: [ secondary ]
 Resource Group: justcall
     fs_just    (ocf::heartbeat:Filesystem):    Started primary
     ip_just    (ocf::heartbeat:IPaddr2):       Started primary
     crond_just (lsb:crond):    Started primary
     apache_just        (ocf::heartbeat:apache):        Started primary

Failed actions:
    drbd_just_monitor_30000 on secondary 'not running' (7): call=25, status=complete, last-rc-change='Thu Mar 30 10:08:36 2017', queued=0ms, exec=0ms

References:

DRBD split-brain recovery example
http://blog.csdn.net/levy_cui/article/details/56484618

Simulating a DRBD split-brain
http://myhat.blog.51cto.com/391263/606318/

DRBD split-brain handling
http://itindex.net/detail/50197-drbd

Troubleshooting a DRBD Unknown state
http://koumm.blog.51cto.com/703525/1769112/

Fixing "Device is busy" errors when running umount on Linux
http://blog.csdn.net/mzpmzk/article/details/53892956