故障分析 | MySQL 主从不同步分析案例

时间:2023-02-08 17:04:27

背景

线上某核心业务采用了自建MySQL 主从架构 ,因这个业务库主要是和第三方合作项目,无监控权限,DB权限需要临时申请才可操作

现象分析

  1. 项目组里反馈 业务数据最近两天无新数据产生。
  2. 推测如下:
  1. 此业务这两天确实无实际数据产生
  2. 主从同步异常

诊断

  1. 经查业务主库的binlog有变化,说明最近两天有数据
  2. 目前业务库拓扑结构如下:

故障分析 | MySQL 主从不同步分析案例

排查与处理

本次无数据是mysql07-bi-prod-bj1和主库不同步导致。不同步告警如下:

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 10.23.70.101
                  Master_User: repl_user
                  Master_Port: 3306
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 1396
                   Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '89d024cb-8865-11e9-babc-fa83d2adf000:33387663' at master log mysql_bin.000407, end_log_pos 427666467. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.

详细查看告警信息:    

mysql> select * from performance_schema.replication_applier_status_by_worker limit 1\G
*************************** 1. row ***************************
                                           CHANNEL_NAME:
                                              WORKER_ID: 1
                                              THREAD_ID: NULL
                                          SERVICE_STATE: OFF
                                      LAST_ERROR_NUMBER: 1396
                                     LAST_ERROR_MESSAGE: Worker 1 failed executing transaction '89d024cb-8865-11e9-babc-fa83d2adf000:33387663' at master log mysql_bin.000407, end_log_pos 427666467; Error 'Operation CREATE USER failed for 'app_sqoop_r'@'10.33.75.%'' on query. Default database: ''. Query: 'CREATE USER 'app_sqoop_r'@'10.33.75.%' IDENTIFIED WITH 'mysql_native_password' AS '*463A80AB6D77E0D99CDB0BD426EB50AC9243A4EA''
1 row in set (0.00 sec)

根据告警提示,查看master log mysql_bin.000407, end_log_pos 427666467前执行了什么。

解析主库mysql_bin.000407如下:

mysqlbinlog --no-defaults -vv  --base64-output=decode-rows \
>   /data/mysql/mysql57_3306/binlog/mysql_bin.000407 --stop-position=427666467
# at 427666247
#200224 11:59:51 server id 70101  end_log_pos 427666467 CRC32 0x3c9eaf42        Query   thread_id=76726 exec_time=0     error_code=0    Xid = 57596658
SET TIMESTAMP=1582516791.092605/*!*/;
CREATE USER 'app_sqoop_r'@'10.33.75.%' IDENTIFIED WITH 'mysql_native_password' AS '*463A80AB6D77E0D99CDB0BD426EB50AC9243A4EA'
/*!*/;
SET @@SESSION.GTID_NEXT= 'AUTOMATIC' /* added by mysqlbinlog */ /*!*/;
DELIMITER ;
# End of log file
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;
DELIMITER /*!*/;
# at 427666467
GRANT SELECT ON `xyy_crm`.* TO 'app_bi_w'@'10.33.75.%'
/*!*/;
# at 427666705

如上在create user时,从库2与主库不同步,检查从库2确实已经有'app_sqoop_r'@'10.33.75.%'用户,且权限与主库不同

查看主库的用户权限

mysql> show grants for app_sqoop_r@'10.33.75.%';
+-----------------------------------------------------------+
| Grants for app_sqoop_r@10.33.75.%                         |
+-----------------------------------------------------------+
| GRANT USAGE ON *.* TO `app_sqoop_r`@`10.33.75.%`          |
| GRANT SELECT ON `xyy_crm`.* TO `app_sqoop_r`@`10.33.75.%` |

mysql07-bi-prod-bj1从库2的用户权限
mysql> show grants for app_sqoop_r@'10.33.75.%';
 
| GRANT SELECT ON *.* TO `app_sqoop_r`@`10.33.75.%` |

至此 已经定位具体的问题:用户是直接在从库2建了'app_sqoop_r'@'10.33.75.%'用户所导致的不同步,接下来只要跳过该事物即可

[root@localhost] {10:48:04} ((none)) [2]> stop slave;
Query OK, 0 rows affected (0.00 sec)
= [root@localhost] {10:48:57} ((none)) [5]> set gtid_next='89d024cb-8865-11e9-babc-fa83d2adf000:33387663';
Query OK, 0 rows affected (0.00 sec)
 
= [root@localhost] {10:49:11} ((none)) [6]> begin;commit;
Query OK, 0 rows affected (0.00 sec)
 
Query OK, 0 rows affected (0.00 sec)
 
= [root@localhost] {10:49:15} ((none)) [7]> set gtid_next='AUTOMATIC';
Query OK, 0 rows affected (0.00 sec)
 
= [root@localhost] {10:49:23} ((none)) [8]> start slave;
Query OK, 0 rows affected (0.04 sec)

再查看主从同步:

[root@localhost] {10:49:26} ((none)) [9]> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 10.23.70.101
                  Master_User: repl_user
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql_bin.000411
          Read_Master_Log_Pos: 302997670
               Relay_Log_File: relay.000340
                Relay_Log_Pos: 448617544
        Relay_Master_Log_File: mysql_bin.000407
             Slave_IO_Running: Yes ### 此时查看 主从同步已正常
            Slave_SQL_Running: Yes ### 此时查看 主从同步已正常
         Seconds_Behind_Master: 166878

测试验证

针对以上问题做出以下测试与验证

先说结论如下:

  1. 验证1:若先在从库create,然后grant;主库create用户查看主从关系。(结论是主从异常)
  2. 验证2:若在从库grant方式创建并授权用户;主库grant,查看主从关系(结论是主从正常)
  3. 验证3:若从库grant方式创建并授权用户;主库create用户,查看主从关系(结论是主从异常)

验证1:

当从库已经创建了用户,主库创建用户是否会出现Last_Errno: 1396

构造场景1

从库执行:

root@localhost] {16:04:33} ((none)) [12]> CREATE USER 'app_sqoop_r'@'10.33.75.%' IDENTIFIED by password '*463A80AB6D77E0D99CDB0BD426EB50AC9243A4EA';
Query OK, 0 rows affected, 1 warning (0.07 sec)
 
[root@localhost] {16:05:18} ((none)) [13]> grant select on *.* to 'app_sqoop_r'@'10.33.75.%';
Query OK, 0 rows affected (0.00 sec)

主库执行

mysql> CREATE USER 'app_sqoop_r'@'10.33.75.%' IDENTIFIED by '*463A80AB6D77E0D99CD';
Query OK, 0 rows affected (0.00 sec)

此时查看主从关系

[root@localhost] {16:05:34} ((none)) [14]> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: mha2-db-test-bj2
                  Master_User: repl_user
                  Master_Port: 3300
                Connect_Retry: 60
              Master_Log_File: mysql_bin.000005
          Read_Master_Log_Pos: 192344
               Relay_Log_File: relay.000003
                Relay_Log_Pos: 192176
        Relay_Master_Log_File: mysql_bin.000005
             Slave_IO_Running: Yes
            Slave_SQL_Running: No #### 主从关系是异常的
                           Last_Errno: 1396
                   Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction 'fad94351-9975-11e9-9171-fa1b8143c800:5' at master log mysql_bin.000005, end_log_pos 192344. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
结论1:

主从同步异常: Slave_SQL_Running: No,Last_Errno: 1396

验证2

通过grant创建并授权用户

构造场景2

从库执行

[root@localhost] {16:21:22} ((none)) [36]> grant select on ost_test.* to app_test_r@'10.%' identified by  '3A80AB6D77E0D99CD';
Query OK, 0 rows affected, 1 warning (0.00 sec)

主库执行

mysql> grant insert on `ost_test`.* TO 'app_test_r'@'10.%' identified by  '5A80AB6D77E0D99CD';
Query OK, 0 rows affected, 1 warning (0.00 sec)

此时查看主从关系

[root@localhost] {16:21:53} ((none)) [37]> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: mha2-db-test-bj2
                  Master_User: repl_user
                  Master_Port: 3300
                Connect_Retry: 60
              Master_Log_File: mysql_bin.000005
          Read_Master_Log_Pos: 193391
               Relay_Log_File: relay.000004
                Relay_Log_Pos: 1501
        Relay_Master_Log_File: mysql_bin.000005
             Slave_IO_Running: Yes ### 主从同步正常
            Slave_SQL_Running: Yes ### 主从同步正常
结论2

主从同步正常,不影响数据同步

验证3

在从库grant方式创建并授权用户;主库create用户

构造场景3

从库执行

[root@localhost] {16:30:45} ((none)) [41]> grant select on ost_test.* to app_test_r@'10.%' identified by  '3A80AB6D77E0D99CD';
Query OK, 0 rows affected, 1 warning (0.00 sec)

主库执行

mysql> create user app_test_r@'10.%' identified by  '3A80AB6D77E0D99';
Query OK, 0 rows affected (0.00 sec)

查看主从关系

[root@localhost] {16:30:59} ((none)) [42]> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: mha2-db-test-bj2
                  Master_User: repl_user
                  Master_Port: 3300
                Connect_Retry: 60
             Slave_IO_Running: Yes
            Slave_SQL_Running: No ### 主从异常
                   Last_Errno: 1396
                   Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction 'fad94351-9975-11e9-9171-fa1b8143c800:11' at master log mysql_bin.000005, end_log_pos 193808. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
结论3

主从同步异常: Slave_SQL_Running: No,Last_Errno: 1396

总结

  1. 针对生产环境的情况在不清楚主从关系情况下,
  1. 先用show slave status\G检测是否是主库
  2. 用select * from information_schema.​​PROCESSLIST​​ where state like '%binlog%'\G检测从库IP端口等信息。
  1. 保证执行所有的语句一定要在主库执行!!!