Orchestrator Lastest Slave选择逻辑"不合理"导致的数据丢失问题

2022-09-22 阅读量

Orchestrator Lastest Slave选择逻辑”不合理”导致的数据丢失问题

首先说明, “不合理”只是站在个人角度的结论. 到底合理不合理, 我说了不算.
以我对Orchestrator的了解, Orchestrator目标是追求可用性优先, 而非数据完整性. 很多公司也使用了Orchestrator, 我感觉未必知道有这个问题, 或者说, 别问, 问就是”我们追求可用性”.

现在的问题是, 即便开了半同步, 也可能丢数据.

然而矛盾的点是, 线上主从的复制延迟是大家都要监控和管理的, 不会长期处于高延迟状态, 起码我经历的公司都是这样, 99.9%的集群主从延迟在1s内. 个别集群在高峰期会升高一点, 但很快又会下降; 又或者这些集群本身就是AP型业务.

那么既然我们可以保证复制延迟小于1s, 根据ReadBinlogCoordinates选择Latest slave又能导致”恢复时间”增大多少呢? 而为了这X秒的快速恢复, 你又要花多少时间修复数据呢?

问题描述

我在2021年2月调研orchestrator时发现了这个问题, 问题的详细描述和复现方法见issue#1312, 我在这里简单描述一下.
原生orchestrator根据ExecBinlogCoordinates选择Latest Slave. 即ExecBinlogCoordinates大的为Latest Slave. 但ExecBinlogCoordinates大的副本不一定有最全的日志.

ExecBinlogCoordinates 是应用binlog坐标的意思
对应show slave status中的
 - Relay_Master_Log_File
 - Exec_Master_Log_Pos

表示sql_thread已经应用了主库哪个binlog哪个位置的日志.

假设我有一个一主两从复制集, 半同步开启, rpl_semi_sync_master_wait_for_slave_count = 1
M 是主, S1 和 S2 是从

在某个时间点:

S1 的 ExecBinlogCoordinates < S2 的 ExecBinlogCoordinate
S2 网络有问题. 一些binlog事件还没有写入S2的relay log，但是IO_THREAD状态还在running.

此时, 主库宕机, 原生orchestrator会选择S2作为新主库, 这会导致数据丢失

在学习orchestrator源码后, 我修改了Latest Slave的选择逻辑, 解决了上述问题, 修改内容详见fix-1312

测试环境说明

master fanboshi@127.0.0.1 15:25:08 [dbms_monitor]> select * from cluster;
+----+-----------+--------------+--------------------+-------------+-------------+--------------------+------+----------------+---------------------+---------------------+
| id | env       | cluster_name | cluster_xxx_domain | datacenter  | region      | hostname           | port | promotion_rule | D_UPDATED_AT        | D_CREATED_AT        |
+----+-----------+--------------+--------------------+-------------+-------------+--------------------+------+----------------+---------------------+---------------------+
|  1 | dev       | orc_dev_test | NULL               | cn-north-1a | 123456Cloud | bj1-mysql-test-01  | 3310 | neutral        | 2022-06-14 21:40:34 | 2022-06-14 21:40:34 |
|  2 | dev       | orc_dev_test | NULL               | cn-north-1a | 123456Cloud | bj1-mysql-test-03  | 3310 | neutral        | 2022-06-18 11:33:50 | 2022-06-14 21:40:34 |
|  3 | dev       | orc_dev_test | NULL               | cn-north-1a | 123456Cloud | bj1-mysql-test-02  | 3310 | neutral        | 2022-06-14 21:40:34 | 2022-06-14 21:40:34 |
+----+-----------+--------------+--------------------+-------------+-------------+--------------------+------+----------------+---------------------+---------------------+
3 rows in set (0.00 sec)

为了方便测试, 在所有从库

stop slave;
set global slave_net_timeout = 3600;
CHANGE MASTER TO MASTER_CONNECT_RETRY=1, MASTER_RETRY_COUNT=86400, MASTER_HEARTBEAT_PERIOD=1800;
start slave;

slave1 root@localhost 15:46:17 [(none)]> show global status like '%semi%';
+--------------------------------------------+----------+
| Variable_name                              | Value    |
+--------------------------------------------+----------+
| Rpl_semi_sync_master_clients               | 0        |
| Rpl_semi_sync_master_net_avg_wait_time     | 0        |
| Rpl_semi_sync_master_net_wait_time         | 0        |
| Rpl_semi_sync_master_net_waits             | 3074     |
| Rpl_semi_sync_master_no_times              | 3        |
| Rpl_semi_sync_master_no_tx                 | 0        |
| Rpl_semi_sync_master_status                | OFF      |
| Rpl_semi_sync_master_timefunc_failures     | 0        |
| Rpl_semi_sync_master_tx_avg_wait_time      | 16165    |
| Rpl_semi_sync_master_tx_wait_time          | 47722030 |
| Rpl_semi_sync_master_tx_waits              | 2952     |
| Rpl_semi_sync_master_wait_pos_backtraverse | 3        |
| Rpl_semi_sync_master_wait_sessions         | 0        |
| Rpl_semi_sync_master_yes_tx                | 2950     |
| Rpl_semi_sync_slave_status                 | ON       |
+--------------------------------------------+----------+
15 rows in set (0.00 sec)
 
slave1 root@localhost 15:48:16 [(none)]> select * from mysql.slave_master_info\G
*************************** 1. row ***************************
       Number_of_lines: 25
       Master_log_name: mysql-bin.000001
        Master_log_pos: 154
                  Host: 192.168.xx.93
             User_name: repler
         User_password: superpass
                  Port: 3310
         Connect_retry: 1
           Enabled_ssl: 0
                Ssl_ca:
            Ssl_capath:
              Ssl_cert:
            Ssl_cipher:
               Ssl_key:
Ssl_verify_server_cert: 0
             Heartbeat: 1800
                  Bind:
    Ignored_server_ids: 0
                  Uuid: 635da820-ebe7-11ec-8550-fa163e6d6e43
           Retry_count: 86400
               Ssl_crl:
           Ssl_crlpath:
 Enabled_auto_position: 1
          Channel_name:
           Tls_version:
1 row in set (0.00 sec)

slave2 root@localhost 15:46:20 [(none)]> show global status like '%semi%';
+--------------------------------------------+-------+
| Variable_name                              | Value |
+--------------------------------------------+-------+
| Rpl_semi_sync_master_clients               | 0     |
| Rpl_semi_sync_master_net_avg_wait_time     | 0     |
| Rpl_semi_sync_master_net_wait_time         | 0     |
| Rpl_semi_sync_master_net_waits             | 0     |
| Rpl_semi_sync_master_no_times              | 0     |
| Rpl_semi_sync_master_no_tx                 | 0     |
| Rpl_semi_sync_master_status                | OFF   |
| Rpl_semi_sync_master_timefunc_failures     | 0     |
| Rpl_semi_sync_master_tx_avg_wait_time      | 0     |
| Rpl_semi_sync_master_tx_wait_time          | 0     |
| Rpl_semi_sync_master_tx_waits              | 0     |
| Rpl_semi_sync_master_wait_pos_backtraverse | 0     |
| Rpl_semi_sync_master_wait_sessions         | 0     |
| Rpl_semi_sync_master_yes_tx                | 0     |
| Rpl_semi_sync_slave_status                 | ON    |
+--------------------------------------------+-------+
15 rows in set (0.01 sec)
 
slave2 root@localhost 15:48:21 [(none)]> select * from mysql.slave_master_info\G
*************************** 1. row ***************************
       Number_of_lines: 25
       Master_log_name: mysql-bin.000001
        Master_log_pos: 154
                  Host: 192.168.xx.93
             User_name: repler
         User_password: superpass
                  Port: 3310
         Connect_retry: 1
           Enabled_ssl: 0
                Ssl_ca:
            Ssl_capath:
              Ssl_cert:
            Ssl_cipher:
               Ssl_key:
Ssl_verify_server_cert: 0
             Heartbeat: 1800
                  Bind:
    Ignored_server_ids: 0
                  Uuid: 635da820-ebe7-11ec-8550-fa163e6d6e43
           Retry_count: 86400
               Ssl_crl:
           Ssl_crlpath:
 Enabled_auto_position: 1
          Channel_name:
           Tls_version:
1 row in set (0.00 sec)

主库设置

set global rpl_semi_sync_master_timeout=4294967295;

master fanboshi@127.0.0.1 15:47:08 [dbms_monitor]> set global rpl_semi_sync_master_timeout=4294967295;
Query OK, 0 rows affected (0.00 sec)
 
fanboshi@127.0.0.1 15:47:30 [dbms_monitor]> show global status like '%semi%';
+--------------------------------------------+-----------+
| Variable_name                              | Value     |
+--------------------------------------------+-----------+
| Rpl_semi_sync_master_clients               | 2         |
| Rpl_semi_sync_master_net_avg_wait_time     | 0         |
| Rpl_semi_sync_master_net_wait_time         | 0         |
| Rpl_semi_sync_master_net_waits             | 2689143   |
| Rpl_semi_sync_master_no_times              | 4         |
| Rpl_semi_sync_master_no_tx                 | 1         |
| Rpl_semi_sync_master_status                | ON        |
| Rpl_semi_sync_master_timefunc_failures     | 0         |
| Rpl_semi_sync_master_tx_avg_wait_time      | 439       |
| Rpl_semi_sync_master_tx_wait_time          | 592895132 |
| Rpl_semi_sync_master_tx_waits              | 1349976   |
| Rpl_semi_sync_master_wait_pos_backtraverse | 3         |
| Rpl_semi_sync_master_wait_sessions         | 0         |
| Rpl_semi_sync_master_yes_tx                | 1349975   |
| Rpl_semi_sync_slave_status                 | OFF       |
+--------------------------------------------+-----------+
15 rows in set (0.00 sec)
 
master fanboshi@127.0.0.1 15:47:58 [dbms_monitor]> show global variables like '%semi%';
+-------------------------------------------+------------+
| Variable_name                             | Value      |
+-------------------------------------------+------------+
| rpl_semi_sync_master_enabled              | ON         |
| rpl_semi_sync_master_timeout              | 4294967295 |
| rpl_semi_sync_master_trace_level          | 32         |
| rpl_semi_sync_master_wait_for_slave_count | 1          |
| rpl_semi_sync_master_wait_no_slave        | ON         |
| rpl_semi_sync_master_wait_point           | AFTER_SYNC |
| rpl_semi_sync_slave_enabled               | ON         |
| rpl_semi_sync_slave_trace_level           | 32         |
+-------------------------------------------+------------+
8 rows in set (0.00 sec)

通过一个脚本模拟数据写入test.t1表

cat insert.py
#!/root/.pyenv/versions/myrecover/bin/python
# -*- coding: utf8 -*-
# __author__ = 'Fan()'
# Date: 2022-06-01
 
import time
import pymysql
 
conn = pymysql.connect(host='192.168.xx.93', port=16033, user='dbms_monitor_rw', password='superpass',
                       db='test', connect_timeout=1, read_timeout=1, write_timeout=1,
                       autocommit=True, charset='utf8')
cursor = conn.cursor(pymysql.cursors.DictCursor)
 
if __name__ == '__main__':
    while True:
        time.sleep(0.5)
        try:
            conn.ping(reconnect=True)
            cursor.execute("replace into t1(id, ctime) values(1, now());")
        except pymysql.Error as e:
            sql = self.cursor.mogrify(sql)
            print(sql)

原生orchestrator问题复现

在s1, 执行以下语句, 锁住t1表

1	slave1 root@localhost 20:57:04 [test]> begin;lock table test.t1 read;

此后, s1 由于被锁阻塞, ExecBinlogCoordinates不在变化, 但仍然可以正常接收主库发来的日志

slave1 root@localhost 20:57:12 [(none)]> pager cat - | grep -E 'Master_Log_File|Relay_Master_Log_File|Read_Master_Log_Pos|Exec_Master_Log_Pos|Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master|Slave_SQL_Running_State|Last|Relay_Log_File|Relay_Log_Pos'
PAGER set to 'cat - | grep -E 'Master_Log_File|Relay_Master_Log_File|Read_Master_Log_Pos|Exec_Master_Log_Pos|Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master|Slave_SQL_Running_State|Last|Relay_Log_File|Relay_Log_Pos''
slave1 root@localhost 20:57:12 [(none)]> show slave status\G
              Master_Log_File: mysql-bin.000008
          Read_Master_Log_Pos: 4274259
               Relay_Log_File: mysql-relay-bin.000003
                Relay_Log_Pos: 4167362
        Relay_Master_Log_File: mysql-bin.000008
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
                   Last_Errno: 0
                   Last_Error:
          Exec_Master_Log_Pos: 4167229
        Seconds_Behind_Master: 6
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
      Slave_SQL_Running_State: Waiting for dependent transaction to commit
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
1 row in set (0.00 sec)

在s2, 执行以下语句, 模拟s2与主库网络存在异常, 无法接受主库发来的日志

tc qdisc del dev eth0 root
tc qdisc add dev eth0 root handle 1: prio
tc filter add dev eth0 protocol ip parent 1: prio 1 u32 match ip dst 192.168.xx.93 match ip dport 3310 0xffff flowid 1:1
tc filter add dev eth0 protocol all parent 1: prio 2 u32 match ip dst 0.0.0.0/0 flowid 1:2
tc filter add dev eth0 protocol all parent 1: prio 2 u32 match ip protocol 1 0xff flowid 1:2
tc qdisc add dev eth0 parent 1:1 handle 10: netem delay 180000ms
tc qdisc add dev eth0 parent 1:2 handle 20: sfq

slave2 root@localhost 20:57:31 [(none)]> pager
Default pager wasn't set, using stdout.
slave2 root@localhost 20:57:31 [(none)]> pager cat - | grep -E 'Master_Log_File|Relay_Master_Log_File|Read_Master_Log_Pos|Exec_Master_Log_Pos|Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master|Slave_SQL_Running_State|Last|Relay_Log_File|Relay_Log_Pos'
PAGER set to 'cat - | grep -E 'Master_Log_File|Relay_Master_Log_File|Read_Master_Log_Pos|Exec_Master_Log_Pos|Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master|Slave_SQL_Running_State|Last|Relay_Log_File|Relay_Log_Pos''
slave2 root@localhost 20:57:33 [(none)]> show slave status\G
              Master_Log_File: mysql-bin.000008
          Read_Master_Log_Pos: 4211217
               Relay_Log_File: mysql-relay-bin.000003
                Relay_Log_Pos: 4211350
        Relay_Master_Log_File: mysql-bin.000008
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
                   Last_Errno: 0
                   Last_Error:
          Exec_Master_Log_Pos: 4211217
        Seconds_Behind_Master: 0
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
      Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
1 row in set (0.01 sec)

S1
              Master_Log_File: mysql-bin.000008  
          Read_Master_Log_Pos: 4274259

        Relay_Master_Log_File: mysql-bin.000008  
          Exec_Master_Log_Pos: 4167229

S2
              Master_Log_File: mysql-bin.000008  
          Read_Master_Log_Pos: 4211217

        Relay_Master_Log_File: mysql-bin.000008  
          Exec_Master_Log_Pos: 4211217

可以看到

ExecBinlogCoordinates: S2>S1
ReadBinlogCoordinates: S1>S2

查看orchestrator日志

1
2
3

2022-06-18 21:01:13 INFO Stopped replication on 192.168.xx.96:3310, Self:mysql-bin.000001:472290401, Exec:mysql-bin.000008:4167229  
2022-06-18 21:01:13 DEBUG - sorted replica: 192.168.xx.95:3310 mysql-bin.000008:4211217  
2022-06-18 21:01:13 DEBUG - sorted replica: 192.168.xx.96:3310 mysql-bin.000008:4167229

在审计日志搜索4211217和4167229

('192.168.xx.96', 3310, NOW(), NOW(), 1, 343269, 14210, '63648d0d-ebe7-11ec-b9ee-fa163eda3a6e', '5.7.32-35-log', '5.7', 'Percona Server (GPL), Release 35, Revision 5688520', 0, 1, 'ROW', 'FULL', 1, 1, 'mysql-bin.000001', 472290401, '192.168.xx.93', 3310, 0, 0, 0, 0, 0, 1, 1, '635da820-ebe7-11ec-8550-fa163e6d6e43', '63648d0d-ebe7-11ec-b9ee-fa163eda3a6e', '635da820-ebe7-11ec-8550-fa163e6d6e43:1-1407359,\\n63648d0d-ebe7-11ec-b9ee-fa163eda3a6e:1-3756,\\n64b94f66-ebe7-11ec-b57f-00163e095602:1-188', 'ON', '', '', 0, 0, 'mysql-bin.000008', 4274259, 'mysql-bin.000008', 4167229, 'mysql-relay-bin.000003', 4167362, '', '\\\"error reconnecting to master \\'repler@192.168.xx.93:3310\\' - retry-time: 60  retries: 3\\\"', NULL, 246, 0, 0, '[]', '192.168.xx.93:3310', 'orc_dev_test', 'cn-north-1a', '123456Cloud', 'dev', 1, 0, 1, 1, 0, 1, 1, 0, 10000, 1, 1, 0, 0, 0, 'bj1-mysql-test-02_3310', 13211977, '', 0, '', '', '[]', '', 0, NOW())\n        ON DUPLICATE KEY UPDATE

('192.168.xx.95', 3310, NOW(), NOW(), 1, 343268, 57903, '64b94f66-ebe7-11ec-b57f-00163e095602', '5.7.32-35-log', '5.7', 'Percona Server (GPL), Release 35, Revision 5688520', 0, 1, 'ROW', 'FULL', 1, 1, 'mysql-bin.000001', 472227305, '192.168.xx.93', 3310, 0, 0, 0, 0, 0, 1, 1, '635da820-ebe7-11ec-8550-fa163e6d6e43', '64b94f66-ebe7-11ec-b57f-00163e095602', '635da820-ebe7-11ec-8550-fa163e6d6e43:1-1407485,\\n63648d0d-ebe7-11ec-b9ee-fa163eda3a6e:1-3756,\\n64b94f66-ebe7-11ec-b57f-00163e095602:1-188', 'ON', '', '', 0, 0, 'mysql-bin.000008', 4211217, 'mysql-bin.000008', 4211217, 'mysql-relay-bin.000003', 4211350, '', '\\\"error reconnecting to master \\'repler@192.168.xx.93:3310\\' - retry-time: 1  retries: 1\\\"', NULL, 225, 0, 0, '[]', '192.168.xx.93:3310', 'orc_dev_test', 'cn-north-1a', '123456Cloud', 'dev', 1, 0, 1, 1, 0, 1, 1, 0, 10000, 1, 1, 0, 0, 0, 'bj1-mysql-test-03_3310', 51391021, '', 0, '', '', '[]', '', 0, NOW())\n        ON DUPLICATE KEY UPDATE\n

kill -9 主库
等待日志 ERROR WaitForSQLThreadUpToDate timeout on 192.168.xx.96:3310 after duration 10s
s1解锁, 此时仍然不解除s2延迟, 等待failover完成

Failover完成, ExecBinlogCoordinates大的S2成为新主库

查看S2 Executed_Gtid_Set

slave2 root@localhost 21:02:05 [(none)]> show master status;
+------------------+-----------+--------------+------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
| File             | Position  | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                                                                                                                       |
+------------------+-----------+--------------+------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
| mysql-bin.000001 | 472231613 |              |                  | 635da820-ebe7-11ec-8550-fa163e6d6e43:1-1407485,
63648d0d-ebe7-11ec-b9ee-fa163eda3a6e:1-3756,
64b94f66-ebe7-11ec-b57f-00163e095602:1-200 |
+------------------+-----------+--------------+------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

启动旧主库, 查看Executed_Gtid_Set

master root@localhost 21:02:28 [(none)]> show master status;
+------------------+----------+--------------+------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
| File             | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                                                                                                                       |
+------------------+----------+--------------+------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
| mysql-bin.000009 |      274 |              |                  | 635da820-ebe7-11ec-8550-fa163e6d6e43:1-1407665,
63648d0d-ebe7-11ec-b9ee-fa163eda3a6e:1-3756,
64b94f66-ebe7-11ec-b57f-00163e095602:1-188 |
+------------------+----------+--------------+------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

可以看到

旧主库 635da820-ebe7-11ec-8550-fa163e6d6e43:1-1407665
新主库(S2) 635da820-ebe7-11ec-8550-fa163e6d6e43:1-1407485

数据丢失了
日志就不贴了, 没什么意义, 有兴趣可以自己尝试复现

修改后功能验证

设置MinimiseDataLoss=true, 重启orc

官方没这个参数, 这是自己改代码加的参数

在s1, 执行以下语句, 锁住t1表

slave1 root@localhost 21:16:05 [test]> begin;lock table test.t1 read;
Query OK, 0 rows affected (0.00 sec)
 
Query OK, 0 rows affected (0.00 sec)

此后, s1 由于被锁阻塞, ExecBinlogCoordinates不在变化, 但仍然可以正常接收主库发来的日志

slave1 root@localhost 21:16:21 [(none)]> pager cat - | grep -E 'Master_Log_File|Relay_Master_Log_File|Read_Master_Log_Pos|Exec_Master_Log_Pos|Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master|Slave_SQL_Running_State|Last|Relay_Log_File|Relay_Log_Pos'
PAGER set to 'cat - | grep -E 'Master_Log_File|Relay_Master_Log_File|Read_Master_Log_Pos|Exec_Master_Log_Pos|Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master|Slave_SQL_Running_State|Last|Relay_Log_File|Relay_Log_Pos''
slave1 root@localhost 21:18:08 [(none)]> show slave status\G
              Master_Log_File: mysql-bin.000009
          Read_Master_Log_Pos: 1411494
               Relay_Log_File: mysql-relay-bin.000003
                Relay_Log_Pos: 1259657
        Relay_Master_Log_File: mysql-bin.000009
             Slave_IO_Running: Connecting
            Slave_SQL_Running: Yes
                   Last_Errno: 0
                   Last_Error:
          Exec_Master_Log_Pos: 1274711
        Seconds_Behind_Master: 121
                Last_IO_Errno: 2003
                Last_IO_Error: error reconnecting to master 'repler@192.168.xx.93:3310' - retry-time: 60  retries: 1
               Last_SQL_Errno: 0
               Last_SQL_Error:
      Slave_SQL_Running_State: Waiting for dependent transaction to commit
      Last_IO_Error_Timestamp: 220618 21:17:16
     Last_SQL_Error_Timestamp:
1 row in set (0.00 sec)

在s2, 执行以下语句, 模拟s2与主库网络存在异常, 无法接受主库发来的日志

tc qdisc del dev eth0 root
tc qdisc add dev eth0 root handle 1: prio
tc filter add dev eth0 protocol ip parent 1: prio 1 u32 match ip dst 192.168.xx.93 match ip dport 3310 0xffff flowid 1:1
tc filter add dev eth0 protocol all parent 1: prio 2 u32 match ip dst 0.0.0.0/0 flowid 1:2
tc filter add dev eth0 protocol all parent 1: prio 2 u32 match ip protocol 1 0xff flowid 1:2
tc qdisc add dev eth0 parent 1:1 handle 10: netem delay 180000ms
tc qdisc add dev eth0 parent 1:2 handle 20: sfq

slave2 root@localhost 21:16:45 [(none)]> pager cat - | grep -E 'Master_Log_File|Relay_Master_Log_File|Read_Master_Log_Pos|Exec_Master_Log_Pos|Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master|Slave_SQL_Running_State|Last|Relay_Log_File|Relay_Log_Pos'
PAGER set to 'cat - | grep -E 'Master_Log_File|Relay_Master_Log_File|Read_Master_Log_Pos|Exec_Master_Log_Pos|Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master|Slave_SQL_Running_State|Last|Relay_Log_File|Relay_Log_Pos''
slave2 root@localhost 21:18:15 [(none)]> show slave status\G
              Master_Log_File: mysql-bin.000009
          Read_Master_Log_Pos: 1349240
               Relay_Log_File: mysql-relay-bin.000003
                Relay_Log_Pos: 1335560
        Relay_Master_Log_File: mysql-bin.000009
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
                   Last_Errno: 0
                   Last_Error:
          Exec_Master_Log_Pos: 1349240
        Seconds_Behind_Master: 0
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
      Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
1 row in set (0.00 sec)

S1

              Master_Log_File: mysql-bin.000009
          Read_Master_Log_Pos: 1411494
        Relay_Master_Log_File: mysql-bin.000009
          Exec_Master_Log_Pos: 1274711

S2
              Master_Log_File: mysql-bin.000009
          Read_Master_Log_Pos: 1349240
        Relay_Master_Log_File: mysql-bin.000009
          Exec_Master_Log_Pos: 1349240

可以看到

ExecBinlogCoordinates: S2>S1
ReadBinlogCoordinates: S1>S2

查看orchestrator日志

1
2
3

2022-06-18 21:20:20 INFO Stopped replication on 192.168.xx.96:3310, Self:mysql-bin.000001:473561005, Exec:mysql-bin.000009:1274711
2022-06-18 21:20:20 DEBUG - sorted replica: 192.168.xx.96:3310 mysql-bin.000009:1274711
2022-06-18 21:20:20 DEBUG - sorted replica: 192.168.22.95:3310 mysql-bin.000009:1349240

在审计日志搜索1274711和1349240

('192.168.xx.96', 3310, NOW(), NOW(), 1, 344416, 14210, '63648d0d-ebe7-11ec-b9ee-fa163eda3a6e', '5.7.32-35-log', '5.7', 'Percona Server (GPL), Release 35, Revision 5688520', 0, 1, 'ROW', 'FULL', 1, 1, 'mysql-bin.000001', 473561005, '192.168.xx.93', 3310, 0, 0, 0, 0, 0, 1, 1, '635da820-ebe7-11ec-8550-fa163e6d6e43', '635da820-ebe7-11ec-8550-fa163e6d6e43,63648d0d-ebe7-11ec-b9ee-fa163eda3a6e', '635da820-ebe7-11ec-8550-fa163e6d6e43:1-1411198,\\n63648d0d-ebe7-11ec-b9ee-fa163eda3a6e:1-3756,\\n64b94f66-ebe7-11ec-b57f-00163e095602:1-230', 'ON', '', '', 0, 0, 'mysql-bin.000009', 1411494, 'mysql-bin.000009', 1274711, 'mysql-relay-bin.000003', 1261031, '', '\\\"error reconnecting to master \\'repler@192.168.xx.93:3310\\' - retry-time: 60  retries: 3\\\"', NULL, 252, 0, 0, '[]', '192.168.xx.93:3310', 'orc_test_dev', 'cn-north-1a', '123456Cloud', 'dev', 1, 0, 1, 1, 0, 1, 1, 0, 10000, 1, 1, 0, 0, 0, 'bj1-mysql-test-dev-02_3310', 7597010, '', 0, '', '', '[]', '', 0, NOW())\n        ON DUPLICATE KEY UPDATE\n

('192.168.22.95', 3310, NOW(), NOW(), 1, 344415, 57903, '64b94f66-ebe7-11ec-b57f-00163e095602', '5.7.32-35-log', '5.7', 'Percona Server (GPL), Release 35, Revision 5688520', 0, 1, 'ROW', 'FULL', 1, 1, 'mysql-bin.000001', 473527169, '192.168.xx.93', 3310, 0, 0, 0, 0, 0, 1, 1, '635da820-ebe7-11ec-8550-fa163e6d6e43', '635da820-ebe7-11ec-8550-fa163e6d6e43,64b94f66-ebe7-11ec-b57f-00163e095602', '635da820-ebe7-11ec-8550-fa163e6d6e43:1-1411411,\\n63648d0d-ebe7-11ec-b9ee-fa163eda3a6e:1-3756,\\n64b94f66-ebe7-11ec-b57f-00163e095602:1-230', 'ON', '', '', 0, 0, 'mysql-bin.000009', 1349240, 'mysql-bin.000009', 1349240, 'mysql-relay-bin.000003', 1335560, '', '\\\"error reconnecting to master \\'repler@192.168.xx.93:3310\\' - retry-time: 60  retries: 1\\\"', NULL, 221, 0, 0, '[]', '192.168.xx.93:3310', 'orc_test_dev', 'cn-north-1a', '123456Cloud', 'dev', 1, 0, 1, 1, 0, 1, 1, 0, 10000, 1, 1, 0, 0, 0, 'hb3a-mysql-test-dev-03_3310', 75173733, '', 0, '', '', '[]', '', 0, NOW())\n        ON DUPLICATE KEY UPDATE\n

kill -9主库

等待日志 ERROR WaitForSQLThreadUpToDate timeout on 192.168.xx.96:3310 after duration 10s

s1解锁, 此时仍然不解除s2延迟, 等待failover完成

Failover完成, ReadBinlogCoordinates大的S1成为新主库

查看S1 Executed_Gtid_Set

master root@localhost 21:21:06 [(none)]> show master status;
+------------------+----------+--------------+------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
| File             | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                                                                                                                       |
+------------------+----------+--------------+------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
| mysql-bin.000010 |      274 |              |                  | 635da820-ebe7-11ec-8550-fa163e6d6e43:1-1411589,
63648d0d-ebe7-11ec-b9ee-fa163eda3a6e:1-3756,
64b94f66-ebe7-11ec-b57f-00163e095602:1-230 |
+------------------+----------+--------------+------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

启动旧主库, 查看Executed_Gtid_Set

slave1 root@localhost 21:20:20 [infra_orc_test]> show master status;
+------------------+-----------+--------------+------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
| File             | Position  | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set                                                                                                                       |
+------------------+-----------+--------------+------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
| mysql-bin.000001 | 473771250 |              |                  | 635da820-ebe7-11ec-8550-fa163e6d6e43:1-1411589,
63648d0d-ebe7-11ec-b9ee-fa163eda3a6e:1-3991,
64b94f66-ebe7-11ec-b57f-00163e095602:1-230 |
+------------------+-----------+--------------+------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

可以看到

旧主库 635da820-ebe7-11ec-8550-fa163e6d6e43:1-1411589
新主库(S1) 635da820-ebe7-11ec-8550-fa163e6d6e43:1-1411589
数据没有丢失!