Orchestrator Lastest Slave选择逻辑”不合理”导致的数据丢失问题
首先说明, “不合理”只是站在个人角度的结论. 到底合理不合理, 我说了不算.
以我对Orchestrator的了解, Orchestrator目标是追求可用性优先, 而非数据完整性. 很多公司也使用了Orchestrator, 我感觉未必知道有这个问题, 或者说, 别问, 问就是”我们追求可用性”.
现在的问题是, 即便开了半同步, 也可能丢数据.
然而矛盾的点是, 线上主从的复制延迟是大家都要监控和管理的, 不会长期处于高延迟状态, 起码我经历的公司都是这样, 99.9%的集群主从延迟在1s内. 个别集群在高峰期会升高一点, 但很快又会下降; 又或者这些集群本身就是AP型业务.
那么既然我们可以保证复制延迟小于1s, 根据ReadBinlogCoordinates
选择Latest slave又能导致”恢复时间”增大多少呢? 而为了这X秒的快速恢复, 你又要花多少时间修复数据呢?
问题描述
我在2021年2月调研orchestrator时发现了这个问题, 问题的详细描述和复现方法见issue#1312, 我在这里简单描述一下.
原生orchestrator根据ExecBinlogCoordinates选择Latest Slave. 即ExecBinlogCoordinates大的为Latest Slave. 但ExecBinlogCoordinates大的副本不一定有最全的日志.
1 2 3 4 5 6
| ExecBinlogCoordinates 是应用binlog坐标的意思 对应show slave status中的 - Relay_Master_Log_File - Exec_Master_Log_Pos
表示sql_thread已经应用了主库哪个binlog哪个位置的日志.
|
假设我有一个一主两从复制集, 半同步开启, rpl_semi_sync_master_wait_for_slave_count = 1
M 是主, S1 和 S2 是从
在某个时间点:
- S1 的 ExecBinlogCoordinates < S2 的 ExecBinlogCoordinate
- S2 网络有问题. 一些binlog事件还没有写入S2的relay log,但是IO_THREAD状态还在running.
此时, 主库宕机, 原生orchestrator会选择S2作为新主库, 这会导致数据丢失
在学习orchestrator源码后, 我修改了Latest Slave的选择逻辑, 解决了上述问题, 修改内容详见fix-1312
测试环境说明
1 2 3 4 5 6 7 8 9
| master fanboshi@127.0.0.1 15:25:08 [dbms_monitor]> select * from cluster; + | id | env | cluster_name | cluster_xxx_domain | datacenter | region | hostname | port | promotion_rule | D_UPDATED_AT | D_CREATED_AT | + | 1 | dev | orc_dev_test | NULL | cn-north-1a | 123456Cloud | bj1-mysql-test-01 | 3310 | neutral | 2022-06-14 21:40:34 | 2022-06-14 21:40:34 | | 2 | dev | orc_dev_test | NULL | cn-north-1a | 123456Cloud | bj1-mysql-test-03 | 3310 | neutral | 2022-06-18 11:33:50 | 2022-06-14 21:40:34 | | 3 | dev | orc_dev_test | NULL | cn-north-1a | 123456Cloud | bj1-mysql-test-02 | 3310 | neutral | 2022-06-14 21:40:34 | 2022-06-14 21:40:34 | + 3 rows in set (0.00 sec)
|
为了方便测试, 在所有从库
1 2 3 4
| stop slave; set global slave_net_timeout = 3600; CHANGE MASTER TO MASTER_CONNECT_RETRY=1, MASTER_RETRY_COUNT=86400, MASTER_HEARTBEAT_PERIOD=1800; start slave;
|
s1
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
| slave1 root@localhost 15:46:17 [(none)]> show global status like '%semi%'; + | Variable_name | Value | + | Rpl_semi_sync_master_clients | 0 | | Rpl_semi_sync_master_net_avg_wait_time | 0 | | Rpl_semi_sync_master_net_wait_time | 0 | | Rpl_semi_sync_master_net_waits | 3074 | | Rpl_semi_sync_master_no_times | 3 | | Rpl_semi_sync_master_no_tx | 0 | | Rpl_semi_sync_master_status | OFF | | Rpl_semi_sync_master_timefunc_failures | 0 | | Rpl_semi_sync_master_tx_avg_wait_time | 16165 | | Rpl_semi_sync_master_tx_wait_time | 47722030 | | Rpl_semi_sync_master_tx_waits | 2952 | | Rpl_semi_sync_master_wait_pos_backtraverse | 3 | | Rpl_semi_sync_master_wait_sessions | 0 | | Rpl_semi_sync_master_yes_tx | 2950 | | Rpl_semi_sync_slave_status | ON | + 15 rows in set (0.00 sec) slave1 root@localhost 15:48:16 [(none)]> select * from mysql.slave_master_info\G *************************** 1. row *************************** Number_of_lines: 25 Master_log_name: mysql-bin.000001 Master_log_pos: 154 Host: 192.168.xx.93 User_name: repler User_password: superpass Port: 3310 Connect_retry: 1 Enabled_ssl: 0 Ssl_ca: Ssl_capath: Ssl_cert: Ssl_cipher: Ssl_key: Ssl_verify_server_cert: 0 Heartbeat: 1800 Bind: Ignored_server_ids: 0 Uuid: 635da820-ebe7-11ec-8550-fa163e6d6e43 Retry_count: 86400 Ssl_crl: Ssl_crlpath: Enabled_auto_position: 1 Channel_name: Tls_version: 1 row in set (0.00 sec)
|
s2
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
| slave2 root@localhost 15:46:20 [(none)]> show global status like '%semi%'; + | Variable_name | Value | + | Rpl_semi_sync_master_clients | 0 | | Rpl_semi_sync_master_net_avg_wait_time | 0 | | Rpl_semi_sync_master_net_wait_time | 0 | | Rpl_semi_sync_master_net_waits | 0 | | Rpl_semi_sync_master_no_times | 0 | | Rpl_semi_sync_master_no_tx | 0 | | Rpl_semi_sync_master_status | OFF | | Rpl_semi_sync_master_timefunc_failures | 0 | | Rpl_semi_sync_master_tx_avg_wait_time | 0 | | Rpl_semi_sync_master_tx_wait_time | 0 | | Rpl_semi_sync_master_tx_waits | 0 | | Rpl_semi_sync_master_wait_pos_backtraverse | 0 | | Rpl_semi_sync_master_wait_sessions | 0 | | Rpl_semi_sync_master_yes_tx | 0 | | Rpl_semi_sync_slave_status | ON | + 15 rows in set (0.01 sec) slave2 root@localhost 15:48:21 [(none)]> select * from mysql.slave_master_info\G *************************** 1. row *************************** Number_of_lines: 25 Master_log_name: mysql-bin.000001 Master_log_pos: 154 Host: 192.168.xx.93 User_name: repler User_password: superpass Port: 3310 Connect_retry: 1 Enabled_ssl: 0 Ssl_ca: Ssl_capath: Ssl_cert: Ssl_cipher: Ssl_key: Ssl_verify_server_cert: 0 Heartbeat: 1800 Bind: Ignored_server_ids: 0 Uuid: 635da820-ebe7-11ec-8550-fa163e6d6e43 Retry_count: 86400 Ssl_crl: Ssl_crlpath: Enabled_auto_position: 1 Channel_name: Tls_version: 1 row in set (0.00 sec)
|
主库设置
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41
| set global rpl_semi_sync_master_timeout=4294967295;
master fanboshi@127.0.0.1 15:47:08 [dbms_monitor]> set global rpl_semi_sync_master_timeout=4294967295; Query OK, 0 rows affected (0.00 sec) fanboshi@127.0.0.1 15:47:30 [dbms_monitor]> show global status like '%semi%'; + | Variable_name | Value | + | Rpl_semi_sync_master_clients | 2 | | Rpl_semi_sync_master_net_avg_wait_time | 0 | | Rpl_semi_sync_master_net_wait_time | 0 | | Rpl_semi_sync_master_net_waits | 2689143 | | Rpl_semi_sync_master_no_times | 4 | | Rpl_semi_sync_master_no_tx | 1 | | Rpl_semi_sync_master_status | ON | | Rpl_semi_sync_master_timefunc_failures | 0 | | Rpl_semi_sync_master_tx_avg_wait_time | 439 | | Rpl_semi_sync_master_tx_wait_time | 592895132 | | Rpl_semi_sync_master_tx_waits | 1349976 | | Rpl_semi_sync_master_wait_pos_backtraverse | 3 | | Rpl_semi_sync_master_wait_sessions | 0 | | Rpl_semi_sync_master_yes_tx | 1349975 | | Rpl_semi_sync_slave_status | OFF | + 15 rows in set (0.00 sec) master fanboshi@127.0.0.1 15:47:58 [dbms_monitor]> show global variables like '%semi%'; + | Variable_name | Value | + | rpl_semi_sync_master_enabled | ON | | rpl_semi_sync_master_timeout | 4294967295 | | rpl_semi_sync_master_trace_level | 32 | | rpl_semi_sync_master_wait_for_slave_count | 1 | | rpl_semi_sync_master_wait_no_slave | ON | | rpl_semi_sync_master_wait_point | AFTER_SYNC | | rpl_semi_sync_slave_enabled | ON | | rpl_semi_sync_slave_trace_level | 32 | + 8 rows in set (0.00 sec)
|
通过一个脚本模拟数据写入test.t1表
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
| cat insert.py
import time import pymysql conn = pymysql.connect(host='192.168.xx.93', port=16033, user='dbms_monitor_rw', password='superpass', db='test', connect_timeout=1, read_timeout=1, write_timeout=1, autocommit=True, charset='utf8') cursor = conn.cursor(pymysql.cursors.DictCursor) if __name__ == '__main__': while True: time.sleep(0.5) try: conn.ping(reconnect=True) cursor.execute("replace into t1(id, ctime) values(1, now());") except pymysql.Error as e: sql = self.cursor.mogrify(sql) print(sql)
|
原生orchestrator问题复现
在s1, 执行以下语句, 锁住t1表
1
| slave1 root@localhost 20:57:04 [test]> begin;lock table test.t1 read;
|
此后, s1 由于被锁阻塞, ExecBinlogCoordinates不在变化, 但仍然可以正常接收主库发来的日志
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
| slave1 root@localhost 20:57:12 [(none)]> pager cat - | grep -E 'Master_Log_File|Relay_Master_Log_File|Read_Master_Log_Pos|Exec_Master_Log_Pos|Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master|Slave_SQL_Running_State|Last|Relay_Log_File|Relay_Log_Pos' PAGER set to 'cat - | grep -E 'Master_Log_File|Relay_Master_Log_File|Read_Master_Log_Pos|Exec_Master_Log_Pos|Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master|Slave_SQL_Running_State|Last|Relay_Log_File|Relay_Log_Pos'' slave1 root@localhost 20:57:12 [(none)]> show slave status\G Master_Log_File: mysql-bin.000008 Read_Master_Log_Pos: 4274259 Relay_Log_File: mysql-relay-bin.000003 Relay_Log_Pos: 4167362 Relay_Master_Log_File: mysql-bin.000008 Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Errno: 0 Last_Error: Exec_Master_Log_Pos: 4167229 Seconds_Behind_Master: 6 Last_IO_Errno: 0 Last_IO_Error: Last_SQL_Errno: 0 Last_SQL_Error: Slave_SQL_Running_State: Waiting for dependent transaction to commit Last_IO_Error_Timestamp: Last_SQL_Error_Timestamp: 1 row in set (0.00 sec)
|
在s2, 执行以下语句, 模拟s2与主库网络存在异常, 无法接受主库发来的日志
1 2 3 4 5 6 7
| tc qdisc del dev eth0 root tc qdisc add dev eth0 root handle 1: prio tc filter add dev eth0 protocol ip parent 1: prio 1 u32 match ip dst 192.168.xx.93 match ip dport 3310 0xffff flowid 1:1 tc filter add dev eth0 protocol all parent 1: prio 2 u32 match ip dst 0.0.0.0/0 flowid 1:2 tc filter add dev eth0 protocol all parent 1: prio 2 u32 match ip protocol 1 0xff flowid 1:2 tc qdisc add dev eth0 parent 1:1 handle 10: netem delay 180000ms tc qdisc add dev eth0 parent 1:2 handle 20: sfq
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
| slave2 root@localhost 20:57:31 [(none)]> pager Default pager wasn't set, using stdout. slave2 root@localhost 20:57:31 [(none)]> pager cat - | grep -E 'Master_Log_File|Relay_Master_Log_File|Read_Master_Log_Pos|Exec_Master_Log_Pos|Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master|Slave_SQL_Running_State|Last|Relay_Log_File|Relay_Log_Pos' PAGER set to 'cat - | grep -E 'Master_Log_File|Relay_Master_Log_File|Read_Master_Log_Pos|Exec_Master_Log_Pos|Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master|Slave_SQL_Running_State|Last|Relay_Log_File|Relay_Log_Pos'' slave2 root@localhost 20:57:33 [(none)]> show slave status\G Master_Log_File: mysql-bin.000008 Read_Master_Log_Pos: 4211217 Relay_Log_File: mysql-relay-bin.000003 Relay_Log_Pos: 4211350 Relay_Master_Log_File: mysql-bin.000008 Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Errno: 0 Last_Error: Exec_Master_Log_Pos: 4211217 Seconds_Behind_Master: 0 Last_IO_Errno: 0 Last_IO_Error: Last_SQL_Errno: 0 Last_SQL_Error: Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates Last_IO_Error_Timestamp: Last_SQL_Error_Timestamp: 1 row in set (0.01 sec)
|
1 2 3 4 5 6 7 8 9 10 11 12 13
| S1 Master_Log_File: mysql-bin.000008 Read_Master_Log_Pos: 4274259
Relay_Master_Log_File: mysql-bin.000008 Exec_Master_Log_Pos: 4167229
S2 Master_Log_File: mysql-bin.000008 Read_Master_Log_Pos: 4211217
Relay_Master_Log_File: mysql-bin.000008 Exec_Master_Log_Pos: 4211217
|
可以看到
- ExecBinlogCoordinates: S2>S1
- ReadBinlogCoordinates: S1>S2
查看orchestrator日志
1 2 3
| 2022-06-18 21:01:13 INFO Stopped replication on 192.168.xx.96:3310, Self:mysql-bin.000001:472290401, Exec:mysql-bin.000008:4167229 2022-06-18 21:01:13 DEBUG - sorted replica: 192.168.xx.95:3310 mysql-bin.000008:4211217 2022-06-18 21:01:13 DEBUG - sorted replica: 192.168.xx.96:3310 mysql-bin.000008:4167229
|
在审计日志搜索4211217和4167229
s1
1
| ('192.168.xx.96', 3310, NOW(), NOW(), 1, 343269, 14210, '63648d0d-ebe7-11ec-b9ee-fa163eda3a6e', '5.7.32-35-log', '5.7', 'Percona Server (GPL), Release 35, Revision 5688520', 0, 1, 'ROW', 'FULL', 1, 1, 'mysql-bin.000001', 472290401, '192.168.xx.93', 3310, 0, 0, 0, 0, 0, 1, 1, '635da820-ebe7-11ec-8550-fa163e6d6e43', '63648d0d-ebe7-11ec-b9ee-fa163eda3a6e', '635da820-ebe7-11ec-8550-fa163e6d6e43:1-1407359,\\n63648d0d-ebe7-11ec-b9ee-fa163eda3a6e:1-3756,\\n64b94f66-ebe7-11ec-b57f-00163e095602:1-188', 'ON', '', '', 0, 0, 'mysql-bin.000008', 4274259, 'mysql-bin.000008', 4167229, 'mysql-relay-bin.000003', 4167362, '', '\\\"error reconnecting to master \\'repler@192.168.xx.93:3310\\' - retry-time: 60 retries: 3\\\"', NULL, 246, 0, 0, '[]', '192.168.xx.93:3310', 'orc_dev_test', 'cn-north-1a', '123456Cloud', 'dev', 1, 0, 1, 1, 0, 1, 1, 0, 10000, 1, 1, 0, 0, 0, 'bj1-mysql-test-02_3310', 13211977, '', 0, '', '', '[]', '', 0, NOW())\n ON DUPLICATE KEY UPDATE
|
s2
1
| ('192.168.xx.95', 3310, NOW(), NOW(), 1, 343268, 57903, '64b94f66-ebe7-11ec-b57f-00163e095602', '5.7.32-35-log', '5.7', 'Percona Server (GPL), Release 35, Revision 5688520', 0, 1, 'ROW', 'FULL', 1, 1, 'mysql-bin.000001', 472227305, '192.168.xx.93', 3310, 0, 0, 0, 0, 0, 1, 1, '635da820-ebe7-11ec-8550-fa163e6d6e43', '64b94f66-ebe7-11ec-b57f-00163e095602', '635da820-ebe7-11ec-8550-fa163e6d6e43:1-1407485,\\n63648d0d-ebe7-11ec-b9ee-fa163eda3a6e:1-3756,\\n64b94f66-ebe7-11ec-b57f-00163e095602:1-188', 'ON', '', '', 0, 0, 'mysql-bin.000008', 4211217, 'mysql-bin.000008', 4211217, 'mysql-relay-bin.000003', 4211350, '', '\\\"error reconnecting to master \\'repler@192.168.xx.93:3310\\' - retry-time: 1 retries: 1\\\"', NULL, 225, 0, 0, '[]', '192.168.xx.93:3310', 'orc_dev_test', 'cn-north-1a', '123456Cloud', 'dev', 1, 0, 1, 1, 0, 1, 1, 0, 10000, 1, 1, 0, 0, 0, 'bj1-mysql-test-03_3310', 51391021, '', 0, '', '', '[]', '', 0, NOW())\n ON DUPLICATE KEY UPDATE\n
|
kill -9 主库
等待日志 ERROR WaitForSQLThreadUpToDate timeout on 192.168.xx.96:3310 after duration 10s
s1解锁, 此时仍然不解除s2延迟, 等待failover完成
Failover完成, ExecBinlogCoordinates大的S2成为新主库
查看S2 Executed_Gtid_Set
1 2 3 4 5 6 7 8 9
| slave2 root@localhost 21:02:05 [(none)]> show master status; + | File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set | + | mysql-bin.000001 | 472231613 | | | 635da820-ebe7-11ec-8550-fa163e6d6e43:1-1407485, 63648d0d-ebe7-11ec-b9ee-fa163eda3a6e:1-3756, 64b94f66-ebe7-11ec-b57f-00163e095602:1-200 | + 1 row in set (0.00 sec)
|
启动旧主库, 查看Executed_Gtid_Set
1 2 3 4 5 6 7 8 9
| master root@localhost 21:02:28 [(none)]> show master status; + | File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set | + | mysql-bin.000009 | 274 | | | 635da820-ebe7-11ec-8550-fa163e6d6e43:1-1407665, 63648d0d-ebe7-11ec-b9ee-fa163eda3a6e:1-3756, 64b94f66-ebe7-11ec-b57f-00163e095602:1-188 | + 1 row in set (0.00 sec)
|
可以看到
- 旧主库 635da820-ebe7-11ec-8550-fa163e6d6e43:1-1407665
- 新主库(S2) 635da820-ebe7-11ec-8550-fa163e6d6e43:1-1407485
数据丢失了
日志就不贴了, 没什么意义, 有兴趣可以自己尝试复现
修改后功能验证
设置MinimiseDataLoss=true
, 重启orc
官方没这个参数, 这是自己改代码加的参数
在s1, 执行以下语句, 锁住t1表
1 2 3 4
| slave1 root@localhost 21:16:05 [test]> begin;lock table test.t1 read; Query OK, 0 rows affected (0.00 sec) Query OK, 0 rows affected (0.00 sec)
|
此后, s1 由于被锁阻塞, ExecBinlogCoordinates不在变化, 但仍然可以正常接收主库发来的日志
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
| slave1 root@localhost 21:16:21 [(none)]> pager cat - | grep -E 'Master_Log_File|Relay_Master_Log_File|Read_Master_Log_Pos|Exec_Master_Log_Pos|Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master|Slave_SQL_Running_State|Last|Relay_Log_File|Relay_Log_Pos' PAGER set to 'cat - | grep -E 'Master_Log_File|Relay_Master_Log_File|Read_Master_Log_Pos|Exec_Master_Log_Pos|Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master|Slave_SQL_Running_State|Last|Relay_Log_File|Relay_Log_Pos'' slave1 root@localhost 21:18:08 [(none)]> show slave status\G Master_Log_File: mysql-bin.000009 Read_Master_Log_Pos: 1411494 Relay_Log_File: mysql-relay-bin.000003 Relay_Log_Pos: 1259657 Relay_Master_Log_File: mysql-bin.000009 Slave_IO_Running: Connecting Slave_SQL_Running: Yes Last_Errno: 0 Last_Error: Exec_Master_Log_Pos: 1274711 Seconds_Behind_Master: 121 Last_IO_Errno: 2003 Last_IO_Error: error reconnecting to master 'repler@192.168.xx.93:3310' - retry-time: 60 retries: 1 Last_SQL_Errno: 0 Last_SQL_Error: Slave_SQL_Running_State: Waiting for dependent transaction to commit Last_IO_Error_Timestamp: 220618 21:17:16 Last_SQL_Error_Timestamp: 1 row in set (0.00 sec)
|
在s2, 执行以下语句, 模拟s2与主库网络存在异常, 无法接受主库发来的日志
1 2 3 4 5 6 7
| tc qdisc del dev eth0 root tc qdisc add dev eth0 root handle 1: prio tc filter add dev eth0 protocol ip parent 1: prio 1 u32 match ip dst 192.168.xx.93 match ip dport 3310 0xffff flowid 1:1 tc filter add dev eth0 protocol all parent 1: prio 2 u32 match ip dst 0.0.0.0/0 flowid 1:2 tc filter add dev eth0 protocol all parent 1: prio 2 u32 match ip protocol 1 0xff flowid 1:2 tc qdisc add dev eth0 parent 1:1 handle 10: netem delay 180000ms tc qdisc add dev eth0 parent 1:2 handle 20: sfq
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
| slave2 root@localhost 21:16:45 [(none)]> pager cat - | grep -E 'Master_Log_File|Relay_Master_Log_File|Read_Master_Log_Pos|Exec_Master_Log_Pos|Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master|Slave_SQL_Running_State|Last|Relay_Log_File|Relay_Log_Pos' PAGER set to 'cat - | grep -E 'Master_Log_File|Relay_Master_Log_File|Read_Master_Log_Pos|Exec_Master_Log_Pos|Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master|Slave_SQL_Running_State|Last|Relay_Log_File|Relay_Log_Pos'' slave2 root@localhost 21:18:15 [(none)]> show slave status\G Master_Log_File: mysql-bin.000009 Read_Master_Log_Pos: 1349240 Relay_Log_File: mysql-relay-bin.000003 Relay_Log_Pos: 1335560 Relay_Master_Log_File: mysql-bin.000009 Slave_IO_Running: Yes Slave_SQL_Running: Yes Last_Errno: 0 Last_Error: Exec_Master_Log_Pos: 1349240 Seconds_Behind_Master: 0 Last_IO_Errno: 0 Last_IO_Error: Last_SQL_Errno: 0 Last_SQL_Error: Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates Last_IO_Error_Timestamp: Last_SQL_Error_Timestamp: 1 row in set (0.00 sec)
|
1 2 3 4 5 6 7 8 9 10 11 12
| S1
Master_Log_File: mysql-bin.000009 Read_Master_Log_Pos: 1411494 Relay_Master_Log_File: mysql-bin.000009 Exec_Master_Log_Pos: 1274711
S2 Master_Log_File: mysql-bin.000009 Read_Master_Log_Pos: 1349240 Relay_Master_Log_File: mysql-bin.000009 Exec_Master_Log_Pos: 1349240
|
可以看到
- ExecBinlogCoordinates: S2>S1
- ReadBinlogCoordinates: S1>S2
查看orchestrator日志
1 2 3
| 2022-06-18 21:20:20 INFO Stopped replication on 192.168.xx.96:3310, Self:mysql-bin.000001:473561005, Exec:mysql-bin.000009:1274711 2022-06-18 21:20:20 DEBUG - sorted replica: 192.168.xx.96:3310 mysql-bin.000009:1274711 2022-06-18 21:20:20 DEBUG - sorted replica: 192.168.22.95:3310 mysql-bin.000009:1349240
|
在审计日志搜索1274711和1349240
S1
1
| ('192.168.xx.96', 3310, NOW(), NOW(), 1, 344416, 14210, '63648d0d-ebe7-11ec-b9ee-fa163eda3a6e', '5.7.32-35-log', '5.7', 'Percona Server (GPL), Release 35, Revision 5688520', 0, 1, 'ROW', 'FULL', 1, 1, 'mysql-bin.000001', 473561005, '192.168.xx.93', 3310, 0, 0, 0, 0, 0, 1, 1, '635da820-ebe7-11ec-8550-fa163e6d6e43', '635da820-ebe7-11ec-8550-fa163e6d6e43,63648d0d-ebe7-11ec-b9ee-fa163eda3a6e', '635da820-ebe7-11ec-8550-fa163e6d6e43:1-1411198,\\n63648d0d-ebe7-11ec-b9ee-fa163eda3a6e:1-3756,\\n64b94f66-ebe7-11ec-b57f-00163e095602:1-230', 'ON', '', '', 0, 0, 'mysql-bin.000009', 1411494, 'mysql-bin.000009', 1274711, 'mysql-relay-bin.000003', 1261031, '', '\\\"error reconnecting to master \\'repler@192.168.xx.93:3310\\' - retry-time: 60 retries: 3\\\"', NULL, 252, 0, 0, '[]', '192.168.xx.93:3310', 'orc_test_dev', 'cn-north-1a', '123456Cloud', 'dev', 1, 0, 1, 1, 0, 1, 1, 0, 10000, 1, 1, 0, 0, 0, 'bj1-mysql-test-dev-02_3310', 7597010, '', 0, '', '', '[]', '', 0, NOW())\n ON DUPLICATE KEY UPDATE\n
|
S2
1
| ('192.168.22.95', 3310, NOW(), NOW(), 1, 344415, 57903, '64b94f66-ebe7-11ec-b57f-00163e095602', '5.7.32-35-log', '5.7', 'Percona Server (GPL), Release 35, Revision 5688520', 0, 1, 'ROW', 'FULL', 1, 1, 'mysql-bin.000001', 473527169, '192.168.xx.93', 3310, 0, 0, 0, 0, 0, 1, 1, '635da820-ebe7-11ec-8550-fa163e6d6e43', '635da820-ebe7-11ec-8550-fa163e6d6e43,64b94f66-ebe7-11ec-b57f-00163e095602', '635da820-ebe7-11ec-8550-fa163e6d6e43:1-1411411,\\n63648d0d-ebe7-11ec-b9ee-fa163eda3a6e:1-3756,\\n64b94f66-ebe7-11ec-b57f-00163e095602:1-230', 'ON', '', '', 0, 0, 'mysql-bin.000009', 1349240, 'mysql-bin.000009', 1349240, 'mysql-relay-bin.000003', 1335560, '', '\\\"error reconnecting to master \\'repler@192.168.xx.93:3310\\' - retry-time: 60 retries: 1\\\"', NULL, 221, 0, 0, '[]', '192.168.xx.93:3310', 'orc_test_dev', 'cn-north-1a', '123456Cloud', 'dev', 1, 0, 1, 1, 0, 1, 1, 0, 10000, 1, 1, 0, 0, 0, 'hb3a-mysql-test-dev-03_3310', 75173733, '', 0, '', '', '[]', '', 0, NOW())\n ON DUPLICATE KEY UPDATE\n
|
kill -9主库
等待日志 ERROR WaitForSQLThreadUpToDate timeout on 192.168.xx.96:3310 after duration 10s
s1解锁, 此时仍然不解除s2延迟, 等待failover完成
Failover完成, ReadBinlogCoordinates大的S1成为新主库
查看S1 Executed_Gtid_Set
1 2 3 4 5 6 7 8 9
| master root@localhost 21:21:06 [(none)]> show master status; + | File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set | + | mysql-bin.000010 | 274 | | | 635da820-ebe7-11ec-8550-fa163e6d6e43:1-1411589, 63648d0d-ebe7-11ec-b9ee-fa163eda3a6e:1-3756, 64b94f66-ebe7-11ec-b57f-00163e095602:1-230 | + 1 row in set (0.00 sec)
|
启动旧主库, 查看Executed_Gtid_Set
1 2 3 4 5 6 7 8 9
| slave1 root@localhost 21:20:20 [infra_orc_test]> show master status; + | File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set | + | mysql-bin.000001 | 473771250 | | | 635da820-ebe7-11ec-8550-fa163e6d6e43:1-1411589, 63648d0d-ebe7-11ec-b9ee-fa163eda3a6e:1-3991, 64b94f66-ebe7-11ec-b57f-00163e095602:1-230 | + 1 row in set (0.00 sec)
|
可以看到
- 旧主库 635da820-ebe7-11ec-8550-fa163e6d6e43:1-1411589
- 新主库(S1) 635da820-ebe7-11ec-8550-fa163e6d6e43:1-1411589
数据没有丢失!