Orchestrator Lastest Slave选择逻辑"不合理"导致的数据丢失问题

Orchestrator Lastest Slave选择逻辑”不合理”导致的数据丢失问题

首先说明, “不合理”只是站在个人角度的结论. 到底合理不合理, 我说了不算.
以我对Orchestrator的了解, Orchestrator目标是追求可用性优先, 而非数据完整性. 很多公司也使用了Orchestrator, 我感觉未必知道有这个问题, 或者说, 别问, 问就是”我们追求可用性”.

现在的问题是, 即便开了半同步, 也可能丢数据.

然而矛盾的点是, 线上主从的复制延迟是大家都要监控和管理的, 不会长期处于高延迟状态, 起码我经历的公司都是这样, 99.9%的集群主从延迟在1s内. 个别集群在高峰期会升高一点, 但很快又会下降; 又或者这些集群本身就是AP型业务.

那么既然我们可以保证复制延迟小于1s, 根据ReadBinlogCoordinates选择Latest slave又能导致”恢复时间”增大多少呢? 而为了这X秒的快速恢复, 你又要花多少时间修复数据呢?

问题描述

我在2021年2月调研orchestrator时发现了这个问题, 问题的详细描述和复现方法见issue#1312, 我在这里简单描述一下.
原生orchestrator根据ExecBinlogCoordinates选择Latest Slave. 即ExecBinlogCoordinates大的为Latest Slave. 但ExecBinlogCoordinates大的副本不一定有最全的日志.

1
2
3
4
5
6
ExecBinlogCoordinates 是应用binlog坐标的意思
对应show slave status中的
- Relay_Master_Log_File
- Exec_Master_Log_Pos

表示sql_thread已经应用了主库哪个binlog哪个位置的日志.

假设我有一个一主两从复制集, 半同步开启, rpl_semi_sync_master_wait_for_slave_count = 1
M 是主, S1 和 S2 是从

在某个时间点:

  • S1 的 ExecBinlogCoordinates < S2 的 ExecBinlogCoordinate
  • S2 网络有问题. 一些binlog事件还没有写入S2的relay log,但是IO_THREAD状态还在running.

此时, 主库宕机, 原生orchestrator会选择S2作为新主库, 这会导致数据丢失

在学习orchestrator源码后, 我修改了Latest Slave的选择逻辑, 解决了上述问题, 修改内容详见fix-1312

测试环境说明

1
2
3
4
5
6
7
8
9
master fanboshi@127.0.0.1 15:25:08 [dbms_monitor]> select * from cluster;
+----+-----------+--------------+--------------------+-------------+-------------+--------------------+------+----------------+---------------------+---------------------+
| id | env | cluster_name | cluster_xxx_domain | datacenter | region | hostname | port | promotion_rule | D_UPDATED_AT | D_CREATED_AT |
+----+-----------+--------------+--------------------+-------------+-------------+--------------------+------+----------------+---------------------+---------------------+
| 1 | dev | orc_dev_test | NULL | cn-north-1a | 123456Cloud | bj1-mysql-test-01 | 3310 | neutral | 2022-06-14 21:40:34 | 2022-06-14 21:40:34 |
| 2 | dev | orc_dev_test | NULL | cn-north-1a | 123456Cloud | bj1-mysql-test-03 | 3310 | neutral | 2022-06-18 11:33:50 | 2022-06-14 21:40:34 |
| 3 | dev | orc_dev_test | NULL | cn-north-1a | 123456Cloud | bj1-mysql-test-02 | 3310 | neutral | 2022-06-14 21:40:34 | 2022-06-14 21:40:34 |
+----+-----------+--------------+--------------------+-------------+-------------+--------------------+------+----------------+---------------------+---------------------+
3 rows in set (0.00 sec)

为了方便测试, 在所有从库

1
2
3
4
stop slave;
set global slave_net_timeout = 3600;
CHANGE MASTER TO MASTER_CONNECT_RETRY=1, MASTER_RETRY_COUNT=86400, MASTER_HEARTBEAT_PERIOD=1800;
start slave;

s1

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
slave1 root@localhost 15:46:17 [(none)]> show global status like '%semi%';
+--------------------------------------------+----------+
| Variable_name | Value |
+--------------------------------------------+----------+
| Rpl_semi_sync_master_clients | 0 |
| Rpl_semi_sync_master_net_avg_wait_time | 0 |
| Rpl_semi_sync_master_net_wait_time | 0 |
| Rpl_semi_sync_master_net_waits | 3074 |
| Rpl_semi_sync_master_no_times | 3 |
| Rpl_semi_sync_master_no_tx | 0 |
| Rpl_semi_sync_master_status | OFF |
| Rpl_semi_sync_master_timefunc_failures | 0 |
| Rpl_semi_sync_master_tx_avg_wait_time | 16165 |
| Rpl_semi_sync_master_tx_wait_time | 47722030 |
| Rpl_semi_sync_master_tx_waits | 2952 |
| Rpl_semi_sync_master_wait_pos_backtraverse | 3 |
| Rpl_semi_sync_master_wait_sessions | 0 |
| Rpl_semi_sync_master_yes_tx | 2950 |
| Rpl_semi_sync_slave_status | ON |
+--------------------------------------------+----------+
15 rows in set (0.00 sec)

slave1 root@localhost 15:48:16 [(none)]> select * from mysql.slave_master_info\G
*************************** 1. row ***************************
Number_of_lines: 25
Master_log_name: mysql-bin.000001
Master_log_pos: 154
Host: 192.168.xx.93
User_name: repler
User_password: superpass
Port: 3310
Connect_retry: 1
Enabled_ssl: 0
Ssl_ca:
Ssl_capath:
Ssl_cert:
Ssl_cipher:
Ssl_key:
Ssl_verify_server_cert: 0
Heartbeat: 1800
Bind:
Ignored_server_ids: 0
Uuid: 635da820-ebe7-11ec-8550-fa163e6d6e43
Retry_count: 86400
Ssl_crl:
Ssl_crlpath:
Enabled_auto_position: 1
Channel_name:
Tls_version:
1 row in set (0.00 sec)

s2

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
slave2 root@localhost 15:46:20 [(none)]> show global status like '%semi%';
+--------------------------------------------+-------+
| Variable_name | Value |
+--------------------------------------------+-------+
| Rpl_semi_sync_master_clients | 0 |
| Rpl_semi_sync_master_net_avg_wait_time | 0 |
| Rpl_semi_sync_master_net_wait_time | 0 |
| Rpl_semi_sync_master_net_waits | 0 |
| Rpl_semi_sync_master_no_times | 0 |
| Rpl_semi_sync_master_no_tx | 0 |
| Rpl_semi_sync_master_status | OFF |
| Rpl_semi_sync_master_timefunc_failures | 0 |
| Rpl_semi_sync_master_tx_avg_wait_time | 0 |
| Rpl_semi_sync_master_tx_wait_time | 0 |
| Rpl_semi_sync_master_tx_waits | 0 |
| Rpl_semi_sync_master_wait_pos_backtraverse | 0 |
| Rpl_semi_sync_master_wait_sessions | 0 |
| Rpl_semi_sync_master_yes_tx | 0 |
| Rpl_semi_sync_slave_status | ON |
+--------------------------------------------+-------+
15 rows in set (0.01 sec)

slave2 root@localhost 15:48:21 [(none)]> select * from mysql.slave_master_info\G
*************************** 1. row ***************************
Number_of_lines: 25
Master_log_name: mysql-bin.000001
Master_log_pos: 154
Host: 192.168.xx.93
User_name: repler
User_password: superpass
Port: 3310
Connect_retry: 1
Enabled_ssl: 0
Ssl_ca:
Ssl_capath:
Ssl_cert:
Ssl_cipher:
Ssl_key:
Ssl_verify_server_cert: 0
Heartbeat: 1800
Bind:
Ignored_server_ids: 0
Uuid: 635da820-ebe7-11ec-8550-fa163e6d6e43
Retry_count: 86400
Ssl_crl:
Ssl_crlpath:
Enabled_auto_position: 1
Channel_name:
Tls_version:
1 row in set (0.00 sec)

主库设置

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
set global rpl_semi_sync_master_timeout=4294967295;

master fanboshi@127.0.0.1 15:47:08 [dbms_monitor]> set global rpl_semi_sync_master_timeout=4294967295;
Query OK, 0 rows affected (0.00 sec)

fanboshi@127.0.0.1 15:47:30 [dbms_monitor]> show global status like '%semi%';
+--------------------------------------------+-----------+
| Variable_name | Value |
+--------------------------------------------+-----------+
| Rpl_semi_sync_master_clients | 2 |
| Rpl_semi_sync_master_net_avg_wait_time | 0 |
| Rpl_semi_sync_master_net_wait_time | 0 |
| Rpl_semi_sync_master_net_waits | 2689143 |
| Rpl_semi_sync_master_no_times | 4 |
| Rpl_semi_sync_master_no_tx | 1 |
| Rpl_semi_sync_master_status | ON |
| Rpl_semi_sync_master_timefunc_failures | 0 |
| Rpl_semi_sync_master_tx_avg_wait_time | 439 |
| Rpl_semi_sync_master_tx_wait_time | 592895132 |
| Rpl_semi_sync_master_tx_waits | 1349976 |
| Rpl_semi_sync_master_wait_pos_backtraverse | 3 |
| Rpl_semi_sync_master_wait_sessions | 0 |
| Rpl_semi_sync_master_yes_tx | 1349975 |
| Rpl_semi_sync_slave_status | OFF |
+--------------------------------------------+-----------+
15 rows in set (0.00 sec)

master fanboshi@127.0.0.1 15:47:58 [dbms_monitor]> show global variables like '%semi%';
+-------------------------------------------+------------+
| Variable_name | Value |
+-------------------------------------------+------------+
| rpl_semi_sync_master_enabled | ON |
| rpl_semi_sync_master_timeout | 4294967295 |
| rpl_semi_sync_master_trace_level | 32 |
| rpl_semi_sync_master_wait_for_slave_count | 1 |
| rpl_semi_sync_master_wait_no_slave | ON |
| rpl_semi_sync_master_wait_point | AFTER_SYNC |
| rpl_semi_sync_slave_enabled | ON |
| rpl_semi_sync_slave_trace_level | 32 |
+-------------------------------------------+------------+
8 rows in set (0.00 sec)

通过一个脚本模拟数据写入test.t1表

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
cat insert.py
#!/root/.pyenv/versions/myrecover/bin/python
# -*- coding: utf8 -*-
# __author__ = 'Fan()'
# Date: 2022-06-01

import time
import pymysql

conn = pymysql.connect(host='192.168.xx.93', port=16033, user='dbms_monitor_rw', password='superpass',
db='test', connect_timeout=1, read_timeout=1, write_timeout=1,
autocommit=True, charset='utf8')
cursor = conn.cursor(pymysql.cursors.DictCursor)

if __name__ == '__main__':
while True:
time.sleep(0.5)
try:
conn.ping(reconnect=True)
cursor.execute("replace into t1(id, ctime) values(1, now());")
except pymysql.Error as e:
sql = self.cursor.mogrify(sql)
print(sql)

原生orchestrator问题复现

在s1, 执行以下语句, 锁住t1表

1
slave1 root@localhost 20:57:04 [test]> begin;lock table test.t1 read;

此后, s1 由于被锁阻塞, ExecBinlogCoordinates不在变化, 但仍然可以正常接收主库发来的日志

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
slave1 root@localhost 20:57:12 [(none)]> pager cat - | grep -E 'Master_Log_File|Relay_Master_Log_File|Read_Master_Log_Pos|Exec_Master_Log_Pos|Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master|Slave_SQL_Running_State|Last|Relay_Log_File|Relay_Log_Pos'
PAGER set to 'cat - | grep -E 'Master_Log_File|Relay_Master_Log_File|Read_Master_Log_Pos|Exec_Master_Log_Pos|Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master|Slave_SQL_Running_State|Last|Relay_Log_File|Relay_Log_Pos''
slave1 root@localhost 20:57:12 [(none)]> show slave status\G
Master_Log_File: mysql-bin.000008
Read_Master_Log_Pos: 4274259
Relay_Log_File: mysql-relay-bin.000003
Relay_Log_Pos: 4167362
Relay_Master_Log_File: mysql-bin.000008
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Last_Errno: 0
Last_Error:
Exec_Master_Log_Pos: 4167229
Seconds_Behind_Master: 6
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Slave_SQL_Running_State: Waiting for dependent transaction to commit
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
1 row in set (0.00 sec)

在s2, 执行以下语句, 模拟s2与主库网络存在异常, 无法接受主库发来的日志

1
2
3
4
5
6
7
tc qdisc del dev eth0 root
tc qdisc add dev eth0 root handle 1: prio
tc filter add dev eth0 protocol ip parent 1: prio 1 u32 match ip dst 192.168.xx.93 match ip dport 3310 0xffff flowid 1:1
tc filter add dev eth0 protocol all parent 1: prio 2 u32 match ip dst 0.0.0.0/0 flowid 1:2
tc filter add dev eth0 protocol all parent 1: prio 2 u32 match ip protocol 1 0xff flowid 1:2
tc qdisc add dev eth0 parent 1:1 handle 10: netem delay 180000ms
tc qdisc add dev eth0 parent 1:2 handle 20: sfq
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
slave2 root@localhost 20:57:31 [(none)]> pager
Default pager wasn't set, using stdout.
slave2 root@localhost 20:57:31 [(none)]> pager cat - | grep -E 'Master_Log_File|Relay_Master_Log_File|Read_Master_Log_Pos|Exec_Master_Log_Pos|Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master|Slave_SQL_Running_State|Last|Relay_Log_File|Relay_Log_Pos'
PAGER set to 'cat - | grep -E 'Master_Log_File|Relay_Master_Log_File|Read_Master_Log_Pos|Exec_Master_Log_Pos|Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master|Slave_SQL_Running_State|Last|Relay_Log_File|Relay_Log_Pos''
slave2 root@localhost 20:57:33 [(none)]> show slave status\G
Master_Log_File: mysql-bin.000008
Read_Master_Log_Pos: 4211217
Relay_Log_File: mysql-relay-bin.000003
Relay_Log_Pos: 4211350
Relay_Master_Log_File: mysql-bin.000008
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Last_Errno: 0
Last_Error:
Exec_Master_Log_Pos: 4211217
Seconds_Behind_Master: 0
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
1 row in set (0.01 sec)
1
2
3
4
5
6
7
8
9
10
11
12
13
S1
Master_Log_File: mysql-bin.000008
Read_Master_Log_Pos: 4274259

Relay_Master_Log_File: mysql-bin.000008
Exec_Master_Log_Pos: 4167229

S2
Master_Log_File: mysql-bin.000008
Read_Master_Log_Pos: 4211217

Relay_Master_Log_File: mysql-bin.000008
Exec_Master_Log_Pos: 4211217

可以看到

  • ExecBinlogCoordinates: S2>S1
  • ReadBinlogCoordinates: S1>S2

查看orchestrator日志

1
2
3
2022-06-18 21:01:13 INFO Stopped replication on 192.168.xx.96:3310, Self:mysql-bin.000001:472290401, Exec:mysql-bin.000008:4167229  
2022-06-18 21:01:13 DEBUG - sorted replica: 192.168.xx.95:3310 mysql-bin.000008:4211217
2022-06-18 21:01:13 DEBUG - sorted replica: 192.168.xx.96:3310 mysql-bin.000008:4167229

在审计日志搜索4211217和4167229

s1

1
('192.168.xx.96', 3310, NOW(), NOW(), 1, 343269, 14210, '63648d0d-ebe7-11ec-b9ee-fa163eda3a6e', '5.7.32-35-log', '5.7', 'Percona Server (GPL), Release 35, Revision 5688520', 0, 1, 'ROW', 'FULL', 1, 1, 'mysql-bin.000001', 472290401, '192.168.xx.93', 3310, 0, 0, 0, 0, 0, 1, 1, '635da820-ebe7-11ec-8550-fa163e6d6e43', '63648d0d-ebe7-11ec-b9ee-fa163eda3a6e', '635da820-ebe7-11ec-8550-fa163e6d6e43:1-1407359,\\n63648d0d-ebe7-11ec-b9ee-fa163eda3a6e:1-3756,\\n64b94f66-ebe7-11ec-b57f-00163e095602:1-188', 'ON', '', '', 0, 0, 'mysql-bin.000008', 4274259, 'mysql-bin.000008', 4167229, 'mysql-relay-bin.000003', 4167362, '', '\\\"error reconnecting to master \\'repler@192.168.xx.93:3310\\' - retry-time: 60  retries: 3\\\"', NULL, 246, 0, 0, '[]', '192.168.xx.93:3310', 'orc_dev_test', 'cn-north-1a', '123456Cloud', 'dev', 1, 0, 1, 1, 0, 1, 1, 0, 10000, 1, 1, 0, 0, 0, 'bj1-mysql-test-02_3310', 13211977, '', 0, '', '', '[]', '', 0, NOW())\n        ON DUPLICATE KEY UPDATE

s2

1
('192.168.xx.95', 3310, NOW(), NOW(), 1, 343268, 57903, '64b94f66-ebe7-11ec-b57f-00163e095602', '5.7.32-35-log', '5.7', 'Percona Server (GPL), Release 35, Revision 5688520', 0, 1, 'ROW', 'FULL', 1, 1, 'mysql-bin.000001', 472227305, '192.168.xx.93', 3310, 0, 0, 0, 0, 0, 1, 1, '635da820-ebe7-11ec-8550-fa163e6d6e43', '64b94f66-ebe7-11ec-b57f-00163e095602', '635da820-ebe7-11ec-8550-fa163e6d6e43:1-1407485,\\n63648d0d-ebe7-11ec-b9ee-fa163eda3a6e:1-3756,\\n64b94f66-ebe7-11ec-b57f-00163e095602:1-188', 'ON', '', '', 0, 0, 'mysql-bin.000008', 4211217, 'mysql-bin.000008', 4211217, 'mysql-relay-bin.000003', 4211350, '', '\\\"error reconnecting to master \\'repler@192.168.xx.93:3310\\' - retry-time: 1  retries: 1\\\"', NULL, 225, 0, 0, '[]', '192.168.xx.93:3310', 'orc_dev_test', 'cn-north-1a', '123456Cloud', 'dev', 1, 0, 1, 1, 0, 1, 1, 0, 10000, 1, 1, 0, 0, 0, 'bj1-mysql-test-03_3310', 51391021, '', 0, '', '', '[]', '', 0, NOW())\n        ON DUPLICATE KEY UPDATE\n

kill -9 主库
等待日志 ERROR WaitForSQLThreadUpToDate timeout on 192.168.xx.96:3310 after duration 10s
s1解锁, 此时仍然不解除s2延迟, 等待failover完成

Failover完成, ExecBinlogCoordinates大的S2成为新主库

查看S2 Executed_Gtid_Set

1
2
3
4
5
6
7
8
9
slave2 root@localhost 21:02:05 [(none)]> show master status;
+------------------+-----------+--------------+------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+------------------+-----------+--------------+------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
| mysql-bin.000001 | 472231613 | | | 635da820-ebe7-11ec-8550-fa163e6d6e43:1-1407485,
63648d0d-ebe7-11ec-b9ee-fa163eda3a6e:1-3756,
64b94f66-ebe7-11ec-b57f-00163e095602:1-200 |
+------------------+-----------+--------------+------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

启动旧主库, 查看Executed_Gtid_Set

1
2
3
4
5
6
7
8
9
master root@localhost 21:02:28 [(none)]> show master status;
+------------------+----------+--------------+------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+------------------+----------+--------------+------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
| mysql-bin.000009 | 274 | | | 635da820-ebe7-11ec-8550-fa163e6d6e43:1-1407665,
63648d0d-ebe7-11ec-b9ee-fa163eda3a6e:1-3756,
64b94f66-ebe7-11ec-b57f-00163e095602:1-188 |
+------------------+----------+--------------+------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

可以看到

  • 旧主库 635da820-ebe7-11ec-8550-fa163e6d6e43:1-1407665
  • 新主库(S2) 635da820-ebe7-11ec-8550-fa163e6d6e43:1-1407485

数据丢失了
日志就不贴了, 没什么意义, 有兴趣可以自己尝试复现

修改后功能验证

设置MinimiseDataLoss=true, 重启orc

官方没这个参数, 这是自己改代码加的参数

在s1, 执行以下语句, 锁住t1表

1
2
3
4
slave1 root@localhost 21:16:05 [test]> begin;lock table test.t1 read;
Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.00 sec)

此后, s1 由于被锁阻塞, ExecBinlogCoordinates不在变化, 但仍然可以正常接收主库发来的日志

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
slave1 root@localhost 21:16:21 [(none)]> pager cat - | grep -E 'Master_Log_File|Relay_Master_Log_File|Read_Master_Log_Pos|Exec_Master_Log_Pos|Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master|Slave_SQL_Running_State|Last|Relay_Log_File|Relay_Log_Pos'
PAGER set to 'cat - | grep -E 'Master_Log_File|Relay_Master_Log_File|Read_Master_Log_Pos|Exec_Master_Log_Pos|Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master|Slave_SQL_Running_State|Last|Relay_Log_File|Relay_Log_Pos''
slave1 root@localhost 21:18:08 [(none)]> show slave status\G
Master_Log_File: mysql-bin.000009
Read_Master_Log_Pos: 1411494
Relay_Log_File: mysql-relay-bin.000003
Relay_Log_Pos: 1259657
Relay_Master_Log_File: mysql-bin.000009
Slave_IO_Running: Connecting
Slave_SQL_Running: Yes
Last_Errno: 0
Last_Error:
Exec_Master_Log_Pos: 1274711
Seconds_Behind_Master: 121
Last_IO_Errno: 2003
Last_IO_Error: error reconnecting to master 'repler@192.168.xx.93:3310' - retry-time: 60 retries: 1
Last_SQL_Errno: 0
Last_SQL_Error:
Slave_SQL_Running_State: Waiting for dependent transaction to commit
Last_IO_Error_Timestamp: 220618 21:17:16
Last_SQL_Error_Timestamp:
1 row in set (0.00 sec)

在s2, 执行以下语句, 模拟s2与主库网络存在异常, 无法接受主库发来的日志

1
2
3
4
5
6
7
tc qdisc del dev eth0 root
tc qdisc add dev eth0 root handle 1: prio
tc filter add dev eth0 protocol ip parent 1: prio 1 u32 match ip dst 192.168.xx.93 match ip dport 3310 0xffff flowid 1:1
tc filter add dev eth0 protocol all parent 1: prio 2 u32 match ip dst 0.0.0.0/0 flowid 1:2
tc filter add dev eth0 protocol all parent 1: prio 2 u32 match ip protocol 1 0xff flowid 1:2
tc qdisc add dev eth0 parent 1:1 handle 10: netem delay 180000ms
tc qdisc add dev eth0 parent 1:2 handle 20: sfq
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
slave2 root@localhost 21:16:45 [(none)]> pager cat - | grep -E 'Master_Log_File|Relay_Master_Log_File|Read_Master_Log_Pos|Exec_Master_Log_Pos|Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master|Slave_SQL_Running_State|Last|Relay_Log_File|Relay_Log_Pos'
PAGER set to 'cat - | grep -E 'Master_Log_File|Relay_Master_Log_File|Read_Master_Log_Pos|Exec_Master_Log_Pos|Slave_IO_Running|Slave_SQL_Running|Seconds_Behind_Master|Slave_SQL_Running_State|Last|Relay_Log_File|Relay_Log_Pos''
slave2 root@localhost 21:18:15 [(none)]> show slave status\G
Master_Log_File: mysql-bin.000009
Read_Master_Log_Pos: 1349240
Relay_Log_File: mysql-relay-bin.000003
Relay_Log_Pos: 1335560
Relay_Master_Log_File: mysql-bin.000009
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Last_Errno: 0
Last_Error:
Exec_Master_Log_Pos: 1349240
Seconds_Behind_Master: 0
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
1 row in set (0.00 sec)
1
2
3
4
5
6
7
8
9
10
11
12
S1

Master_Log_File: mysql-bin.000009
Read_Master_Log_Pos: 1411494
Relay_Master_Log_File: mysql-bin.000009
Exec_Master_Log_Pos: 1274711

S2
Master_Log_File: mysql-bin.000009
Read_Master_Log_Pos: 1349240
Relay_Master_Log_File: mysql-bin.000009
Exec_Master_Log_Pos: 1349240

可以看到

  • ExecBinlogCoordinates: S2>S1
  • ReadBinlogCoordinates: S1>S2

查看orchestrator日志

1
2
3
2022-06-18 21:20:20 INFO Stopped replication on 192.168.xx.96:3310, Self:mysql-bin.000001:473561005, Exec:mysql-bin.000009:1274711
2022-06-18 21:20:20 DEBUG - sorted replica: 192.168.xx.96:3310 mysql-bin.000009:1274711
2022-06-18 21:20:20 DEBUG - sorted replica: 192.168.22.95:3310 mysql-bin.000009:1349240

在审计日志搜索1274711和1349240

S1

1
('192.168.xx.96', 3310, NOW(), NOW(), 1, 344416, 14210, '63648d0d-ebe7-11ec-b9ee-fa163eda3a6e', '5.7.32-35-log', '5.7', 'Percona Server (GPL), Release 35, Revision 5688520', 0, 1, 'ROW', 'FULL', 1, 1, 'mysql-bin.000001', 473561005, '192.168.xx.93', 3310, 0, 0, 0, 0, 0, 1, 1, '635da820-ebe7-11ec-8550-fa163e6d6e43', '635da820-ebe7-11ec-8550-fa163e6d6e43,63648d0d-ebe7-11ec-b9ee-fa163eda3a6e', '635da820-ebe7-11ec-8550-fa163e6d6e43:1-1411198,\\n63648d0d-ebe7-11ec-b9ee-fa163eda3a6e:1-3756,\\n64b94f66-ebe7-11ec-b57f-00163e095602:1-230', 'ON', '', '', 0, 0, 'mysql-bin.000009', 1411494, 'mysql-bin.000009', 1274711, 'mysql-relay-bin.000003', 1261031, '', '\\\"error reconnecting to master \\'repler@192.168.xx.93:3310\\' - retry-time: 60  retries: 3\\\"', NULL, 252, 0, 0, '[]', '192.168.xx.93:3310', 'orc_test_dev', 'cn-north-1a', '123456Cloud', 'dev', 1, 0, 1, 1, 0, 1, 1, 0, 10000, 1, 1, 0, 0, 0, 'bj1-mysql-test-dev-02_3310', 7597010, '', 0, '', '', '[]', '', 0, NOW())\n        ON DUPLICATE KEY UPDATE\n

S2

1
('192.168.22.95', 3310, NOW(), NOW(), 1, 344415, 57903, '64b94f66-ebe7-11ec-b57f-00163e095602', '5.7.32-35-log', '5.7', 'Percona Server (GPL), Release 35, Revision 5688520', 0, 1, 'ROW', 'FULL', 1, 1, 'mysql-bin.000001', 473527169, '192.168.xx.93', 3310, 0, 0, 0, 0, 0, 1, 1, '635da820-ebe7-11ec-8550-fa163e6d6e43', '635da820-ebe7-11ec-8550-fa163e6d6e43,64b94f66-ebe7-11ec-b57f-00163e095602', '635da820-ebe7-11ec-8550-fa163e6d6e43:1-1411411,\\n63648d0d-ebe7-11ec-b9ee-fa163eda3a6e:1-3756,\\n64b94f66-ebe7-11ec-b57f-00163e095602:1-230', 'ON', '', '', 0, 0, 'mysql-bin.000009', 1349240, 'mysql-bin.000009', 1349240, 'mysql-relay-bin.000003', 1335560, '', '\\\"error reconnecting to master \\'repler@192.168.xx.93:3310\\' - retry-time: 60  retries: 1\\\"', NULL, 221, 0, 0, '[]', '192.168.xx.93:3310', 'orc_test_dev', 'cn-north-1a', '123456Cloud', 'dev', 1, 0, 1, 1, 0, 1, 1, 0, 10000, 1, 1, 0, 0, 0, 'hb3a-mysql-test-dev-03_3310', 75173733, '', 0, '', '', '[]', '', 0, NOW())\n        ON DUPLICATE KEY UPDATE\n

kill -9主库

等待日志 ERROR WaitForSQLThreadUpToDate timeout on 192.168.xx.96:3310 after duration 10s

s1解锁, 此时仍然不解除s2延迟, 等待failover完成

Failover完成, ReadBinlogCoordinates大的S1成为新主库

查看S1 Executed_Gtid_Set

1
2
3
4
5
6
7
8
9
master root@localhost 21:21:06 [(none)]> show master status;
+------------------+----------+--------------+------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+------------------+----------+--------------+------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
| mysql-bin.000010 | 274 | | | 635da820-ebe7-11ec-8550-fa163e6d6e43:1-1411589,
63648d0d-ebe7-11ec-b9ee-fa163eda3a6e:1-3756,
64b94f66-ebe7-11ec-b57f-00163e095602:1-230 |
+------------------+----------+--------------+------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

启动旧主库, 查看Executed_Gtid_Set

1
2
3
4
5
6
7
8
9
slave1 root@localhost 21:20:20 [infra_orc_test]> show master status;
+------------------+-----------+--------------+------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
+------------------+-----------+--------------+------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
| mysql-bin.000001 | 473771250 | | | 635da820-ebe7-11ec-8550-fa163e6d6e43:1-1411589,
63648d0d-ebe7-11ec-b9ee-fa163eda3a6e:1-3991,
64b94f66-ebe7-11ec-b57f-00163e095602:1-230 |
+------------------+-----------+--------------+------------------+-----------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

可以看到

  • 旧主库 635da820-ebe7-11ec-8550-fa163e6d6e43:1-1411589
  • 新主库(S1) 635da820-ebe7-11ec-8550-fa163e6d6e43:1-1411589
    数据没有丢失!

Powered by Hexo and Hexo-theme-hiker

Copyright © 2013 - 2022 Fan() All Rights Reserved.

访客数 : | 访问量 :