server_id相同导致graceful-master-takeover-auto失败

2022-09-03 阅读量

背景

同事要做一个数据库迁移, 方法就是加从库, 然后主从切换, 再下掉老主库, 完成迁移
原本是一主两从集群, 增加三个从库变为一主五从

使用命令切换时报错(以下是本地测试复现的命令)

1
2

resources/bin/orchestrator-client -c graceful-master-takeover-auto -alias orc_prod_infra_broker -d 172.20.0.14:3306 -b 'superuser:superpass'
Desginated instance 172.20.0.14:3306 cannot take over all of its siblings. Error: 2022-09-02 15:33:27 ERROR Relocating 2 replicas of 172.20.0.10:3306 below 172.20.0.14:3306 turns to be too complex; please do it manually

问题复现

其实还没来得及翻代码, 同事就已经发现新加的三个从库server_id与原集群提供备份节点的server_id是一样的, 所以基本猜测就是server_id导致

构建本地测试环境

这里我部署1主4从, 虽然比线上实际少了一个从库, 但不影响复现问题.
这个集群中, 我们将以下三个节点server_id都设置成101:

172.20.0.11
172.20.0.13

172.20.0.14

root@localhost 16:07:38 [orchestrator]> select instance_alias, hostname, port, server_id from database_instance;
+---------------------------+-------------+------+-----------+
| instance_alias            | hostname    | port | server_id |
+---------------------------+-------------+------+-----------+
| local_docker_master_3306  | 172.20.0.10 | 3306 |         1 |
| local_docker_slave_1_3306 | 172.20.0.11 | 3306 |       101 |
| local_docker_slave_2_3306 | 172.20.0.12 | 3306 |       102 |
| local_docker_slave_3_3306 | 172.20.0.13 | 3306 |       101 |
| local_docker_slave_4_3306 | 172.20.0.14 | 3306 |       101 |
+---------------------------+-------------+------+-----------+
5 rows in set (0.01 sec)

使用命令切换

1
2

resources/bin/orchestrator-client -c graceful-master-takeover-auto -alias orc_prod_infra_broker -d 172.20.0.14:3306 -b 'superuser:superpass'
Desginated instance 172.20.0.14:3306 cannot take over all of its siblings. Error: 2022-09-02 15:33:27 ERROR Relocating 2 replicas of 172.20.0.10:3306 below 172.20.0.14:3306 turns to be too complex; please do it manually

日志

[martini] Started GET /api/graceful-master-takeover-auto/orc_prod_infra_broker/172.20.0.14/3306 for 127.0.0.1:51897
2022-09-02 15:33:26 INFO GracefulMasterTakeover: designated master instructed to be 172.20.0.14:3306
2022-09-02 15:33:26 INFO GracefulMasterTakeover: Will let 172.20.0.14:3306 take over its siblings
2022-09-02 15:33:26 INFO moveReplicasViaGTID: Will move 3 replicas below 172.20.0.14:3306 via GTID, max concurrency: 5
2022-09-02 15:33:26 INFO Will move 172.20.0.12:3306 below 172.20.0.14:3306 via GTID
2022-09-02 15:33:26 INFO Stopped replication on 172.20.0.12:3306, Self:mysql-bin.000001:2997164, Exec:mysql-bin.000003:4032
2022-09-02 15:33:26 DEBUG ChangeMasterTo: will attempt changing master on 172.20.0.12:3306 to 172.20.0.14:3306, mysql-bin.000001:2997164
2022-09-02 15:33:27 INFO ChangeMasterTo: Changed master on 172.20.0.12:3306 to: 172.20.0.14:3306, mysql-bin.000001:2997164. GTID: true
2022-09-02 15:33:27 DEBUG semi-sync: analysis results for recovery of cluster 172.20.0.10:3306:
2022-09-02 15:33:27 DEBUG semi-sync: master = 172.20.0.14:3306, master semi-sync wait count = 0, master semi-sync replica count = 0
2022-09-02 15:33:27 DEBUG semi-sync: possible semi-sync replicas (in priority order): (none)
2022-09-02 15:33:27 DEBUG semi-sync: always-async replicas: (none)
2022-09-02 15:33:27 DEBUG semi-sync: excluded replicas (defunct): (none)
2022-09-02 15:33:27 DEBUG semi-sync: suggested actions: (none)
2022-09-02 15:33:27 INFO semi-sync: 172.20.0.12:3306: no action taken; this may lead to future recoveries
2022-09-02 15:33:27 INFO Started replication on 172.20.0.12:3306
2022-09-02 15:33:27 INFO moveReplicasViaGTID: Will move 2 replicas below 172.20.0.14:3306 via GTID, max concurrency: 5
2022-09-02 15:33:27 ERROR Relocating 2 replicas of 172.20.0.10:3306 below 172.20.0.14:3306 turns to be too complex; please do it manually
[martini] Completed 500 Internal Server Error in 581.39425ms
[martini] Started GET /web/cluster/alias/orc_prod_infra_broker for 127.0.0.1:51672

问题定位

通过对日志中关键字的搜索, 发现

1	Relocating 2 replicas of 172.20.0.10:3306 below 172.20.0.14:3306 turns to be too complex; please do it manually

以下日志只会在relocateReplicasInternal函数中中输出

那么是什么导致movereplicasViaGTID报错呢? 继续梳理代码可以发现
moveReplicasViaGTID 会调用 moveInstanceBelowViaGTID,
后者会调用 CanReplicateFrom 检查 A是否可以做B的从库

而这里就会判断server_id是否相等

原生orchestrator的问题

原生的问题是, 没有输出这段错误信息

想解决也很简单, 只修改以下两处代码

[martini] Started GET /api/graceful-master-takeover-auto/orc_prod_infra_broker/172.20.0.14/3306 for 127.0.0.1:60996
2022-09-02 17:21:00 INFO GracefulMasterTakeover: designated master instructed to be 172.20.0.14:3306
2022-09-02 17:21:00 INFO GracefulMasterTakeover: Will let 172.20.0.14:3306 take over its siblings
2022-09-02 17:21:00 INFO moveReplicasViaGTID: Will move 3 replicas below 172.20.0.14:3306 via GTID, max concurrency: 5
2022-09-02 17:21:00 INFO Will move 172.20.0.12:3306 below 172.20.0.14:3306 via GTID
2022-09-02 17:21:00 INFO Stopped replication on 172.20.0.12:3306, Self:mysql-bin.000001:2997164, Exec:mysql-bin.000003:4032
2022-09-02 17:21:00 INFO ChangeMasterTo: Changed master on 172.20.0.12:3306 to: 172.20.0.14:3306, mysql-bin.000001:2997164. GTID: true
2022-09-02 17:21:00 INFO semi-sync: 172.20.0.12:3306: no action taken; this may lead to future recoveries
2022-09-02 17:21:00 INFO Started replication on 172.20.0.12:3306
2022-09-02 17:21:00 INFO moveReplicasViaGTID: Will move 2 replicas below 172.20.0.14:3306 via GTID, max concurrency: 5
2022-09-02 17:21:00 ERROR Relocating 2 replicas of 172.20.0.10:3306 below 172.20.0.14:3306 turns to be too complex; please do it manually
2022-09-02 17:21:00 ERROR Identical server id: 172.20.0.14:3306, 172.20.0.11:3306 both have 101
2022-09-02 17:21:00 ERROR Identical server id: 172.20.0.14:3306, 172.20.0.13:3306 both have 101
[martini] Completed 500 Internal Server Error in 402.777083ms