MHA Failover Testing

TL;DR

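| Test case | ping_type=CONNECT | ping_type=INSERT |
| --- | --- | --- |
| master has too many connections | No failover | No failover |
| master hang | No failover | Failover triggered and succeeded |
| Only the manager cannot reach the master | No failover | No failover |
| manager cannot reach the master and cannot ssh to slave1 | No failover | No failover |
| manager cannot reach the master and cannot ssh to slave1 or slave2 | No failover | No failover |
| manager cannot reach the master; slave1 (via ssh) cannot reach the master either | No failover | No failover |
| manager cannot reach the master; neither slave1 nor slave2 (via ssh) can reach the master | Failover triggered and succeeded | Failover triggered and succeeded (only after the persistent connection is dropped) |
| slave1 was already down before the master went down | Failover triggered, but failed | Failover triggered, but failed |
| master down; slave1 io_thread had been stopped beforehand | Failover triggered and succeeded | Failover triggered and succeeded |
| master down; slave1 io_thread had hit an error beforehand | Failover triggered and succeeded | Failover triggered and succeeded |
| master down; slave1 sql_thread had been stopped beforehand | Failover triggered and succeeded | Failover triggered and succeeded |
| master down; slave1 sql_thread had hit an error beforehand | Failover triggered, but failed | Failover triggered, but failed |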
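
The trigger pattern in the table can be summarized as a small decision rule. The sketch below is a simplified model of the observed outcomes, not MHA's actual code: the function `should_start_failover`, its parameters, and the three state constants are hypothetical names introduced for illustration. It only models whether failover is triggered, not whether it later succeeds.

```python
# Hypothetical states for what each secondary-check host reports.
REACHABLE = "master_reachable"      # host was reached via ssh and can reach the master
UNREACHABLE = "master_unreachable"  # host was reached via ssh but cannot reach the master
SSH_FAILED = "ssh_failed"           # the manager could not ssh to the host at all


def should_start_failover(manager_ping_ok, secondary_results, error_is_crash=True):
    """Simplified model of the trigger decision observed in the table.

    manager_ping_ok:   True if the manager's own health check succeeds.
    secondary_results: one state per host given to the secondary check script.
    error_is_crash:    False for errors such as 1040 (Too many connections),
                       which the manager log reports as "not a MySQL crash".
    """
    if manager_ping_ok:
        return False  # the master looks healthy
    if not error_is_crash:
        return False  # e.g. error 1040: the manager keeps checking instead
    # Failover starts only when every secondary host was reached
    # and every one of them reports the master as unreachable.
    return bool(secondary_results) and all(
        state == UNREACHABLE for state in secondary_results
    )


# Only the manager cannot reach the master.
print(should_start_failover(False, [REACHABLE, REACHABLE]))
# The manager and both slaves cannot reach the master.
print(should_start_failover(False, [UNREACHABLE, UNREACHABLE]))
```

The table does not cover the mixed case where one slave cannot be reached by ssh while the other reports the master as unreachable; treating it as "no failover" here is an assumption.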

Environment

master: 172.16.120.10 centos-1 master + proxysql
slave1: 172.16.120.11 centos-2 slave + proxysql
slave2: 172.16.120.12 centos-3 slave + proxysql
manager: 172.16.120.13 centos-4 MHA manager

MHA configuration

#cat /etc/masterha/conf/masterha_default.cnf 
[server default]
# MySQL user and password; the password here must not be quoted
user=mha
password=xxxx

#replication_user
repl_user=repler
repl_password=xxxx

#checking master every 3 second
ping_interval=3

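# ping_type=CONNECT checks with a short-lived connection on every ping; the default uses a persistent connection
ping_type=INSERT
#ping_type=CONNECT
# Both types are tested below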

#ssh user
ssh_user=root

# Script that sends the report email
report_script=/etc/masterha/scripts/send_report

# Working directory on each node
remote_workdir=/masterha/


#cat /etc/masterha/conf/cls_new.cnf
[server default]
#workdir on the management server
manager_workdir=/masterha/cls_new/
manager_log=/masterha/cls_new/manager.log

#workdir on the node for mysql server
master_binlog_dir=/data/mysql_3358/data/

# Script called to move the VIP during automatic failover
master_ip_failover_script=/etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128

# Script called for a manual (online) master switch
master_ip_online_change_script=/etc/masterha/scripts/master_ip_online_change_vip --vip=172.16.120.128

# Secondary check of master availability
secondary_check_script=masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12

[server1]
hostname=172.16.120.10
port=3358
candidate_master=1

[server2]
hostname=172.16.120.11
port=3358
candidate_master=1

[server3]
hostname=172.16.120.12
port=3358
candidate_master=1
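
When comparing runs with different settings, it can help to print the handful of values that matter for these tests. The following is a minimal sketch that reads an MHA-style configuration with Python's standard `configparser`. The `summarize` helper and the inlined `SAMPLE_CNF` text are illustrative assumptions; MHA itself merges the global and application files, which this sketch does not attempt.

```python
import configparser
import shlex

# A shortened copy of the settings shown above, inlined so the sketch is self-contained.
SAMPLE_CNF = """
[server default]
ping_interval=3
ping_type=INSERT
secondary_check_script=masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12

[server1]
hostname=172.16.120.10
port=3358
candidate_master=1

[server2]
hostname=172.16.120.11
port=3358
candidate_master=1
"""


def summarize(cnf_text):
    """Return the health-check settings and server list from one config text."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(cnf_text)
    defaults = parser["server default"]

    # Collect the hosts passed with -s to the secondary check script.
    args = shlex.split(defaults.get("secondary_check_script", ""))
    check_hosts = [args[i + 1] for i, arg in enumerate(args[:-1]) if arg == "-s"]

    servers = {
        name: f'{parser[name]["hostname"]}:{parser[name]["port"]}'
        for name in parser.sections()
        if name != "server default"
    }
    return {
        "ping_type": defaults.get("ping_type"),
        "ping_interval": defaults.getint("ping_interval"),
        "secondary_check_hosts": check_hosts,
        "servers": servers,
    }


print(summarize(SAMPLE_CNF))
```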

[Test case] master too many connections

ping_type=CONNECT

root@localhost 11:43:29 [dbms_monitor]> show processlist;
+------+----------+---------------------+--------------------+------------------+--------+---------------------------------------------------------------+------------------+-----------+---------------+
| Id | User | Host | db | Command | Time | State | Info | Rows_sent | Rows_examined |
+------+----------+---------------------+--------------------+------------------+--------+---------------------------------------------------------------+------------------+-----------+---------------+
| 1248 | proxysql | 172.16.120.10:58672 | NULL | Sleep | 7 | | NULL | 0 | 0 |
| 1249 | proxysql | 172.16.120.11:59582 | NULL | Sleep | 4 | | NULL | 0 | 0 |
| 1250 | proxysql | 172.16.120.12:56530 | NULL | Sleep | 2 | | NULL | 0 | 0 |
| 1254 | proxysql | 172.16.120.11:59592 | NULL | Sleep | 14 | | NULL | 0 | 0 |
| 1256 | repler | 172.16.120.11:59594 | NULL | Binlog Dump GTID | 952922 | Master has sent all binlog to slave; waiting for more updates | NULL | 0 | 0 |
| 1257 | repler | 172.16.120.12:56540 | NULL | Binlog Dump GTID | 952902 | Master has sent all binlog to slave; waiting for more updates | NULL | 0 | 0 |
| 1264 | proxysql | 172.16.120.12:56552 | NULL | Sleep | 2 | | NULL | 1 | 0 |
| 1341 | mha | 172.16.120.12:56698 | information_schema | Sleep | 120 | | NULL | 0 | 0 |
| 1343 | mha | 172.16.120.11:59758 | information_schema | Sleep | 58 | | NULL | 0 | 0 |
| 1452 | proxysql | 172.16.120.10:59046 | NULL | Sleep | 17 | | NULL | 1 | 0 |
| 1943 | root | localhost | dbms_monitor | Query | 0 | starting | show processlist | 0 | 0 |
+------+----------+---------------------+--------------------+------------------+--------+---------------------------------------------------------------+------------------+-----------+---------------+
11 rows in set (0.00 sec)

root@localhost 11:43:30 [dbms_monitor]> show global variables like '%max_connec%';
+-----------------------+---------+
| Variable_name | Value |
+-----------------------+---------+
| extra_max_connections | 1 |
| max_connect_errors | 1000000 |
| max_connections | 1024 |
+-----------------------+---------+
3 rows in set (0.01 sec)

root@localhost 11:49:34 [dbms_monitor]> set global max_connections=5;
Query OK, 0 rows affected (0.01 sec);

Conclusion: no failover is triggered.

Fri Oct  9 11:42:57 2020 - [info] Checking master_ip_failover_script status:
Fri Oct 9 11:42:57 2020 - [info] /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --command=status --ssh_user=root --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358
Fri Oct 9 11:42:57 2020 - [info] OK.
Fri Oct 9 11:42:57 2020 - [warning] shutdown_script is not defined.
Fri Oct 9 11:42:57 2020 - [info] Set master ping interval 3 seconds.
Fri Oct 9 11:42:57 2020 - [info] Set secondary check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12
Fri Oct 9 11:42:57 2020 - [info] Starting ping health check on 172.16.120.10(172.16.120.10:3358)..
Fri Oct 9 11:42:57 2020 - [info] Ping(CONNECT) succeeded, waiting until MySQL doesn't respond..
Fri Oct 9 11:49:51 2020 - [warning] Got error on MySQL connect ping: DBI connect(';host=172.16.120.10;port=3358;mysql_connect_timeout=1','mha',...) failed: Too many connections at /usr/local/share/perl5/MHA/HealthCheck.pm line 98.
1040 (Too many connections)
Fri Oct 9 11:49:51 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12 --user=root --master_host=172.16.120.10 --master_ip=172.16.120.10 --master_port=3358 --master_user=mha --master_password=xxx --ping_type=CONNECT
Fri Oct 9 11:49:51 2020 - [info] Executing SSH check script: exit 0
Fri Oct 9 11:49:51 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
Master is reachable from 172.16.120.11!
Fri Oct 9 11:49:52 2020 - [warning] Master is reachable from at least one of other monitoring servers. Failover should not happen.
Fri Oct 9 11:49:54 2020 - [warning] Got error on MySQL connect: 1040 (Too many connections)
Fri Oct 9 11:49:54 2020 - [info] Got MySQL error 1040, but this is not a MySQL crash. Continue health check..
Fri Oct 9 11:49:57 2020 - [warning] Got error on MySQL connect: 1040 (Too many connections)
Fri Oct 9 11:49:57 2020 - [info] Got MySQL error 1040, but this is not a MySQL crash. Continue health check..
Fri Oct 9 11:50:00 2020 - [warning] Got error on MySQL connect: 1040 (Too many connections)
Fri Oct 9 11:50:00 2020 - [info] Got MySQL error 1040, but this is not a MySQL crash. Continue health check..
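
The warning lines in this log have a regular shape, so the MySQL error codes can be pulled out mechanically when reviewing a long manager log. A minimal sketch follows; the regular expression, the `extract_errors` helper, and the `NON_CRASH_ERRORS` set are assumptions based only on the lines shown in this post (the log itself states that error 1040 is "not a MySQL crash"), not a list taken from MHA.

```python
import re

# Matches manager log lines such as:
#   "... - [warning] Got error on MySQL connect: 1040 (Too many connections)"
ERROR_RE = re.compile(r"\[warning\] Got error on MySQL [\w ]+: (\d+) \((.+)\)\s*$")

# Illustrative only: the one code this log shows being treated as "not a crash".
NON_CRASH_ERRORS = {1040}


def extract_errors(log_text):
    """Return (code, message, may_be_crash) for each matching warning line."""
    found = []
    for line in log_text.splitlines():
        match = ERROR_RE.search(line)
        if match:
            code = int(match.group(1))
            found.append((code, match.group(2), code not in NON_CRASH_ERRORS))
    return found


sample = (
    "Fri Oct 9 11:49:54 2020 - [warning] Got error on MySQL connect: "
    "1040 (Too many connections)"
)
print(extract_errors(sample))
```

Lines that carry the error in a different shape, such as the first DBI connect warning in the log, are simply skipped by this pattern.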

ping_type=INSERT

root@localhost 11:55:13 [dbms_monitor]> show processlist;
+------+----------+---------------------+--------------------+------------------+--------+---------------------------------------------------------------+------------------+-----------+---------------+
| Id | User | Host | db | Command | Time | State | Info | Rows_sent | Rows_examined |
+------+----------+---------------------+--------------------+------------------+--------+---------------------------------------------------------------+------------------+-----------+---------------+
| 1248 | proxysql | 172.16.120.10:58672 | NULL | Sleep | 1 | | NULL | 0 | 0 |
| 1249 | proxysql | 172.16.120.11:59582 | NULL | Sleep | 18 | | NULL | 1 | 0 |
| 1250 | proxysql | 172.16.120.12:56530 | NULL | Sleep | 16 | | NULL | 0 | 0 |
| 1254 | proxysql | 172.16.120.11:59592 | NULL | Sleep | 8 | | NULL | 0 | 0 |
| 1256 | repler | 172.16.120.11:59594 | NULL | Binlog Dump GTID | 953626 | Master has sent all binlog to slave; waiting for more updates | NULL | 0 | 0 |
| 1257 | repler | 172.16.120.12:56540 | NULL | Binlog Dump GTID | 953606 | Master has sent all binlog to slave; waiting for more updates | NULL | 0 | 0 |
| 1264 | proxysql | 172.16.120.12:56552 | NULL | Sleep | 6 | | NULL | 1 | 0 |
| 1341 | mha | 172.16.120.12:56698 | information_schema | Sleep | 103 | | NULL | 0 | 0 |
| 1343 | mha | 172.16.120.11:59758 | information_schema | Sleep | 41 | | NULL | 0 | 0 |
| 1452 | proxysql | 172.16.120.10:59046 | NULL | Sleep | 1 | | NULL | 1 | 0 |
| 1943 | root | localhost | dbms_monitor | Query | 0 | starting | show processlist | 0 | 0 |
| 2160 | mha | 172.16.120.13:34660 | NULL | Sleep | 2 | | NULL | 0 | 0 |
+------+----------+---------------------+--------------------+------------------+--------+---------------------------------------------------------------+------------------+-----------+---------------+
12 rows in set (0.00 sec)

root@localhost 11:55:14 [dbms_monitor]> show global variables like '%max_connec%';
+-----------------------+---------+
| Variable_name | Value |
+-----------------------+---------+
| extra_max_connections | 1 |
| max_connect_errors | 1000000 |
| max_connections | 1024 |
+-----------------------+---------+
3 rows in set (0.04 sec)

root@localhost 11:55:19 [dbms_monitor]> set global max_connections=5;
Query OK, 0 rows affected (0.00 sec)

root@localhost 11:55:25 [dbms_monitor]> show processlist;
+------+----------+---------------------+--------------------+------------------+--------+---------------------------------------------------------------+------------------+-----------+---------------+
| Id | User | Host | db | Command | Time | State | Info | Rows_sent | Rows_examined |
+------+----------+---------------------+--------------------+------------------+--------+---------------------------------------------------------------+------------------+-----------+---------------+
| 1248 | proxysql | 172.16.120.10:58672 | NULL | Sleep | 6 | | NULL | 0 | 0 |
| 1249 | proxysql | 172.16.120.11:59582 | NULL | Sleep | 3 | | NULL | 0 | 0 |
| 1250 | proxysql | 172.16.120.12:56530 | NULL | Sleep | 31 | | NULL | 0 | 0 |
| 1254 | proxysql | 172.16.120.11:59592 | NULL | Sleep | 3 | | NULL | 1 | 0 |
| 1256 | repler | 172.16.120.11:59594 | NULL | Binlog Dump GTID | 953641 | Master has sent all binlog to slave; waiting for more updates | NULL | 0 | 0 |
| 1257 | repler | 172.16.120.12:56540 | NULL | Binlog Dump GTID | 953621 | Master has sent all binlog to slave; waiting for more updates | NULL | 0 | 0 |
| 1264 | proxysql | 172.16.120.12:56552 | NULL | Sleep | 0 | | NULL | 0 | 0 |
| 1341 | mha | 172.16.120.12:56698 | information_schema | Sleep | 118 | | NULL | 0 | 0 |
| 1343 | mha | 172.16.120.11:59758 | information_schema | Sleep | 56 | | NULL | 0 | 0 |
| 1452 | proxysql | 172.16.120.10:59046 | NULL | Sleep | 6 | | NULL | 1 | 0 |
| 1943 | root | localhost | dbms_monitor | Query | 0 | starting | show processlist | 0 | 0 |
| 2160 | mha | 172.16.120.13:34660 | NULL | Sleep | 2 | | NULL | 0 | 0 |
+------+----------+---------------------+--------------------+------------------+--------+---------------------------------------------------------------+------------------+-----------+---------------+
12 rows in set (0.00 sec)

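With ping_type=INSERT the manager holds a persistent connection, so it does not notice the too many connections condition: the existing connection keeps working.

Manually kill the MHA connection: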

root@localhost 11:55:29 [dbms_monitor]> kill 2160;
Query OK, 0 rows affected (0.01 sec)

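Conclusion: no failover is triggered. The secondary check reports the master as not writable, but each reconnect attempt by the manager returns error 1040, which MHA does not treat as a MySQL crash.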

Fri Oct  9 11:54:48 2020 - [info] Checking master_ip_failover_script status:
Fri Oct 9 11:54:48 2020 - [info] /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --command=status --ssh_user=root --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358
Fri Oct 9 11:54:48 2020 - [info] OK.
Fri Oct 9 11:54:48 2020 - [warning] shutdown_script is not defined.
Fri Oct 9 11:54:48 2020 - [info] Set master ping interval 3 seconds.
Fri Oct 9 11:54:48 2020 - [info] Set secondary check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12
Fri Oct 9 11:54:48 2020 - [info] Starting ping health check on 172.16.120.10(172.16.120.10:3358)..
Fri Oct 9 11:54:48 2020 - [info] Ping(INSERT) succeeded, waiting until MySQL doesn't respond..
Fri Oct 9 11:56:42 2020 - [warning] Got error on MySQL insert ping: 2006 (MySQL server has gone away)
Fri Oct 9 11:56:42 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12 --user=root --master_host=172.16.120.10 --master_ip=172.16.120.10 --master_port=3358 --master_user=mha --master_password=xxx --ping_type=INSERT
Fri Oct 9 11:56:42 2020 - [info] Executing SSH check script: exit 0
Fri Oct 9 11:56:43 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
ERROR 1040 (HY000): Too many connections
Monitoring server 172.16.120.11 is reachable, Master is not writable from 172.16.120.11. OK.
ERROR 1040 (HY000): Too many connections
Monitoring server 172.16.120.12 is reachable, Master is not writable from 172.16.120.12. OK.
Fri Oct 9 11:56:43 2020 - [info] Master is not reachable from all other monitoring servers. Failover should start.
Fri Oct 9 11:56:45 2020 - [warning] Got error on MySQL connect: 1040 (Too many connections)
Fri Oct 9 11:56:45 2020 - [info] Got MySQL error 1040, but this is not a MySQL crash. Continue health check..
Fri Oct 9 11:56:48 2020 - [warning] Got error on MySQL connect: 1040 (Too many connections)
Fri Oct 9 11:56:48 2020 - [info] Got MySQL error 1040, but this is not a MySQL crash. Continue health check..
Fri Oct 9 11:56:51 2020 - [warning] Got error on MySQL connect: 1040 (Too many connections)
Fri Oct 9 11:56:51 2020 - [info] Got MySQL error 1040, but this is not a MySQL crash. Continue health check..
Fri Oct 9 11:56:54 2020 - [warning] Got error on MySQL connect: 1040 (Too many connections)
Fri Oct 9 11:56:54 2020 - [info] Got MySQL error 1040, but this is not a MySQL crash. Continue health check..
Fri Oct 9 11:56:57 2020 - [warning] Got error on MySQL connect: 1040 (Too many connections)
Fri Oct 9 11:56:57 2020 - [info] Got MySQL error 1040, but this is not a MySQL crash. Continue health check..
Fri Oct 9 11:57:00 2020 - [warning] Got error on MySQL connect: 1040 (Too many connections)
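
The difference between a persistent connection and a fresh connection under a lowered max_connections can be illustrated with a toy model. This is a sketch under simplifying assumptions, not MySQL's real behavior in every detail: `ToyServer` and `TooManyConnections` are hypothetical names, and the model ignores things like the extra connection MySQL reserves for privileged users.

```python
class TooManyConnections(Exception):
    """Stands in for MySQL error 1040."""


class ToyServer:
    def __init__(self, max_connections):
        self.max_connections = max_connections
        self.open_ids = set()
        self._next_id = 1

    def connect(self):
        # New connections are refused once the limit is reached.
        if len(self.open_ids) >= self.max_connections:
            raise TooManyConnections("1040 (Too many connections)")
        conn_id = self._next_id
        self._next_id += 1
        self.open_ids.add(conn_id)
        return conn_id

    def query(self, conn_id):
        # An already-open connection keeps working regardless of the limit.
        if conn_id not in self.open_ids:
            raise ConnectionError("2006 (MySQL server has gone away)")
        return "ok"

    def kill(self, conn_id):
        self.open_ids.discard(conn_id)


server = ToyServer(max_connections=1024)
others = [server.connect() for _ in range(11)]  # the other sessions in the processlist
mha = server.connect()                          # the manager's persistent connection
server.max_connections = 5                      # like: set global max_connections=5

print(server.query(mha))  # the persistent connection still answers
server.kill(mha)          # like: kill 2160
try:
    server.query(mha)
except ConnectionError as exc:
    print(exc)            # the killed connection reports "gone away"
try:
    server.connect()
except TooManyConnections as exc:
    print(exc)            # reconnecting is refused with 1040
```

This mirrors the sequence in the log: the first failure is 2006 on the killed connection, and every later attempt is a new connection that fails with 1040.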

[Test case] master hang

ping_type=CONNECT

A master hang is hard to simulate directly, so it is simulated indirectly here. The query executed by ping_select has to be changed from select 1 to a select against an InnoDB table:

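sub ping_select($) {
  my $self = shift;
  my $log  = $self->{logger};
  my $dbh  = $self->{dbh};
  my ( $query, $sth, $href );
  eval {
    $dbh->{RaiseError} = 1;
    # Original query:
    #$sth = $dbh->prepare("SELECT 1 As Value");
    # Modified query: reads an InnoDB table, so it is subject to innodb_thread_concurrency
    $sth = $dbh->prepare("SELECT 1 As Value from infra.chk_masterha limit 1");
    # ... (the remainder of the function is not shown)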

Then lower innodb_thread_concurrency:

root@localhost 12:25:34 [dbms_monitor]> set global innodb_thread_concurrency=1;
Query OK, 0 rows affected (0.00 sec)

Manually run a query against an InnoDB table so that MHA's select is blocked:

root@localhost 12:25:45 [dbms_monitor]> select sleep(600) from infra.chk_masterha limit 1;



root@localhost 12:29:09 [dbms_monitor]> show processlist;
+------+----------+---------------------+--------------------+------------------+--------+---------------------------------------------------------------+---------------------------------------------------+-----------+---------------+
| Id | User | Host | db | Command | Time | State | Info | Rows_sent | Rows_examined |
+------+----------+---------------------+--------------------+------------------+--------+---------------------------------------------------------------+---------------------------------------------------+-----------+---------------+
| 1248 | proxysql | 172.16.120.10:58672 | NULL | Sleep | 16 | | NULL | 1 | 0 |
| 1249 | proxysql | 172.16.120.11:59582 | NULL | Sleep | 4 | | NULL | 0 | 0 |
| 1250 | proxysql | 172.16.120.12:56530 | NULL | Sleep | 1 | | NULL | 0 | 0 |
| 1254 | proxysql | 172.16.120.11:59592 | NULL | Sleep | 4 | | NULL | 1 | 0 |
| 1256 | repler | 172.16.120.11:59594 | NULL | Binlog Dump GTID | 955662 | Master has sent all binlog to slave; waiting for more updates | NULL | 0 | 0 |
| 1257 | repler | 172.16.120.12:56540 | NULL | Binlog Dump GTID | 955642 | Master has sent all binlog to slave; waiting for more updates | NULL | 0 | 0 |
| 1264 | proxysql | 172.16.120.12:56552 | NULL | Sleep | 11 | | NULL | 1 | 0 |
| 1341 | mha | 172.16.120.12:56698 | information_schema | Sleep | 96 | | NULL | 0 | 0 |
| 1343 | mha | 172.16.120.11:59758 | information_schema | Sleep | 34 | | NULL | 0 | 0 |
| 1452 | proxysql | 172.16.120.10:59046 | NULL | Sleep | 6 | | NULL | 0 | 0 |
| 1943 | root | localhost | dbms_monitor | Query | 21 | User sleep | select sleep(600) from infra.chk_masterha limit 1 | 0 | 0 |
| 2260 | root | localhost | dbms_monitor | Query | 0 | starting | show processlist | 0 | 0 |
| 2303 | mha | 172.16.120.13:34982 | NULL | Query | 20 | Sending data | SELECT 1 As Value from infra.chk_masterha limit 1 | 0 | 0 |
| 2305 | mha | 172.16.120.13:34988 | NULL | Query | 17 | Sending data | SELECT 1 As Value from infra.chk_masterha limit 1 | 0 | 0 |
| 2308 | mha | 172.16.120.13:34994 | NULL | Query | 14 | Sending data | SELECT 1 As Value from infra.chk_masterha limit 1 | 0 | 0 |
| 2310 | mha | 172.16.120.13:34998 | NULL | Query | 11 | Sending data | SELECT 1 As Value from infra.chk_masterha limit 1 | 0 | 0 |
| 2312 | mha | 172.16.120.13:35002 | NULL | Query | 8 | Sending data | SELECT 1 As Value from infra.chk_masterha limit 1 | 0 | 0 |
| 2314 | mha | 172.16.120.13:35006 | NULL | Query | 5 | Sending data | SELECT 1 As Value from infra.chk_masterha limit 1 | 0 | 0 |
| 2317 | mha | 172.16.120.13:35010 | NULL | Query | 2 | Sending data | SELECT 1 As Value from infra.chk_masterha limit 1 | 0 | 0 |
+------+----------+---------------------+--------------------+------------------+--------+---------------------------------------------------------------+---------------------------------------------------+-----------+---------------+
19 rows in set (0.00 sec)
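
The blocking seen in this processlist can be pictured as a single slot that one long query holds while every ping waits for it. The sketch below is only an analogy using a Python semaphore, assuming the test setup behaves as the post intends; `innodb_slots` and `ping` are hypothetical names and nothing here talks to MySQL.

```python
import threading

# One "InnoDB slot", standing in for innodb_thread_concurrency=1.
innodb_slots = threading.Semaphore(1)


def ping(timeout_seconds):
    """Return True if the ping query obtained a slot within the timeout."""
    if not innodb_slots.acquire(timeout=timeout_seconds):
        return False  # the ping timed out, as the manager's ping does
    innodb_slots.release()
    return True


print(ping(0.05))       # True: the slot is free
innodb_slots.acquire()  # a long query holds the slot, like select sleep(600) ...
print(ping(0.05))       # False: the ping waits until it times out
innodb_slots.release()  # the long query is interrupted
print(ping(0.05))       # True: pings succeed again
```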

Conclusion: no failover is triggered, and the MHA manager may exit with an error.

Fri Oct  9 12:28:44 2020 - [info] Checking master_ip_failover_script status:
Fri Oct 9 12:28:44 2020 - [info] /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --command=status --ssh_user=root --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358
Fri Oct 9 12:28:44 2020 - [info] OK.
Fri Oct 9 12:28:44 2020 - [warning] shutdown_script is not defined.
Fri Oct 9 12:28:44 2020 - [info] Set master ping interval 3 seconds.
Fri Oct 9 12:28:44 2020 - [info] Set secondary check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12
Fri Oct 9 12:28:44 2020 - [info] Starting ping health check on 172.16.120.10(172.16.120.10:3358)..
Fri Oct 9 12:28:44 2020 - [info] Ping(CONNECT) succeeded, waiting until MySQL doesn't respond..
Fri Oct 9 12:28:53 2020 - [warning] Got timeout on MySQL Ping(CONNECT) child process and killed it! at /usr/local/share/perl5/MHA/HealthCheck.pm line 435.
Fri Oct 9 12:28:53 2020 - [info] Executing SSH check script: exit 0
Fri Oct 9 12:28:53 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12 --user=root --master_host=172.16.120.10 --master_ip=172.16.120.10 --master_port=3358 --master_user=mha --master_password=xxx --ping_type=CONNECT
Fri Oct 9 12:28:53 2020 - [info] Ping(CONNECT) succeeded, waiting until MySQL doesn't respond..
Fri Oct 9 12:28:53 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
Master is reachable from 172.16.120.11!
Fri Oct 9 12:28:53 2020 - [warning] Master is reachable from at least one of other monitoring servers. Failover should not happen.
Fri Oct 9 12:28:56 2020 - [warning] Got timeout on MySQL Ping(CONNECT) child process and killed it! at /usr/local/share/perl5/MHA/HealthCheck.pm line 435.
Fri Oct 9 12:28:56 2020 - [info] Ping(CONNECT) succeeded, waiting until MySQL doesn't respond..
Fri Oct 9 12:28:59 2020 - [warning] Got timeout on MySQL Ping(CONNECT) child process and killed it! at /usr/local/share/perl5/MHA/HealthCheck.pm line 435.
Fri Oct 9 12:28:59 2020 - [info] Ping(CONNECT) succeeded, waiting until MySQL doesn't respond..

...

Fri Oct 9 12:30:47 2020 - [warning] Got timeout on MySQL Ping(CONNECT) child process and killed it! at /usr/local/share/perl5/MHA/HealthCheck.pm line 435.
Fri Oct 9 12:30:47 2020 - [info] Ping(CONNECT) succeeded, waiting until MySQL doesn't respond..

After the select sleep(600) from infra.chk_masterha limit 1 query was terminated manually with Ctrl+C, the MHA manager exited with an error:

Fri Oct 9 12:30:49 2020 - [warning] Got error when monitoring master: at /usr/local/share/perl5/MHA/MasterMonitor.pm line 489.
Fri Oct 9 12:30:49 2020 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln491] Target master's advisory lock is already held by someone. Please check whether you monitor the same master from multiple monitoring processes.
Fri Oct 9 12:30:49 2020 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln511] Error happened on health checking. at /usr/local/bin/masterha_manager line 50.
Fri Oct 9 12:30:49 2020 - [error][/usr/local/share/perl5/MHA/MasterMonitor.pm, ln525] Error happened on monitoring servers.
Fri Oct 9 12:30:49 2020 - [info] Got exit code 1 (Not master dead).
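
Because the manager stopped here without performing a failover, a wrapper or alerting script may want to tell this exit apart from a master-dead exit (the failover log later in this post ends its monitoring phase with "Got exit code 20 (Master dead)."). A minimal sketch that reads the code from such a log line; the `monitor_exit` helper and its regular expression are assumptions based only on the two line shapes shown in this post.

```python
import re

# Matches lines such as: "... - [info] Got exit code 1 (Not master dead)."
EXIT_RE = re.compile(r"Got exit code (\d+) \((.+)\)\.")


def monitor_exit(line):
    """Return (exit_code, description) from a manager log line, or None."""
    match = EXIT_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


line = "Fri Oct 9 12:30:49 2020 - [info] Got exit code 1 (Not master dead)."
print(monitor_exit(line))
```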

ping_type=INSERT

A master hang is hard to simulate directly, so it is simulated indirectly here by lowering innodb_thread_concurrency:

root@localhost 12:25:34 [dbms_monitor]> set global innodb_thread_concurrency=1;
Query OK, 0 rows affected (0.00 sec)

Manually run a query against an InnoDB table so that MHA's insert is blocked:

root@localhost 12:35:21 [dbms_monitor]> show processlist;
+------+----------+---------------------+--------------------+------------------+--------+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------+-----------+---------------+
| Id | User | Host | db | Command | Time | State | Info | Rows_sent | Rows_examined |
+------+----------+---------------------+--------------------+------------------+--------+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------+-----------+---------------+
| 1248 | proxysql | 172.16.120.10:58672 | NULL | Sleep | 3 | | NULL | 1 | 0 |
| 1249 | proxysql | 172.16.120.11:59582 | NULL | Sleep | 1 | | NULL | 1 | 0 |
| 1250 | proxysql | 172.16.120.12:56530 | NULL | Sleep | 8 | | NULL | 0 | 0 |
| 1254 | proxysql | 172.16.120.11:59592 | NULL | Sleep | 1 | | NULL | 0 | 0 |
| 1256 | repler | 172.16.120.11:59594 | NULL | Binlog Dump GTID | 956039 | Master has sent all binlog to slave; waiting for more updates | NULL | 0 | 0 |
| 1257 | repler | 172.16.120.12:56540 | NULL | Binlog Dump GTID | 956019 | Master has sent all binlog to slave; waiting for more updates | NULL | 0 | 0 |
| 1264 | proxysql | 172.16.120.12:56552 | NULL | Sleep | 28 | | NULL | 1 | 0 |
| 1341 | mha | 172.16.120.12:56698 | information_schema | Sleep | 113 | | NULL | 0 | 0 |
| 1343 | mha | 172.16.120.11:59758 | information_schema | Sleep | 51 | | NULL | 0 | 0 |
| 1452 | proxysql | 172.16.120.10:59046 | NULL | Sleep | 13 | | NULL | 0 | 0 |
| 1943 | root | localhost | dbms_monitor | Query | 15 | User sleep | select sleep(600) from infra.chk_masterha limit 1 | 0 | 0 |
| 2260 | root | localhost | dbms_monitor | Query | 0 | starting | show processlist | 0 | 0 |
| 2395 | mha | 172.16.120.13:35206 | NULL | Query | 13 | update | INSERT INTO infra.chk_masterha values (1,unix_timestamp()) ON DUPLICATE KEY UPDATE val=unix_timestam | 0 | 0 |
| 2398 | mha | 172.16.120.13:35208 | NULL | Query | 11 | update | INSERT INTO infra.chk_masterha values (1,unix_timestamp()) ON DUPLICATE KEY UPDATE val=unix_timestam | 0 | 0 |
| 2400 | mha | 172.16.120.11:32908 | NULL | Query | 10 | update | INSERT INTO infra.chk_masterha values (1,unix_timestamp()) ON DUPLICATE KEY UPDATE val=unix_timestam | 0 | 0 |
| 2401 | mha | 172.16.120.13:35216 | NULL | Query | 8 | update | INSERT INTO infra.chk_masterha values (1,unix_timestamp()) ON DUPLICATE KEY UPDATE val=unix_timestam | 0 | 0 |
| 2403 | mha | 172.16.120.12:58066 | NULL | Query | 7 | update | INSERT INTO infra.chk_masterha values (1,unix_timestamp()) ON DUPLICATE KEY UPDATE val=unix_timestam | 0 | 0 |
| 2404 | mha | 172.16.120.13:35222 | NULL | Query | 5 | update | INSERT INTO infra.chk_masterha values (1,unix_timestamp()) ON DUPLICATE KEY UPDATE val=unix_timestam | 0 | 0 |
+------+----------+---------------------+--------------------+------------------+--------+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------+-----------+---------------+
18 rows in set (0.00 sec)

Conclusion: failover is triggered.

Fri Oct  9 12:35:00 2020 - [info] MHA::MasterMonitor version 0.58.
Fri Oct 9 12:35:01 2020 - [info] GTID failover mode = 1
Fri Oct 9 12:35:01 2020 - [info] Dead Servers:
Fri Oct 9 12:35:01 2020 - [info] Alive Servers:
Fri Oct 9 12:35:01 2020 - [info] 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 12:35:01 2020 - [info] 172.16.120.11(172.16.120.11:3358)
Fri Oct 9 12:35:01 2020 - [info] 172.16.120.12(172.16.120.12:3358)
Fri Oct 9 12:35:01 2020 - [info] Alive Slaves:
Fri Oct 9 12:35:01 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct 9 12:35:01 2020 - [info] GTID ON
Fri Oct 9 12:35:01 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 12:35:01 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Oct 9 12:35:01 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct 9 12:35:01 2020 - [info] GTID ON
Fri Oct 9 12:35:01 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 12:35:01 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Oct 9 12:35:01 2020 - [info] Current Alive Master: 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 12:35:01 2020 - [info] Checking slave configurations..
Fri Oct 9 12:35:01 2020 - [info] Checking replication filtering settings..
Fri Oct 9 12:35:01 2020 - [info] binlog_do_db= , binlog_ignore_db=
Fri Oct 9 12:35:01 2020 - [info] Replication filtering check ok.
Fri Oct 9 12:35:01 2020 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Fri Oct 9 12:35:01 2020 - [info] Checking SSH publickey authentication settings on the current master..
Fri Oct 9 12:35:01 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
Fri Oct 9 12:35:01 2020 - [info]
172.16.120.10(172.16.120.10:3358) (current master)
+--172.16.120.11(172.16.120.11:3358)
+--172.16.120.12(172.16.120.12:3358)

Fri Oct 9 12:35:01 2020 - [info] Checking master_ip_failover_script status:
Fri Oct 9 12:35:01 2020 - [info] /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --command=status --ssh_user=root --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358
Fri Oct 9 12:35:01 2020 - [info] OK.
Fri Oct 9 12:35:01 2020 - [warning] shutdown_script is not defined.
Fri Oct 9 12:35:01 2020 - [info] Set master ping interval 3 seconds.
Fri Oct 9 12:35:01 2020 - [info] Set secondary check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12
Fri Oct 9 12:35:01 2020 - [info] Starting ping health check on 172.16.120.10(172.16.120.10:3358)..
Fri Oct 9 12:35:01 2020 - [info] Ping(INSERT) succeeded, waiting until MySQL doesn't respond..
Fri Oct 9 12:35:16 2020 - [warning] Got timeout on MySQL Ping(INSERT) child process and killed it! at /usr/local/share/perl5/MHA/HealthCheck.pm line 435.
Fri Oct 9 12:35:16 2020 - [info] Executing SSH check script: exit 0
Fri Oct 9 12:35:16 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12 --user=root --master_host=172.16.120.10 --master_ip=172.16.120.10 --master_port=3358 --master_user=mha --master_password=xxx --ping_type=INSERT
Fri Oct 9 12:35:17 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
Fri Oct 9 12:35:19 2020 - [warning] Got timeout on MySQL Ping(INSERT) child process and killed it! at /usr/local/share/perl5/MHA/HealthCheck.pm line 435.
Fri Oct 9 12:35:19 2020 - [warning] Connection failed 2 time(s)..
Monitoring server 172.16.120.11 is reachable, Master is not writable from 172.16.120.11. OK.
Fri Oct 9 12:35:22 2020 - [warning] Got timeout on MySQL Ping(INSERT) child process and killed it! at /usr/local/share/perl5/MHA/HealthCheck.pm line 435.
Fri Oct 9 12:35:22 2020 - [warning] Connection failed 3 time(s)..
Monitoring server 172.16.120.12 is reachable, Master is not writable from 172.16.120.12. OK.
Fri Oct 9 12:35:23 2020 - [info] Master is not reachable from all other monitoring servers. Failover should start.
Fri Oct 9 12:35:25 2020 - [warning] Got timeout on MySQL Ping(INSERT) child process and killed it! at /usr/local/share/perl5/MHA/HealthCheck.pm line 435.
Fri Oct 9 12:35:25 2020 - [warning] Connection failed 4 time(s)..
Fri Oct 9 12:35:25 2020 - [warning] Master is not reachable from health checker!
Fri Oct 9 12:35:25 2020 - [warning] Master 172.16.120.10(172.16.120.10:3358) is not reachable!
Fri Oct 9 12:35:25 2020 - [warning] SSH is reachable.
Fri Oct 9 12:35:25 2020 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha/conf/masterha_default.cnf and /etc/masterha/conf/cls_new.cnf again, and trying to connect to all servers to check server status..
Fri Oct 9 12:35:25 2020 - [info] Reading default configuration from /etc/masterha/conf/masterha_default.cnf..
Fri Oct 9 12:35:25 2020 - [info] Reading application default configuration from /etc/masterha/conf/cls_new.cnf..
Fri Oct 9 12:35:25 2020 - [info] Reading server configuration from /etc/masterha/conf/cls_new.cnf..
Fri Oct 9 12:35:27 2020 - [info] GTID failover mode = 1
Fri Oct 9 12:35:27 2020 - [info] Dead Servers:
Fri Oct 9 12:35:27 2020 - [info] 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 12:35:27 2020 - [info] Alive Servers:
Fri Oct 9 12:35:27 2020 - [info] 172.16.120.11(172.16.120.11:3358)
Fri Oct 9 12:35:27 2020 - [info] 172.16.120.12(172.16.120.12:3358)
Fri Oct 9 12:35:27 2020 - [info] Alive Slaves:
Fri Oct 9 12:35:27 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct 9 12:35:27 2020 - [info] GTID ON
Fri Oct 9 12:35:27 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 12:35:27 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Oct 9 12:35:27 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct 9 12:35:27 2020 - [info] GTID ON
Fri Oct 9 12:35:27 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 12:35:27 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Oct 9 12:35:27 2020 - [info] Checking slave configurations..
Fri Oct 9 12:35:27 2020 - [info] Checking replication filtering settings..
Fri Oct 9 12:35:27 2020 - [info] Replication filtering check ok.
Fri Oct 9 12:35:27 2020 - [info] Master is down!
Fri Oct 9 12:35:27 2020 - [info] Terminating monitoring script.
Fri Oct 9 12:35:27 2020 - [info] Got exit code 20 (Master dead).
Fri Oct 9 12:35:27 2020 - [info] MHA::MasterFailover version 0.58.
Fri Oct 9 12:35:27 2020 - [info] Starting master failover.
Fri Oct 9 12:35:27 2020 - [info]
Fri Oct 9 12:35:27 2020 - [info] * Phase 1: Configuration Check Phase..
Fri Oct 9 12:35:27 2020 - [info]
Fri Oct 9 12:35:28 2020 - [info] GTID failover mode = 1
Fri Oct 9 12:35:28 2020 - [info] Dead Servers:
Fri Oct 9 12:35:28 2020 - [info] 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 12:35:28 2020 - [info] Alive Servers:
Fri Oct 9 12:35:28 2020 - [info] 172.16.120.11(172.16.120.11:3358)
Fri Oct 9 12:35:28 2020 - [info] 172.16.120.12(172.16.120.12:3358)
Fri Oct 9 12:35:28 2020 - [info] Alive Slaves:
Fri Oct 9 12:35:28 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct 9 12:35:28 2020 - [info] GTID ON
Fri Oct 9 12:35:28 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 12:35:28 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Oct 9 12:35:28 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct 9 12:35:28 2020 - [info] GTID ON
Fri Oct 9 12:35:28 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 12:35:28 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Oct 9 12:35:28 2020 - [info] Starting GTID based failover.
Fri Oct 9 12:35:28 2020 - [info]
Fri Oct 9 12:35:28 2020 - [info] ** Phase 1: Configuration Check Phase completed.
Fri Oct 9 12:35:28 2020 - [info]
Fri Oct 9 12:35:28 2020 - [info] * Phase 2: Dead Master Shutdown Phase..
Fri Oct 9 12:35:28 2020 - [info]
Fri Oct 9 12:35:28 2020 - [info] Forcing shutdown so that applications never connect to the current master..
Fri Oct 9 12:35:28 2020 - [info] Executing master IP deactivation script:
Fri Oct 9 12:35:28 2020 - [info] /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358 --command=stopssh --ssh_user=root
Disabling the VIP on old master: 172.16.120.10
start down vipRTNETLINK answers: Cannot assign requested address
Fake!!! 原主库 rpl_semi_sync_master_enabled=0 rpl_semi_sync_slave_enabled=1
Fri Oct 9 12:35:28 2020 - [info] done.
Fri Oct 9 12:35:28 2020 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Fri Oct 9 12:35:28 2020 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Fri Oct 9 12:35:28 2020 - [info]
Fri Oct 9 12:35:28 2020 - [info] * Phase 3: Master Recovery Phase..
Fri Oct 9 12:35:28 2020 - [info]
Fri Oct 9 12:35:28 2020 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Fri Oct 9 12:35:28 2020 - [info]
Fri Oct 9 12:35:28 2020 - [info] The latest binary log file/position on all slaves is mysql-bin.000009:827239
Fri Oct 9 12:35:28 2020 - [info] Retrieved Gtid Set: 44a4ea53-fcad-11ea-bd16-0050563b7b42:8822-10390
Fri Oct 9 12:35:28 2020 - [info] Latest slaves (Slaves that received relay log files to the latest):
Fri Oct 9 12:35:28 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct 9 12:35:28 2020 - [info] GTID ON
Fri Oct 9 12:35:28 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 12:35:28 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Oct 9 12:35:28 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct 9 12:35:28 2020 - [info] GTID ON
Fri Oct 9 12:35:28 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 12:35:28 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Oct 9 12:35:28 2020 - [info] The oldest binary log file/position on all slaves is mysql-bin.000009:827239
Fri Oct 9 12:35:28 2020 - [info] Retrieved Gtid Set: 44a4ea53-fcad-11ea-bd16-0050563b7b42:8822-10390
Fri Oct 9 12:35:28 2020 - [info] Oldest slaves:
Fri Oct 9 12:35:28 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct 9 12:35:28 2020 - [info] GTID ON
Fri Oct 9 12:35:28 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 12:35:28 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Oct 9 12:35:28 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct 9 12:35:28 2020 - [info] GTID ON
Fri Oct 9 12:35:28 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 12:35:28 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Oct 9 12:35:28 2020 - [info]
Fri Oct 9 12:35:28 2020 - [info] * Phase 3.3: Determining New Master Phase..
Fri Oct 9 12:35:28 2020 - [info]
Fri Oct 9 12:35:28 2020 - [info] Searching new master from slaves..
Fri Oct 9 12:35:28 2020 - [info] Candidate masters from the configuration file:
Fri Oct 9 12:35:28 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct 9 12:35:28 2020 - [info] GTID ON
Fri Oct 9 12:35:28 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 12:35:28 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Oct 9 12:35:28 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct 9 12:35:28 2020 - [info] GTID ON
Fri Oct 9 12:35:28 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 12:35:28 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Oct 9 12:35:28 2020 - [info] Non-candidate masters:
Fri Oct 9 12:35:28 2020 - [info] Searching from candidate_master slaves which have received the latest relay log events..
Fri Oct 9 12:35:28 2020 - [info] New master is 172.16.120.11(172.16.120.11:3358)
Fri Oct 9 12:35:28 2020 - [info] Starting master failover..
Fri Oct 9 12:35:28 2020 - [info]
From:
172.16.120.10(172.16.120.10:3358) (current master)
+--172.16.120.11(172.16.120.11:3358)
+--172.16.120.12(172.16.120.12:3358)

To:
172.16.120.11(172.16.120.11:3358) (new master)
+--172.16.120.12(172.16.120.12:3358)
Fri Oct 9 12:35:28 2020 - [info]
Fri Oct 9 12:35:28 2020 - [info] * Phase 3.3: New Master Recovery Phase..
Fri Oct 9 12:35:28 2020 - [info]
Fri Oct 9 12:35:28 2020 - [info] Waiting all logs to be applied..
Fri Oct 9 12:35:28 2020 - [info] done.
Fri Oct 9 12:35:28 2020 - [info] Getting new master's binlog name and position..
Fri Oct 9 12:35:28 2020 - [info] mysql-bin.000008:811243
Fri Oct 9 12:35:28 2020 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='172.16.120.11', MASTER_PORT=3358, MASTER_AUTO_POSITION=1, MASTER_USER='repler', MASTER_PASSWORD='xxx';
Fri Oct 9 12:35:28 2020 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: mysql-bin.000008, 811243, 44a4ea53-fcad-11ea-bd16-0050563b7b42:1-10390,
45d1f02a-fcad-11ea-8a44-0050562f2198:1-27
Fri Oct 9 12:35:28 2020 - [info] Executing master IP activate script:
Fri Oct 9 12:35:28 2020 - [info] /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --command=start --ssh_user=root --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358 --new_master_host=172.16.120.11 --new_master_ip=172.16.120.11 --new_master_port=3358 --new_master_user='mha' --new_master_password=xxx
Enabling the VIP - 172.16.120.128 on the new master - 172.16.120.11
Fake!!! 新主库 rpl_semi_sync_master_enabled=1 rpl_semi_sync_slave_enabled=0
Set read_only=0 on the new master.
Creating app user on the new master..
Fri Oct 9 12:35:28 2020 - [info] OK.
Fri Oct 9 12:35:28 2020 - [info] ** Finished master recovery successfully.
Fri Oct 9 12:35:28 2020 - [info] * Phase 3: Master Recovery Phase completed.
Fri Oct 9 12:35:28 2020 - [info]
Fri Oct 9 12:35:28 2020 - [info] * Phase 4: Slaves Recovery Phase..
Fri Oct 9 12:35:28 2020 - [info]
Fri Oct 9 12:35:28 2020 - [info]
Fri Oct 9 12:35:28 2020 - [info] * Phase 4.1: Starting Slaves in parallel..
Fri Oct 9 12:35:28 2020 - [info]
Fri Oct 9 12:35:28 2020 - [info] -- Slave recovery on host 172.16.120.12(172.16.120.12:3358) started, pid: 44798. Check tmp log /masterha/cls_new//172.16.120.12_3358_20201009123527.log if it takes time..
Fri Oct 9 12:35:29 2020 - [info]
Fri Oct 9 12:35:29 2020 - [info] Log messages from 172.16.120.12 ...
Fri Oct 9 12:35:29 2020 - [info]
Fri Oct 9 12:35:28 2020 - [info] Resetting slave 172.16.120.12(172.16.120.12:3358) and starting replication from the new master 172.16.120.11(172.16.120.11:3358)..
Fri Oct 9 12:35:28 2020 - [info] Executed CHANGE MASTER.
Fri Oct 9 12:35:28 2020 - [info] Slave started.
Fri Oct 9 12:35:28 2020 - [info] gtid_wait(44a4ea53-fcad-11ea-bd16-0050563b7b42:1-10390,
45d1f02a-fcad-11ea-8a44-0050562f2198:1-27) completed on 172.16.120.12(172.16.120.12:3358). Executed 0 events.
Fri Oct 9 12:35:29 2020 - [info] End of log messages from 172.16.120.12.
Fri Oct 9 12:35:29 2020 - [info] -- Slave on host 172.16.120.12(172.16.120.12:3358) started.
Fri Oct 9 12:35:29 2020 - [info] All new slave servers recovered successfully.
Fri Oct 9 12:35:29 2020 - [info]
Fri Oct 9 12:35:29 2020 - [info] * Phase 5: New master cleanup phase..
Fri Oct 9 12:35:29 2020 - [info]
Fri Oct 9 12:35:29 2020 - [info] Resetting slave info on the new master..
Fri Oct 9 12:35:29 2020 - [info] 172.16.120.11: Resetting slave info succeeded.
Fri Oct 9 12:35:29 2020 - [info] Master failover to 172.16.120.11(172.16.120.11:3358) completed successfully.
Fri Oct 9 12:35:29 2020 - [info]

----- Failover Report -----

cls_new: MySQL Master failover 172.16.120.10(172.16.120.10:3358) to 172.16.120.11(172.16.120.11:3358) succeeded

Master 172.16.120.10(172.16.120.10:3358) is down!

Check MHA Manager logs at centos-4:/masterha/cls_new/manager.log for details.

Started automated(non-interactive) failover.
Invalidated master IP address on 172.16.120.10(172.16.120.10:3358)
Selected 172.16.120.11(172.16.120.11:3358) as a new master.
172.16.120.11(172.16.120.11:3358): OK: Applying all logs succeeded.
172.16.120.11(172.16.120.11:3358): OK: Activated master IP address.
172.16.120.12(172.16.120.12:3358): OK: Slave started, replicating from 172.16.120.11(172.16.120.11:3358)
172.16.120.11(172.16.120.11:3358): Resetting slave info succeeded.
Master failover to 172.16.120.11(172.16.120.11:3358) completed successfully.
Fri Oct 9 12:35:29 2020 - [info] Sending mail..
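
The phase markers in the failover log above make it easy to see how long each step took. The following is a minimal sketch, assuming the manager.log line format shown in this post; `phase_timeline` and the regular expression are illustrative names and are not part of MHA.

```python
import re
from datetime import datetime

# Matches manager.log lines such as
# "Fri Oct 9 12:35:27 2020 - [info] * Phase 1: Configuration Check Phase.."
PHASE_RE = re.compile(
    r"^(?P<ts>\w{3} \w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2} \d{4}) - \[info\]\s+"
    r"\*+ (?P<phase>Phase [\d.]+: .*?)\.*$"
)


def phase_timeline(lines):
    """Return (timestamp, phase text) for every phase marker in the log."""
    events = []
    for line in lines:
        m = PHASE_RE.match(line.strip())
        if m:
            ts = datetime.strptime(m.group("ts"), "%a %b %d %H:%M:%S %Y")
            events.append((ts, m.group("phase")))
    return events


# A few lines copied from the log above
sample = [
    "Fri Oct 9 12:35:27 2020 - [info] * Phase 1: Configuration Check Phase..",
    "Fri Oct 9 12:35:28 2020 - [info] ** Phase 1: Configuration Check Phase completed.",
    "Fri Oct 9 12:35:28 2020 - [info] Starting GTID based failover.",
    "Fri Oct 9 12:35:29 2020 - [info] * Phase 5: New master cleanup phase..",
]

events = phase_timeline(sample)
for ts, phase in events:
    print(ts.time(), phase)

# Seconds elapsed between the first and the last phase marker
print((events[-1][0] - events[0][0]).total_seconds())
```

Feeding the whole manager.log through `phase_timeline` instead of `sample` gives the complete timeline of a failover run.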

None of the following errors triggers a failover, not even a manual failover run with --master_state=dead:

our @ALIVE_ERROR_CODES = (
1040, # ER_CON_COUNT_ERROR -- too many connection
1042, # ER_BAD_HOST_ERROR -- Can't get hostname for your address
1043, # ER_HANDSHAKE_ERROR -- Bad handshake
1044, # ER_DBACCESS_DENIED_ERROR -- Access denied for user '%s'@'%s' to database '%s'
1045, # ER_ACCESS_DENIED_ERROR -- Access denied for user '%s'@'%s' (using password: %s)
1129, # ER_HOST_IS_BLOCKED -- Host '%s' is blocked because of many connection errors; unblock with 'mysqladmin flush-hosts'
1130, # ER_HOST_NOT_PRIVILEGED -- Host '%s' is not allowed to connect to this MySQL server
1203, # ER_TOO_MANY_USER_CONNECTIONS -- User %s already has more than 'max_user_connections' active connections
1226, # ER_USER_LIMIT_REACHED -- User '%s' has exceeded the '%s' resource (current value: %ld)
1251, # ER_NOT_SUPPORTED_AUTH_MODE -- Client does not support authentication protocol requested by server; consider upgrading MySQL client
1275, # ER_SERVER_IS_IN_SECURE_AUTH_MODE -- Server is running in --secure-auth mode, but '%s'@'%s' has a password in the old format; please change the password to the new format
);
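
The idea behind this list can be sketched as follows: an error code that only the MySQL server itself can return proves that mysqld is still running, so the master is treated as alive. This is an illustration, not MHA's code; `master_answered` is a hypothetical helper, and the codes are copied from the list above.

```python
# Error codes copied from the @ALIVE_ERROR_CODES list above.
ALIVE_ERROR_CODES = frozenset(
    {1040, 1042, 1043, 1044, 1045, 1129, 1130, 1203, 1226, 1251, 1275}
)


def master_answered(mysql_errno):
    """True when the error code is one the server sends back itself."""
    return mysql_errno in ALIVE_ERROR_CODES


# 1040 is "too many connections"; 2003 and 2006 are the client-side
# errors that show up in the manager logs of this post.
for errno in (1040, 2003, 2006):
    verdict = "master alive, no failover" if master_answered(errno) else "keep checking"
    print(errno, verdict)
```

Codes 2003 and 2006 are not in the list, which is why the network tests in this post still go on to the secondary check.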

For details, see the post "MHA-为什么too many connection不会failover?" (why "too many connections" does not trigger a failover).

[Test case] Network failure between master and MHA manager, scenario 1

Manager <-- unreachable --> Master
Manager <-- OK --> S1 <-- OK --> master
Manager <-- OK --> S2 <-- OK --> master

ping_type=CONNECT

IPTABLES="/sbin/iptables"
$IPTABLES -F
$IPTABLES -A INPUT -p icmp --icmp-type any -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.10 -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.11 -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.12 -j ACCEPT
$IPTABLES -A INPUT -p tcp --syn -j DROP

Conclusion: no failover is triggered.

Fri Oct  9 15:29:50 2020 - [info] MHA::MasterMonitor version 0.58.
Fri Oct 9 15:29:51 2020 - [info] GTID failover mode = 1
Fri Oct 9 15:29:51 2020 - [info] Dead Servers:
Fri Oct 9 15:29:51 2020 - [info] Alive Servers:
Fri Oct 9 15:29:51 2020 - [info] 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 15:29:51 2020 - [info] 172.16.120.11(172.16.120.11:3358)
Fri Oct 9 15:29:51 2020 - [info] 172.16.120.12(172.16.120.12:3358)
Fri Oct 9 15:29:51 2020 - [info] Alive Slaves:
Fri Oct 9 15:29:51 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct 9 15:29:51 2020 - [info] GTID ON
Fri Oct 9 15:29:51 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 15:29:51 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Oct 9 15:29:51 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct 9 15:29:51 2020 - [info] GTID ON
Fri Oct 9 15:29:51 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 15:29:51 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Oct 9 15:29:51 2020 - [info] Current Alive Master: 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 15:29:51 2020 - [info] Checking slave configurations..
Fri Oct 9 15:29:51 2020 - [info] Checking replication filtering settings..
Fri Oct 9 15:29:51 2020 - [info] binlog_do_db= , binlog_ignore_db=
Fri Oct 9 15:29:51 2020 - [info] Replication filtering check ok.
Fri Oct 9 15:29:51 2020 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Fri Oct 9 15:29:51 2020 - [info] Checking SSH publickey authentication settings on the current master..
Fri Oct 9 15:29:51 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
Fri Oct 9 15:29:51 2020 - [info]
172.16.120.10(172.16.120.10:3358) (current master)
+--172.16.120.11(172.16.120.11:3358)
+--172.16.120.12(172.16.120.12:3358)

Fri Oct 9 15:29:51 2020 - [info] Checking master_ip_failover_script status:
Fri Oct 9 15:29:51 2020 - [info] /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --command=status --ssh_user=root --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358
Fri Oct 9 15:29:51 2020 - [info] OK.
Fri Oct 9 15:29:51 2020 - [warning] shutdown_script is not defined.
Fri Oct 9 15:29:51 2020 - [info] Set master ping interval 3 seconds.
Fri Oct 9 15:29:51 2020 - [info] Set secondary check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12
Fri Oct 9 15:29:51 2020 - [info] Starting ping health check on 172.16.120.10(172.16.120.10:3358)..
Fri Oct 9 15:29:51 2020 - [info] Ping(CONNECT) succeeded, waiting until MySQL doesn't respond..
Fri Oct 9 15:32:56 2020 - [warning] Got error on MySQL connect ping: DBI connect(';host=172.16.120.10;port=3358;mysql_connect_timeout=1','mha',...) failed: Can't connect to MySQL server on '172.16.120.10' (4) at /usr/local/share/perl5/MHA/HealthCheck.pm line 98.
2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 15:32:56 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12 --user=root --master_host=172.16.120.10 --master_ip=172.16.120.10 --master_port=3358 --master_user=mha --master_password=xxx --ping_type=CONNECT
Fri Oct 9 15:32:56 2020 - [info] Executing SSH check script: exit 0
Master is reachable from 172.16.120.11!
Fri Oct 9 15:32:56 2020 - [warning] Master is reachable from at least one of other monitoring servers. Failover should not happen.
Fri Oct 9 15:33:00 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 15:33:00 2020 - [warning] Connection failed 2 time(s)..
Fri Oct 9 15:33:01 2020 - [warning] HealthCheck: Got timeout on checking SSH connection to 172.16.120.10! at /usr/local/share/perl5/MHA/HealthCheck.pm line 344.
Fri Oct 9 15:33:03 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 15:33:03 2020 - [warning] Connection failed 3 time(s)..
Fri Oct 9 15:33:06 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 15:33:06 2020 - [warning] Connection failed 4 time(s)..
Fri Oct 9 15:33:06 2020 - [warning] Secondary network check script returned errors. Failover should not start so checking server status again. Check network settings for details.
Fri Oct 9 15:33:09 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 15:33:09 2020 - [warning] Connection failed 1 time(s)..
Fri Oct 9 15:33:09 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12 --user=root --master_host=172.16.120.10 --master_ip=172.16.120.10 --master_port=3358 --master_user=mha --master_password=xxx --ping_type=CONNECT
Fri Oct 9 15:33:09 2020 - [info] Executing SSH check script: exit 0
Master is reachable from 172.16.120.11!
Fri Oct 9 15:33:09 2020 - [warning] Master is reachable from at least one of other monitoring servers. Failover should not happen.
Fri Oct 9 15:33:12 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 15:33:12 2020 - [warning] Connection failed 2 time(s)..
Fri Oct 9 15:33:14 2020 - [warning] HealthCheck: Got timeout on checking SSH connection to 172.16.120.10! at /usr/local/share/perl5/MHA/HealthCheck.pm line 344.
Fri Oct 9 15:33:15 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 15:33:15 2020 - [warning] Connection failed 3 time(s)..
Fri Oct 9 15:33:18 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 15:33:18 2020 - [warning] Connection failed 4 time(s)..
Fri Oct 9 15:33:18 2020 - [warning] Secondary network check script returned errors. Failover should not start so checking server status again. Check network settings for details.
Fri Oct 9 15:33:21 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 15:33:21 2020 - [warning] Connection failed 1 time(s)..
Fri Oct 9 15:33:21 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12 --user=root --master_host=172.16.120.10 --master_ip=172.16.120.10 --master_port=3358 --master_user=mha --master_password=xxx --ping_type=CONNECT
Fri Oct 9 15:33:21 2020 - [info] Executing SSH check script: exit 0
Master is reachable from 172.16.120.11!
Fri Oct 9 15:33:21 2020 - [warning] Master is reachable from at least one of other monitoring servers. Failover should not happen.
Fri Oct 9 15:33:24 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 15:33:24 2020 - [warning] Connection failed 2 time(s)..
Fri Oct 9 15:33:26 2020 - [warning] HealthCheck: Got timeout on checking SSH connection to 172.16.120.10! at /usr/local/share/perl5/MHA/HealthCheck.pm line 344.
Fri Oct 9 15:33:27 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 15:33:27 2020 - [warning] Connection failed 3 time(s)..
Fri Oct 9 15:33:30 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 15:33:30 2020 - [warning] Connection failed 4 time(s)..
Fri Oct 9 15:33:30 2020 - [warning] Secondary network check script returned errors. Failover should not start so checking server status again. Check network settings for details.
Fri Oct 9 15:33:33 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 15:33:33 2020 - [warning] Connection failed 1 time(s)..
Fri Oct 9 15:33:33 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12 --user=root --master_host=172.16.120.10 --master_ip=172.16.120.10 --master_port=3358 --master_user=mha --master_password=xxx --ping_type=CONNECT
Fri Oct 9 15:33:33 2020 - [info] Executing SSH check script: exit 0
Master is reachable from 172.16.120.11!
Fri Oct 9 15:33:34 2020 - [warning] Master is reachable from at least one of other monitoring servers. Failover should not happen.
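
The verdict that keeps repeating in this log can be summarised with a small model. It is a sketch inferred from the log messages and the results table in this post, not a copy of masterha_secondary_check; the `Probe` enum and `failover_should_start` are hypothetical names.

```python
from enum import Enum


class Probe(Enum):
    """Outcome of probing the master through one -s host."""
    MASTER_UNREACHABLE = "slave reached over SSH, master not reachable from it"
    MASTER_REACHABLE = "slave reached over SSH, master reachable from it"
    SSH_FAILED = "manager could not SSH to the slave"


def failover_should_start(probes):
    """probes: one Probe per -s host, in command-line order."""
    if not probes:
        raise ValueError("at least one monitoring server is required")
    for probe in probes:
        if probe is not Probe.MASTER_UNREACHABLE:
            # Either the master is still visible from somewhere, or the
            # manager's own network is suspect: both veto the failover.
            return False
    return True


# Scenario of this section: both slaves still reach the master.
print(failover_should_start([Probe.MASTER_REACHABLE, Probe.MASTER_REACHABLE]))
# Failover case: every slave is reachable and none of them sees the master.
print(failover_should_start([Probe.MASTER_UNREACHABLE, Probe.MASTER_UNREACHABLE]))
```

Under this model a single slave that still reaches the master, or a single slave the manager cannot SSH to, is enough to block the failover, which matches the "no failover" rows in the TL;DR table.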

ping_type=INSERT

IPTABLES="/sbin/iptables"
$IPTABLES -F
$IPTABLES -A INPUT -p icmp --icmp-type any -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.10 -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.11 -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.12 -j ACCEPT
$IPTABLES -A INPUT -p tcp --syn -j DROP

At this point the manager still works normally, because ping_type=INSERT uses a persistent connection.
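
Why the existing connection survives can be shown by walking a packet through the rules above in order. This is a simplified sketch that assumes the INPUT chain's default policy is ACCEPT (the post does not show it) and that the manager is 172.16.120.13, as listed in the environment section; `Packet` and `input_chain` are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    proto: str          # "tcp" or "icmp"
    src: str            # source IP address
    syn: bool = False   # True only for the first packet of a new TCP connection


# Source addresses accepted by the three "-s" rules
TRUSTED = {"172.16.120.10", "172.16.120.11", "172.16.120.12"}


def input_chain(pkt, policy="ACCEPT"):
    """First matching rule wins; the chain policy applies otherwise."""
    if pkt.proto == "icmp":
        return "ACCEPT"   # -p icmp --icmp-type any -j ACCEPT
    if pkt.proto == "tcp" and pkt.src in TRUSTED:
        return "ACCEPT"   # -p tcp -s <master or slave> -j ACCEPT
    if pkt.proto == "tcp" and pkt.syn:
        return "DROP"     # -p tcp --syn -j DROP
    return policy         # assumed default policy


manager = "172.16.120.13"
print(input_chain(Packet("tcp", manager, syn=True)))    # new connection attempt
print(input_chain(Packet("tcp", manager, syn=False)))   # packet on an open connection
print(input_chain(Packet("icmp", manager)))             # ping
```

Only new connection attempts from the manager hit the DROP rule, so the health check keeps working until its connection is closed.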

Kill the connection

root@localhost 15:39:31 [dbms_monitor]> show processlist;
+------+----------+---------------------+--------------------+------------------+-------+---------------------------------------------------------------+------------------+-----------+---------------+
| Id | User | Host | db | Command | Time | State | Info | Rows_sent | Rows_examined |
+------+----------+---------------------+--------------------+------------------+-------+---------------------------------------------------------------+------------------+-----------+---------------+
| 1248 | proxysql | 172.16.120.10:58672 | NULL | Sleep | 4 | | NULL | 0 | 0 |
| 1249 | proxysql | 172.16.120.11:59582 | NULL | Sleep | 22 | | NULL | 0 | 0 |
| 1250 | proxysql | 172.16.120.12:56530 | NULL | Sleep | 9 | | NULL | 1 | 0 |
| 1254 | proxysql | 172.16.120.11:59592 | NULL | Sleep | 2 | | NULL | 1 | 0 |
| 1264 | proxysql | 172.16.120.12:56552 | NULL | Sleep | 19 | | NULL | 0 | 0 |
| 1341 | mha | 172.16.120.12:56698 | information_schema | Sleep | 111 | | NULL | 0 | 0 |
| 1343 | mha | 172.16.120.11:59758 | information_schema | Sleep | 49 | | NULL | 0 | 0 |
| 1452 | proxysql | 172.16.120.10:59046 | NULL | Sleep | 4 | | NULL | 1 | 0 |
| 2409 | repler | 172.16.120.11:32918 | NULL | Binlog Dump GTID | 10898 | Master has sent all binlog to slave; waiting for more updates | NULL | 0 | 0 |
| 2411 | repler | 172.16.120.12:58084 | NULL | Binlog Dump GTID | 10873 | Master has sent all binlog to slave; waiting for more updates | NULL | 0 | 0 |
| 2627 | root | localhost | dbms_monitor | Query | 0 | starting | show processlist | 0 | 0 |
| 2836 | mha | 172.16.120.13:35810 | NULL | Sleep | 2 | | NULL | 0 | 0 |
+------+----------+---------------------+--------------------+------------------+-------+---------------------------------------------------------------+------------------+-----------+---------------+
12 rows in set (0.00 sec)

root@localhost 15:39:39 [dbms_monitor]> kill 2836;
Query OK, 0 rows affected (0.00 sec)

Conclusion: no failover is triggered.

Fri Oct  9 15:37:54 2020 - [info] MHA::MasterMonitor version 0.58.
Fri Oct 9 15:37:55 2020 - [info] GTID failover mode = 1
Fri Oct 9 15:37:55 2020 - [info] Dead Servers:
Fri Oct 9 15:37:55 2020 - [info] Alive Servers:
Fri Oct 9 15:37:55 2020 - [info] 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 15:37:55 2020 - [info] 172.16.120.11(172.16.120.11:3358)
Fri Oct 9 15:37:55 2020 - [info] 172.16.120.12(172.16.120.12:3358)
Fri Oct 9 15:37:55 2020 - [info] Alive Slaves:
Fri Oct 9 15:37:55 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct 9 15:37:55 2020 - [info] GTID ON
Fri Oct 9 15:37:55 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 15:37:55 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Oct 9 15:37:55 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct 9 15:37:55 2020 - [info] GTID ON
Fri Oct 9 15:37:55 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 15:37:55 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Oct 9 15:37:55 2020 - [info] Current Alive Master: 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 15:37:55 2020 - [info] Checking slave configurations..
Fri Oct 9 15:37:55 2020 - [info] Checking replication filtering settings..
Fri Oct 9 15:37:55 2020 - [info] binlog_do_db= , binlog_ignore_db=
Fri Oct 9 15:37:55 2020 - [info] Replication filtering check ok.
Fri Oct 9 15:37:55 2020 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Fri Oct 9 15:37:55 2020 - [info] Checking SSH publickey authentication settings on the current master..
Fri Oct 9 15:37:55 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
Fri Oct 9 15:37:55 2020 - [info]
172.16.120.10(172.16.120.10:3358) (current master)
+--172.16.120.11(172.16.120.11:3358)
+--172.16.120.12(172.16.120.12:3358)

Fri Oct 9 15:37:55 2020 - [info] Checking master_ip_failover_script status:
Fri Oct 9 15:37:55 2020 - [info] /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --command=status --ssh_user=root --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358
Fri Oct 9 15:37:55 2020 - [info] OK.
Fri Oct 9 15:37:55 2020 - [warning] shutdown_script is not defined.
Fri Oct 9 15:37:55 2020 - [info] Set master ping interval 3 seconds.
Fri Oct 9 15:37:55 2020 - [info] Set secondary check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12
Fri Oct 9 15:37:55 2020 - [info] Starting ping health check on 172.16.120.10(172.16.120.10:3358)..
Fri Oct 9 15:37:55 2020 - [info] Ping(INSERT) succeeded, waiting until MySQL doesn't respond..
Fri Oct 9 15:39:46 2020 - [warning] Got error on MySQL insert ping: 2006 (MySQL server has gone away)
Fri Oct 9 15:39:46 2020 - [info] Executing SSH check script: exit 0
Fri Oct 9 15:39:46 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12 --user=root --master_host=172.16.120.10 --master_ip=172.16.120.10 --master_port=3358 --master_user=mha --master_password=xxx --ping_type=INSERT
Master is reachable from 172.16.120.11!
Fri Oct 9 15:39:47 2020 - [warning] Master is reachable from at least one of other monitoring servers. Failover should not happen.
Fri Oct 9 15:39:51 2020 - [warning] HealthCheck: Got timeout on checking SSH connection to 172.16.120.10! at /usr/local/share/perl5/MHA/HealthCheck.pm line 344.
Fri Oct 9 15:39:52 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 15:39:52 2020 - [warning] Connection failed 2 time(s)..
Fri Oct 9 15:39:55 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 15:39:55 2020 - [warning] Connection failed 3 time(s)..
Fri Oct 9 15:39:58 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 15:39:58 2020 - [warning] Connection failed 4 time(s)..
Fri Oct 9 15:39:58 2020 - [warning] Secondary network check script returned errors. Failover should not start so checking server status again. Check network settings for details.
Fri Oct 9 15:40:01 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 15:40:01 2020 - [warning] Connection failed 1 time(s)..
Fri Oct 9 15:40:01 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12 --user=root --master_host=172.16.120.10 --master_ip=172.16.120.10 --master_port=3358 --master_user=mha --master_password=xxx --ping_type=INSERT
Fri Oct 9 15:40:01 2020 - [info] Executing SSH check script: exit 0
Master is reachable from 172.16.120.11!
Fri Oct 9 15:40:01 2020 - [warning] Master is reachable from at least one of other monitoring servers. Failover should not happen.
Fri Oct 9 15:40:04 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 15:40:04 2020 - [warning] Connection failed 2 time(s)..
Fri Oct 9 15:40:06 2020 - [warning] HealthCheck: Got timeout on checking SSH connection to 172.16.120.10! at /usr/local/share/perl5/MHA/HealthCheck.pm line 344.
Fri Oct 9 15:40:07 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 15:40:07 2020 - [warning] Connection failed 3 time(s)..
Fri Oct 9 15:40:10 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 15:40:10 2020 - [warning] Connection failed 4 time(s)..
Fri Oct 9 15:40:10 2020 - [warning] Secondary network check script returned errors. Failover should not start so checking server status again. Check network settings for details.
Fri Oct 9 15:40:13 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 15:40:13 2020 - [warning] Connection failed 1 time(s)..
Fri Oct 9 15:40:13 2020 - [info] Executing SSH check script: exit 0
Fri Oct 9 15:40:13 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12 --user=root --master_host=172.16.120.10 --master_ip=172.16.120.10 --master_port=3358 --master_user=mha --master_password=xxx --ping_type=INSERT
Master is reachable from 172.16.120.11!
Fri Oct 9 15:40:13 2020 - [warning] Master is reachable from at least one of other monitoring servers. Failover should not happen.
Fri Oct 9 15:40:14 2020 - [info] Ping(INSERT) succeeded, waiting until MySQL doesn't respond..
Fri Oct 9 15:40:14 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
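
The counter in "Connection failed N time(s).." climbs to 4 and then starts again at 1 whenever the secondary check vetoes the failover. A rough replay of that cycle, inferred only from the log above: the real monitor runs the secondary check in a parallel process, whereas here its verdict is simply passed in as a flag, and `replay` and `max_failures` are hypothetical names.

```python
def replay(pings, secondary_vetoes, max_failures=4):
    """Replay ping results (True = success) and return log-like events.

    secondary_vetoes: True when the secondary check reports that the
    master is still reachable from another monitoring server.
    """
    events = []
    failures = 0
    for ok in pings:
        if ok:
            failures = 0
            events.append("ping ok")
            continue
        failures += 1
        events.append(f"connection failed {failures} time(s)")
        if failures >= max_failures:
            if secondary_vetoes:
                # Matches "Failover should not start so checking server status again."
                events.append("no failover, checking again")
                failures = 0
            else:
                events.append("start failover")
                break
    return events


# Five failed pings in a row while the slaves still reach the master
for event in replay([False] * 5, secondary_vetoes=True):
    print(event)
```

With `secondary_vetoes=False` the same input stops after the fourth failure with "start failover", which corresponds to the failover log at the top of this section.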

[Test case] Network failure between master and MHA manager, scenario 2

Manager <-- unreachable --> Master
Manager <-- unreachable --> S1 <-- OK --> master
Manager <-- OK --> S2 <-- OK --> master

ping_type=CONNECT

slave-1

IPTABLES="/sbin/iptables"
$IPTABLES -F
$IPTABLES -A INPUT -p icmp --icmp-type any -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.10 -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.11 -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.12 -j ACCEPT
$IPTABLES -A INPUT -p tcp --syn -j DROP

The manager can no longer open a connection to slave-1 (ping still succeeds, but ssh hangs):

#ping centos-2
PING centos-2 (172.16.120.11) 56(84) bytes of data.
64 bytes from centos-2 (172.16.120.11): icmp_seq=1 ttl=64 time=0.349 ms
64 bytes from centos-2 (172.16.120.11): icmp_seq=2 ttl=64 time=0.651 ms
^C
--- centos-2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.349/0.500/0.651/0.151 ms

[root@centos-4 15:48:55 /usr/local/share/perl5/MHA]
#ssh centos-2
^C
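
The ping/ssh pair above shows ICMP getting through while TCP hangs. Instead of waiting on an interactive ssh, the same check can be done with a TCP connect that has a short timeout. This is a minimal sketch and not part of MHA; the helper name `tcp_reachable` and the 2-second default timeout are assumptions made for illustration.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections as well as timeouts
        # (socket.timeout is a subclass of OSError).
        return False
```

Run on the manager with the rules above in place, `tcp_reachable("172.16.120.11", 22)` would be expected to return False after the timeout, since the SYN packet is silently dropped rather than rejected.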

master

IPTABLES="/sbin/iptables"
$IPTABLES -F
$IPTABLES -A INPUT -p icmp --icmp-type any -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.10 -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.11 -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.12 -j ACCEPT
$IPTABLES -A INPUT -p tcp --syn -j DROP

Conclusion: failover is not triggered.

Fri Oct  9 15:48:03 2020 - [info] MHA::MasterMonitor version 0.58.
Fri Oct 9 15:48:05 2020 - [info] GTID failover mode = 1
Fri Oct 9 15:48:05 2020 - [info] Dead Servers:
Fri Oct 9 15:48:05 2020 - [info] Alive Servers:
Fri Oct 9 15:48:05 2020 - [info] 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 15:48:05 2020 - [info] 172.16.120.11(172.16.120.11:3358)
Fri Oct 9 15:48:05 2020 - [info] 172.16.120.12(172.16.120.12:3358)
Fri Oct 9 15:48:05 2020 - [info] Alive Slaves:
Fri Oct 9 15:48:05 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct 9 15:48:05 2020 - [info] GTID ON
Fri Oct 9 15:48:05 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 15:48:05 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Oct 9 15:48:05 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct 9 15:48:05 2020 - [info] GTID ON
Fri Oct 9 15:48:05 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 15:48:05 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Oct 9 15:48:05 2020 - [info] Current Alive Master: 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 15:48:05 2020 - [info] Checking slave configurations..
Fri Oct 9 15:48:05 2020 - [info] Checking replication filtering settings..
Fri Oct 9 15:48:05 2020 - [info] binlog_do_db= , binlog_ignore_db=
Fri Oct 9 15:48:05 2020 - [info] Replication filtering check ok.
Fri Oct 9 15:48:05 2020 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Fri Oct 9 15:48:05 2020 - [info] Checking SSH publickey authentication settings on the current master..
Fri Oct 9 15:48:05 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
Fri Oct 9 15:48:05 2020 - [info]
172.16.120.10(172.16.120.10:3358) (current master)
+--172.16.120.11(172.16.120.11:3358)
+--172.16.120.12(172.16.120.12:3358)

Fri Oct 9 15:48:05 2020 - [info] Checking master_ip_failover_script status:
Fri Oct 9 15:48:05 2020 - [info] /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --command=status --ssh_user=root --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358
Fri Oct 9 15:48:05 2020 - [info] OK.
Fri Oct 9 15:48:05 2020 - [warning] shutdown_script is not defined.
Fri Oct 9 15:48:05 2020 - [info] Set master ping interval 3 seconds.
Fri Oct 9 15:48:05 2020 - [info] Set secondary check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12
Fri Oct 9 15:48:05 2020 - [info] Starting ping health check on 172.16.120.10(172.16.120.10:3358)..
Fri Oct 9 15:48:05 2020 - [info] Ping(CONNECT) succeeded, waiting until MySQL doesn't respond..
Fri Oct 9 15:50:40 2020 - [warning] Got error on MySQL connect ping: DBI connect(';host=172.16.120.10;port=3358;mysql_connect_timeout=1','mha',...) failed: Can't connect to MySQL server on '172.16.120.10' (4) at /usr/local/share/perl5/MHA/HealthCheck.pm line 98.
2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 15:50:40 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12 --user=root --master_host=172.16.120.10 --master_ip=172.16.120.10 --master_port=3358 --master_user=mha --master_password=xxx --ping_type=CONNECT
Fri Oct 9 15:50:40 2020 - [info] Executing SSH check script: exit 0
Fri Oct 9 15:50:44 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 15:50:44 2020 - [warning] Connection failed 2 time(s)..
Fri Oct 9 15:50:45 2020 - [warning] HealthCheck: Got timeout on checking SSH connection to 172.16.120.10! at /usr/local/share/perl5/MHA/HealthCheck.pm line 344.
ssh: connect to host 172.16.120.11 port 22: Connection timed out
Monitoring server 172.16.120.11 is NOT reachable!
Fri Oct 9 15:50:45 2020 - [warning] At least one of monitoring servers is not reachable from this script. This is likely a network problem. Failover should not happen.
Fri Oct 9 15:50:47 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 15:50:47 2020 - [warning] Connection failed 3 time(s)..
Fri Oct 9 15:50:50 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 15:50:50 2020 - [warning] Connection failed 4 time(s)..
Fri Oct 9 15:50:50 2020 - [warning] Secondary network check script returned errors. Failover should not start so checking server status again. Check network settings for details.
Fri Oct 9 15:50:53 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 15:50:53 2020 - [warning] Connection failed 1 time(s)..
Fri Oct 9 15:50:53 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12 --user=root --master_host=172.16.120.10 --master_ip=172.16.120.10 --master_port=3358 --master_user=mha --master_password=xxx --ping_type=CONNECT
Fri Oct 9 15:50:53 2020 - [info] Executing SSH check script: exit 0
Fri Oct 9 15:50:56 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 15:50:56 2020 - [warning] Connection failed 2 time(s)..
Fri Oct 9 15:50:58 2020 - [warning] HealthCheck: Got timeout on checking SSH connection to 172.16.120.10! at /usr/local/share/perl5/MHA/HealthCheck.pm line 344.
ssh: connect to host 172.16.120.11 port 22: Connection timed out
Monitoring server 172.16.120.11 is NOT reachable!
Fri Oct 9 15:50:58 2020 - [warning] At least one of monitoring servers is not reachable from this script. This is likely a network problem. Failover should not happen.
Fri Oct 9 15:50:59 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 15:50:59 2020 - [warning] Connection failed 3 time(s)..
Fri Oct 9 15:51:02 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 15:51:02 2020 - [warning] Connection failed 4 time(s)..
Fri Oct 9 15:51:02 2020 - [warning] Secondary network check script returned errors. Failover should not start so checking server status again. Check network settings for details.
Fri Oct 9 15:51:05 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 15:51:05 2020 - [warning] Connection failed 1 time(s)..
Fri Oct 9 15:51:05 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12 --user=root --master_host=172.16.120.10 --master_ip=172.16.120.10 --master_port=3358 --master_user=mha --master_password=xxx --ping_type=CONNECT
Fri Oct 9 15:51:05 2020 - [info] Executing SSH check script: exit 0
Fri Oct 9 15:51:05 2020 - [info] Ping(CONNECT) succeeded, waiting until MySQL doesn't respond..
Fri Oct 9 15:51:05 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
Fri Oct 9 15:51:09 2020 - [warning] Got timeout on Secondary Check child process and killed it! at /usr/local/share/perl5/MHA/HealthCheck.pm line 435.
ssh: connect to host 172.16.120.11 port 22: Connection timed out
Monitoring server 172.16.120.11 is NOT reachable!
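
The two kinds of "Failover should not happen" warnings in these logs suggest how the secondary check result is interpreted. The sketch below models only what the log messages show; it is not MHA's code. The function name `secondary_check_verdict` and its two callback parameters are hypothetical, and stopping at the first inconclusive monitoring server is an assumption.

```python
from typing import Callable, Iterable


def secondary_check_verdict(
    monitors: Iterable[str],
    can_ssh: Callable[[str], bool],
    master_reachable_from: Callable[[str], bool],
) -> str:
    """Model the verdicts seen in manager.log (illustrative, not MHA's implementation)."""
    for host in monitors:
        if not can_ssh(host):
            # Log: "Monitoring server <host> is NOT reachable!"
            return f"no failover: monitoring server {host} is not reachable"
        if master_reachable_from(host):
            # Log: "Master is reachable from <host>!"
            return f"no failover: master is reachable from {host}"
    return "failover may start: master is unreachable from every monitoring server"


# This test case: the manager cannot ssh to slave-1, slave-2 is fine.
blocked = {"172.16.120.11"}
print(secondary_check_verdict(
    ["172.16.120.11", "172.16.120.12"],
    can_ssh=lambda h: h not in blocked,
    master_reachable_from=lambda h: True,
))
```

Under this model a single unreachable monitoring server is enough to veto failover, which matches the conclusion of this test.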

ping_type=INSERT

slave-1

IPTABLES="/sbin/iptables"
$IPTABLES -F
$IPTABLES -A INPUT -p icmp --icmp-type any -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.10 -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.11 -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.12 -j ACCEPT
$IPTABLES -A INPUT -p tcp --syn -j DROP

At this point the manager can no longer connect to slave-1: ping still succeeds, but ssh hangs.

#ping centos-2
PING centos-2 (172.16.120.11) 56(84) bytes of data.
64 bytes from centos-2 (172.16.120.11): icmp_seq=1 ttl=64 time=0.349 ms
64 bytes from centos-2 (172.16.120.11): icmp_seq=2 ttl=64 time=0.651 ms
^C
--- centos-2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.349/0.500/0.651/0.151 ms

[root@centos-4 15:48:55 /usr/local/share/perl5/MHA]
#ssh centos-2
^C

master

IPTABLES="/sbin/iptables"
$IPTABLES -F
$IPTABLES -A INPUT -p icmp --icmp-type any -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.10 -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.11 -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.12 -j ACCEPT
$IPTABLES -A INPUT -p tcp --syn -j DROP

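Because ping_type=INSERT reuses a long-lived connection, and the rules above only drop new TCP connections (SYN packets), nothing abnormal is observed at this point.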
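
Why an established connection survives these rules can be seen by modelling the INPUT chain as a first-match rule list. This is a simplified sketch, not how iptables is implemented: the `Packet` and `Rule` classes are hypothetical, `--syn` is reduced to a single boolean, and the chain policy is assumed to be ACCEPT because the script never sets it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Packet:
    proto: str         # "icmp" or "tcp"
    src: str           # source IP address
    syn: bool = False  # True only for the first packet of a new TCP connection


@dataclass(frozen=True)
class Rule:
    proto: str
    target: str                # "ACCEPT" or "DROP"
    src: Optional[str] = None  # None matches any source
    syn_only: bool = False     # models the --syn match

    def matches(self, p: Packet) -> bool:
        if p.proto != self.proto:
            return False
        if self.src is not None and p.src != self.src:
            return False
        if self.syn_only and not p.syn:
            return False
        return True


# The INPUT chain from the script above, in order.
RULES = [
    Rule("icmp", "ACCEPT"),
    Rule("tcp", "ACCEPT", src="172.16.120.10"),
    Rule("tcp", "ACCEPT", src="172.16.120.11"),
    Rule("tcp", "ACCEPT", src="172.16.120.12"),
    Rule("tcp", "DROP", syn_only=True),
]


def verdict(p: Packet, default_policy: str = "ACCEPT") -> str:
    """First matching rule wins; otherwise the chain policy applies (assumed ACCEPT)."""
    for rule in RULES:
        if rule.matches(p):
            return rule.target
    return default_policy


MANAGER = "172.16.120.13"
print(verdict(Packet("icmp", MANAGER)))                   # ping from the manager: ACCEPT
print(verdict(Packet("tcp", MANAGER, syn=True)))          # new ssh/MySQL connection: DROP
print(verdict(Packet("tcp", MANAGER, syn=False)))         # established connection: ACCEPT
print(verdict(Packet("tcp", "172.16.120.11", syn=True)))  # new connection from slave-1: ACCEPT
```

The third case is the INSERT ping: its packets belong to a connection opened before the rules were applied, so none of them is a SYN and the DROP rule never matches.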

Kill the manager's long-lived connection (the mha session from 172.16.120.13):

root@localhost 15:39:45 [dbms_monitor]> show processlist;
+------+----------+---------------------+--------------------+------------------+-------+---------------------------------------------------------------+------------------+-----------+---------------+
| Id | User | Host | db | Command | Time | State | Info | Rows_sent | Rows_examined |
+------+----------+---------------------+--------------------+------------------+-------+---------------------------------------------------------------+------------------+-----------+---------------+
| 1248 | proxysql | 172.16.120.10:58672 | NULL | Sleep | 3 | | NULL | 1 | 0 |
| 1249 | proxysql | 172.16.120.11:59582 | NULL | Sleep | 1 | | NULL | 0 | 0 |
| 1250 | proxysql | 172.16.120.12:56530 | NULL | Sleep | 8 | | NULL | 1 | 0 |
| 1254 | proxysql | 172.16.120.11:59592 | NULL | Sleep | 11 | | NULL | 0 | 0 |
| 1264 | proxysql | 172.16.120.12:56552 | NULL | Sleep | 8 | | NULL | 0 | 0 |
| 1341 | mha | 172.16.120.12:56698 | information_schema | Sleep | 89 | | NULL | 0 | 0 |
| 1343 | mha | 172.16.120.11:59758 | information_schema | Sleep | 26 | | NULL | 0 | 0 |
| 1452 | proxysql | 172.16.120.10:59046 | NULL | Sleep | 3 | | NULL | 0 | 0 |
| 2409 | repler | 172.16.120.11:32918 | NULL | Binlog Dump GTID | 11837 | Master has sent all binlog to slave; waiting for more updates | NULL | 0 | 0 |
| 2411 | repler | 172.16.120.12:58084 | NULL | Binlog Dump GTID | 11812 | Master has sent all binlog to slave; waiting for more updates | NULL | 0 | 0 |
| 2627 | root | localhost | dbms_monitor | Query | 0 | starting | show processlist | 0 | 0 |
| 2953 | mha | 172.16.120.13:36174 | NULL | Sleep | 0 | | NULL | 0 | 0 |
+------+----------+---------------------+--------------------+------------------+-------+---------------------------------------------------------------+------------------+-----------+---------------+
12 rows in set (0.00 sec)

root@localhost 15:55:18 [dbms_monitor]> kill 2953;
Query OK, 0 rows affected (0.00 sec)

Conclusion: failover is not triggered.

Fri Oct  9 15:52:43 2020 - [info] MHA::MasterMonitor version 0.58.
Fri Oct 9 15:52:44 2020 - [info] GTID failover mode = 1
Fri Oct 9 15:52:44 2020 - [info] Dead Servers:
Fri Oct 9 15:52:44 2020 - [info] Alive Servers:
Fri Oct 9 15:52:44 2020 - [info] 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 15:52:44 2020 - [info] 172.16.120.11(172.16.120.11:3358)
Fri Oct 9 15:52:44 2020 - [info] 172.16.120.12(172.16.120.12:3358)
Fri Oct 9 15:52:44 2020 - [info] Alive Slaves:
Fri Oct 9 15:52:44 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct 9 15:52:44 2020 - [info] GTID ON
Fri Oct 9 15:52:44 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 15:52:44 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Oct 9 15:52:44 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct 9 15:52:44 2020 - [info] GTID ON
Fri Oct 9 15:52:44 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 15:52:44 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Oct 9 15:52:44 2020 - [info] Current Alive Master: 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 15:52:44 2020 - [info] Checking slave configurations..
Fri Oct 9 15:52:44 2020 - [info] Checking replication filtering settings..
Fri Oct 9 15:52:44 2020 - [info] binlog_do_db= , binlog_ignore_db=
Fri Oct 9 15:52:44 2020 - [info] Replication filtering check ok.
Fri Oct 9 15:52:44 2020 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Fri Oct 9 15:52:44 2020 - [info] Checking SSH publickey authentication settings on the current master..
Fri Oct 9 15:52:45 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
Fri Oct 9 15:52:45 2020 - [info]
172.16.120.10(172.16.120.10:3358) (current master)
+--172.16.120.11(172.16.120.11:3358)
+--172.16.120.12(172.16.120.12:3358)

Fri Oct 9 15:52:45 2020 - [info] Checking master_ip_failover_script status:
Fri Oct 9 15:52:45 2020 - [info] /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --command=status --ssh_user=root --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358
Fri Oct 9 15:52:45 2020 - [info] OK.
Fri Oct 9 15:52:45 2020 - [warning] shutdown_script is not defined.
Fri Oct 9 15:52:45 2020 - [info] Set master ping interval 3 seconds.
Fri Oct 9 15:52:45 2020 - [info] Set secondary check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12
Fri Oct 9 15:52:45 2020 - [info] Starting ping health check on 172.16.120.10(172.16.120.10:3358)..
Fri Oct 9 15:52:45 2020 - [info] Ping(INSERT) succeeded, waiting until MySQL doesn't respond..
Fri Oct 9 15:55:24 2020 - [warning] Got error on MySQL insert ping: 2006 (MySQL server has gone away)
Fri Oct 9 15:55:24 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12 --user=root --master_host=172.16.120.10 --master_ip=172.16.120.10 --master_port=3358 --master_user=mha --master_password=xxx --ping_type=INSERT
Fri Oct 9 15:55:24 2020 - [info] Executing SSH check script: exit 0
Fri Oct 9 15:55:29 2020 - [warning] HealthCheck: Got timeout on checking SSH connection to 172.16.120.10! at /usr/local/share/perl5/MHA/HealthCheck.pm line 344.
ssh: connect to host 172.16.120.11 port 22: Connection timed out
Monitoring server 172.16.120.11 is NOT reachable!
Fri Oct 9 15:55:29 2020 - [warning] At least one of monitoring servers is not reachable from this script. This is likely a network problem. Failover should not happen.
Fri Oct 9 15:55:30 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 15:55:30 2020 - [warning] Connection failed 2 time(s)..
Fri Oct 9 15:55:33 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 15:55:33 2020 - [warning] Connection failed 3 time(s)..
Fri Oct 9 15:55:36 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 15:55:36 2020 - [warning] Connection failed 4 time(s)..
Fri Oct 9 15:55:36 2020 - [warning] Secondary network check script returned errors. Failover should not start so checking server status again. Check network settings for details.
Fri Oct 9 15:55:39 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 15:55:39 2020 - [warning] Connection failed 1 time(s)..
Fri Oct 9 15:55:39 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12 --user=root --master_host=172.16.120.10 --master_ip=172.16.120.10 --master_port=3358 --master_user=mha --master_password=xxx --ping_type=INSERT
Fri Oct 9 15:55:39 2020 - [info] Executing SSH check script: exit 0
Fri Oct 9 15:55:42 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 15:55:42 2020 - [warning] Connection failed 2 time(s)..
Fri Oct 9 15:55:44 2020 - [warning] HealthCheck: Got timeout on checking SSH connection to 172.16.120.10! at /usr/local/share/perl5/MHA/HealthCheck.pm line 344.
ssh: connect to host 172.16.120.11 port 22: Connection timed out
Monitoring server 172.16.120.11 is NOT reachable!
Fri Oct 9 15:55:44 2020 - [warning] At least one of monitoring servers is not reachable from this script. This is likely a network problem. Failover should not happen.
Fri Oct 9 15:55:45 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 15:55:45 2020 - [warning] Connection failed 3 time(s)..
Fri Oct 9 15:55:48 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 15:55:48 2020 - [warning] Connection failed 4 time(s)..
Fri Oct 9 15:55:48 2020 - [warning] Secondary network check script returned errors. Failover should not start so checking server status again. Check network settings for details.
Fri Oct 9 15:55:51 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 15:55:51 2020 - [warning] Connection failed 1 time(s)..
Fri Oct 9 15:55:51 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12 --user=root --master_host=172.16.120.10 --master_ip=172.16.120.10 --master_port=3358 --master_user=mha --master_password=xxx --ping_type=INSERT
Fri Oct 9 15:55:51 2020 - [info] Executing SSH check script: exit 0
Fri Oct 9 15:55:54 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 15:55:54 2020 - [warning] Connection failed 2 time(s)..
Fri Oct 9 15:55:56 2020 - [warning] HealthCheck: Got timeout on checking SSH connection to 172.16.120.10! at /usr/local/share/perl5/MHA/HealthCheck.pm line 344.
ssh: connect to host 172.16.120.11 port 22: Connection timed out
Monitoring server 172.16.120.11 is NOT reachable!
Fri Oct 9 15:55:56 2020 - [warning] At least one of monitoring servers is not reachable from this script. This is likely a network problem. Failover should not happen.
Fri Oct 9 15:55:57 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 15:55:57 2020 - [warning] Connection failed 3 time(s)..
Fri Oct 9 15:56:00 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 15:56:00 2020 - [warning] Connection failed 4 time(s)..
Fri Oct 9 15:56:00 2020 - [warning] Secondary network check script returned errors. Failover should not start so checking server status again. Check network settings for details.
Fri Oct 9 15:56:03 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 15:56:03 2020 - [warning] Connection failed 1 time(s)..
Fri Oct 9 15:56:03 2020 - [info] Executing SSH check script: exit 0
Fri Oct 9 15:56:03 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12 --user=root --master_host=172.16.120.10 --master_ip=172.16.120.10 --master_port=3358 --master_user=mha --master_password=xxx --ping_type=INSERT
Fri Oct 9 15:56:03 2020 - [info] Ping(INSERT) succeeded, waiting until MySQL doesn't respond..
Fri Oct 9 15:56:03 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
Master is reachable from 172.16.120.11!
Fri Oct 9 15:56:03 2020 - [warning] Master is reachable from at least one of other monitoring servers. Failover should not happen.
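
Logs like the one above get long, and the secondary-check outcome is the part that decides whether failover starts. A small helper can pull just those lines out of manager.log. This is a sketch: the two patterns are copied from the messages in this log, and the function name `summarize_secondary_checks` is hypothetical.

```python
import re
from collections import Counter

# Patterns copied from the secondary-check lines in the log above.
NOT_REACHABLE = re.compile(r"^Monitoring server (\S+) is NOT reachable!")
MASTER_REACHABLE = re.compile(r"^Master is reachable from (\S+)!")


def summarize_secondary_checks(lines):
    """Count secondary-check outcomes per monitoring server."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        m = NOT_REACHABLE.match(line)
        if m:
            counts[(m.group(1), "monitor not reachable")] += 1
            continue
        m = MASTER_REACHABLE.match(line)
        if m:
            counts[(m.group(1), "master reachable")] += 1
    return counts


sample = [
    "ssh: connect to host 172.16.120.11 port 22: Connection timed out",
    "Monitoring server 172.16.120.11 is NOT reachable!",
    "Master is reachable from 172.16.120.11!",
]
for (host, outcome), n in summarize_secondary_checks(sample).items():
    print(host, outcome, n)
```

In the excerpt above, slave-1 is reported as not reachable three times before the final "Master is reachable" line, and both outcomes end in "Failover should not happen".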

[Test case] Network failure between master and MHA manager, scenario 3

Manager <-- unreachable --> Master
Manager <-- unreachable --> S1 <-- OK --> master
Manager <-- unreachable --> S2 <-- OK --> master

ping_type=CONNECT

slave-1, slave-2

IPTABLES="/sbin/iptables"
$IPTABLES -F
$IPTABLES -A INPUT -p icmp --icmp-type any -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.10 -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.11 -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.12 -j ACCEPT
$IPTABLES -A INPUT -p tcp --syn -j DROP

At this point the manager can no longer connect to slave-1 or slave-2: ping still succeeds, but ssh hangs.

#ping centos-2
PING centos-2 (172.16.120.11) 56(84) bytes of data.
64 bytes from centos-2 (172.16.120.11): icmp_seq=1 ttl=64 time=0.442 ms
64 bytes from centos-2 (172.16.120.11): icmp_seq=2 ttl=64 time=0.441 ms
^C
--- centos-2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.441/0.441/0.442/0.021 ms

[root@centos-4 16:44:27 /usr/local/share/perl5/MHA]
#ssh centos-2
^C

[root@centos-4 16:44:30 /usr/local/share/perl5/MHA]
#ping centos-3
PING centos-3 (172.16.120.12) 56(84) bytes of data.
64 bytes from centos-3 (172.16.120.12): icmp_seq=1 ttl=64 time=0.335 ms
64 bytes from centos-3 (172.16.120.12): icmp_seq=2 ttl=64 time=0.575 ms
^C
--- centos-3 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.335/0.455/0.575/0.120 ms

[root@centos-4 16:44:34 /usr/local/share/perl5/MHA]
#ssh centos-3
^C

master

IPTABLES="/sbin/iptables"
$IPTABLES -F
$IPTABLES -A INPUT -p icmp --icmp-type any -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.10 -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.11 -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.12 -j ACCEPT
$IPTABLES -A INPUT -p tcp --syn -j DROP

Conclusion: failover is not triggered.

Fri Oct  9 16:43:25 2020 - [info] MHA::MasterMonitor version 0.58.
Fri Oct 9 16:43:26 2020 - [info] GTID failover mode = 1
Fri Oct 9 16:43:26 2020 - [info] Dead Servers:
Fri Oct 9 16:43:26 2020 - [info] Alive Servers:
Fri Oct 9 16:43:26 2020 - [info] 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 16:43:26 2020 - [info] 172.16.120.11(172.16.120.11:3358)
Fri Oct 9 16:43:26 2020 - [info] 172.16.120.12(172.16.120.12:3358)
Fri Oct 9 16:43:26 2020 - [info] Alive Slaves:
Fri Oct 9 16:43:26 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct 9 16:43:26 2020 - [info] GTID ON
Fri Oct 9 16:43:26 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 16:43:26 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Oct 9 16:43:26 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct 9 16:43:26 2020 - [info] GTID ON
Fri Oct 9 16:43:26 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 16:43:26 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Oct 9 16:43:26 2020 - [info] Current Alive Master: 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 16:43:26 2020 - [info] Checking slave configurations..
Fri Oct 9 16:43:26 2020 - [info] Checking replication filtering settings..
Fri Oct 9 16:43:26 2020 - [info] binlog_do_db= , binlog_ignore_db=
Fri Oct 9 16:43:26 2020 - [info] Replication filtering check ok.
Fri Oct 9 16:43:26 2020 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Fri Oct 9 16:43:26 2020 - [info] Checking SSH publickey authentication settings on the current master..
Fri Oct 9 16:43:26 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
Fri Oct 9 16:43:26 2020 - [info]
172.16.120.10(172.16.120.10:3358) (current master)
+--172.16.120.11(172.16.120.11:3358)
+--172.16.120.12(172.16.120.12:3358)

Fri Oct 9 16:43:26 2020 - [info] Checking master_ip_failover_script status:
Fri Oct 9 16:43:26 2020 - [info] /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --command=status --ssh_user=root --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358
Fri Oct 9 16:43:26 2020 - [info] OK.
Fri Oct 9 16:43:26 2020 - [warning] shutdown_script is not defined.
Fri Oct 9 16:43:26 2020 - [info] Set master ping interval 3 seconds.
Fri Oct 9 16:43:26 2020 - [info] Set secondary check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12
Fri Oct 9 16:43:26 2020 - [info] Starting ping health check on 172.16.120.10(172.16.120.10:3358)..
Fri Oct 9 16:43:26 2020 - [info] Ping(CONNECT) succeeded, waiting until MySQL doesn't respond..
Fri Oct 9 16:45:55 2020 - [warning] Got error on MySQL connect ping: DBI connect(';host=172.16.120.10;port=3358;mysql_connect_timeout=1','mha',...) failed: Can't connect to MySQL server on '172.16.120.10' (4) at /usr/local/share/perl5/MHA/HealthCheck.pm line 98.
2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 16:45:55 2020 - [info] Executing SSH check script: exit 0
Fri Oct 9 16:45:55 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12 --user=root --master_host=172.16.120.10 --master_ip=172.16.120.10 --master_port=3358 --master_user=mha --master_password=xxx --ping_type=CONNECT
Fri Oct 9 16:45:59 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 16:45:59 2020 - [warning] Connection failed 2 time(s)..
Fri Oct 9 16:46:00 2020 - [warning] HealthCheck: Got timeout on checking SSH connection to 172.16.120.10! at /usr/local/share/perl5/MHA/HealthCheck.pm line 344.
ssh: connect to host 172.16.120.11 port 22: Connection timed out
Monitoring server 172.16.120.11 is NOT reachable!
Fri Oct 9 16:46:00 2020 - [warning] At least one of monitoring servers is not reachable from this script. This is likely a network problem. Failover should not happen.
Fri Oct 9 16:46:02 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 16:46:02 2020 - [warning] Connection failed 3 time(s)..
Fri Oct 9 16:46:05 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 16:46:05 2020 - [warning] Connection failed 4 time(s)..
Fri Oct 9 16:46:05 2020 - [warning] Secondary network check script returned errors. Failover should not start so checking server status again. Check network settings for details.
Fri Oct 9 16:46:08 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 16:46:08 2020 - [warning] Connection failed 1 time(s)..
Fri Oct 9 16:46:08 2020 - [info] Executing SSH check script: exit 0
Fri Oct 9 16:46:08 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12 --user=root --master_host=172.16.120.10 --master_ip=172.16.120.10 --master_port=3358 --master_user=mha --master_password=xxx --ping_type=CONNECT
Fri Oct 9 16:46:11 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 16:46:11 2020 - [warning] Connection failed 2 time(s)..
Fri Oct 9 16:46:13 2020 - [warning] HealthCheck: Got timeout on checking SSH connection to 172.16.120.10! at /usr/local/share/perl5/MHA/HealthCheck.pm line 344.
ssh: connect to host 172.16.120.11 port 22: Connection timed out
Monitoring server 172.16.120.11 is NOT reachable!
Fri Oct 9 16:46:13 2020 - [warning] At least one of monitoring servers is not reachable from this script. This is likely a network problem. Failover should not happen.
Fri Oct 9 16:46:14 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 16:46:14 2020 - [warning] Connection failed 3 time(s)..
Fri Oct 9 16:46:15 2020 - [info] Ping(CONNECT) succeeded, waiting until MySQL doesn't respond..

ping_type=INSERT

slave-1, slave-2

IPTABLES="/sbin/iptables"
$IPTABLES -F
$IPTABLES -A INPUT -p icmp --icmp-type any -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.10 -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.11 -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.12 -j ACCEPT
$IPTABLES -A INPUT -p tcp --syn -j DROP

At this point the manager can no longer connect to slave-1 or slave-2: ping still succeeds, but ssh hangs.

#ping centos-2
PING centos-2 (172.16.120.11) 56(84) bytes of data.
64 bytes from centos-2 (172.16.120.11): icmp_seq=1 ttl=64 time=0.352 ms
^C
--- centos-2 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.352/0.352/0.352/0.000 ms

[root@centos-4 17:52:38 ~]
#ping centos-3
PING centos-3 (172.16.120.12) 56(84) bytes of data.
64 bytes from centos-3 (172.16.120.12): icmp_seq=1 ttl=64 time=0.221 ms
^C
--- centos-3 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.221/0.221/0.221/0.000 ms

[root@centos-4 17:52:41 ~]
#ssh centos-2
^C

[root@centos-4 17:52:44 ~]
#ssh centos-3

master

IPTABLES="/sbin/iptables"
$IPTABLES -F
$IPTABLES -A INPUT -p icmp --icmp-type any -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.10 -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.11 -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.12 -j ACCEPT
$IPTABLES -A INPUT -p tcp --syn -j DROP

Because ping_type=INSERT reuses a long-lived connection, nothing abnormal is observed at this point.

Kill the manager's long-lived connection (the mha session from 172.16.120.13):

root@localhost 17:48:11 [dbms_monitor]> show processlist;
+------+----------+---------------------+--------------------+------------------+-------+---------------------------------------------------------------+------------------+-----------+---------------+
| Id | User | Host | db | Command | Time | State | Info | Rows_sent | Rows_examined |
+------+----------+---------------------+--------------------+------------------+-------+---------------------------------------------------------------+------------------+-----------+---------------+
| 1248 | proxysql | 172.16.120.10:58672 | NULL | Sleep | 14 | | NULL | 0 | 0 |
| 1264 | proxysql | 172.16.120.12:56552 | NULL | Sleep | 4 | | NULL | 0 | 0 |
| 1343 | mha | 172.16.120.11:59758 | information_schema | Sleep | 32 | | NULL | 0 | 0 |
| 1452 | proxysql | 172.16.120.10:59046 | NULL | Sleep | 4 | | NULL | 1 | 0 |
| 2409 | repler | 172.16.120.11:32918 | NULL | Binlog Dump GTID | 18936 | Master has sent all binlog to slave; waiting for more updates | NULL | 0 | 0 |
| 2411 | repler | 172.16.120.12:58084 | NULL | Binlog Dump GTID | 18911 | Master has sent all binlog to slave; waiting for more updates | NULL | 0 | 0 |
| 3192 | proxysql | 172.16.120.11:33786 | NULL | Sleep | 4 | | NULL | 0 | 0 |
| 3238 | proxysql | 172.16.120.12:59006 | NULL | Sleep | 4 | | NULL | 1 | 0 |
| 3245 | proxysql | 172.16.120.11:33888 | NULL | Sleep | 14 | | NULL | 0 | 0 |
| 3262 | root | localhost | dbms_monitor | Query | 0 | starting | show processlist | 0 | 0 |
| 3268 | mha | 172.16.120.13:36868 | NULL | Sleep | 2 | | NULL | 0 | 0 |
+------+----------+---------------------+--------------------+------------------+-------+---------------------------------------------------------------+------------------+-----------+---------------+
11 rows in set (0.00 sec)

root@localhost 17:53:37 [dbms_monitor]> kill 3268;
Query OK, 0 rows affected (0.00 sec)

Conclusion: failover is not triggered.

Fri Oct  9 17:50:48 2020 - [info] MHA::MasterMonitor version 0.58.
Fri Oct 9 17:50:49 2020 - [info] GTID failover mode = 1
Fri Oct 9 17:50:49 2020 - [info] Dead Servers:
Fri Oct 9 17:50:49 2020 - [info] Alive Servers:
Fri Oct 9 17:50:49 2020 - [info] 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 17:50:49 2020 - [info] 172.16.120.11(172.16.120.11:3358)
Fri Oct 9 17:50:49 2020 - [info] 172.16.120.12(172.16.120.12:3358)
Fri Oct 9 17:50:49 2020 - [info] Alive Slaves:
Fri Oct 9 17:50:49 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct 9 17:50:49 2020 - [info] GTID ON
Fri Oct 9 17:50:49 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 17:50:49 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Oct 9 17:50:49 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct 9 17:50:49 2020 - [info] GTID ON
Fri Oct 9 17:50:49 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 17:50:49 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Oct 9 17:50:49 2020 - [info] Current Alive Master: 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 17:50:49 2020 - [info] Checking slave configurations..
Fri Oct 9 17:50:49 2020 - [info] Checking replication filtering settings..
Fri Oct 9 17:50:49 2020 - [info] binlog_do_db= , binlog_ignore_db=
Fri Oct 9 17:50:49 2020 - [info] Replication filtering check ok.
Fri Oct 9 17:50:49 2020 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Fri Oct 9 17:50:49 2020 - [info] Checking SSH publickey authentication settings on the current master..
Fri Oct 9 17:50:49 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
Fri Oct 9 17:50:49 2020 - [info]
172.16.120.10(172.16.120.10:3358) (current master)
+--172.16.120.11(172.16.120.11:3358)
+--172.16.120.12(172.16.120.12:3358)

Fri Oct 9 17:50:49 2020 - [info] Checking master_ip_failover_script status:
Fri Oct 9 17:50:49 2020 - [info] /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --command=status --ssh_user=root --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358
Fri Oct 9 17:50:49 2020 - [info] OK.
Fri Oct 9 17:50:49 2020 - [warning] shutdown_script is not defined.
Fri Oct 9 17:50:49 2020 - [info] Set master ping interval 3 seconds.
Fri Oct 9 17:50:49 2020 - [info] Set secondary check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12
Fri Oct 9 17:50:49 2020 - [info] Starting ping health check on 172.16.120.10(172.16.120.10:3358)..
Fri Oct 9 17:50:49 2020 - [info] Ping(INSERT) succeeded, waiting until MySQL doesn't respond..
Fri Oct 9 17:53:43 2020 - [warning] Got error on MySQL insert ping: 2006 (MySQL server has gone away)
Fri Oct 9 17:53:43 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12 --user=root --master_host=172.16.120.10 --master_ip=172.16.120.10 --master_port=3358 --master_user=mha --master_password=xxx --ping_type=INSERT
Fri Oct 9 17:53:43 2020 - [info] Executing SSH check script: exit 0
Fri Oct 9 17:53:48 2020 - [warning] HealthCheck: Got timeout on checking SSH connection to 172.16.120.10! at /usr/local/share/perl5/MHA/HealthCheck.pm line 344.
ssh: connect to host 172.16.120.11 port 22: Connection timed out
Monitoring server 172.16.120.11 is NOT reachable!
Fri Oct 9 17:53:48 2020 - [warning] At least one of monitoring servers is not reachable from this script. This is likely a network problem. Failover should not happen.
Fri Oct 9 17:53:49 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 17:53:49 2020 - [warning] Connection failed 2 time(s)..
Fri Oct 9 17:53:52 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 17:53:52 2020 - [warning] Connection failed 3 time(s)..
Fri Oct 9 17:53:55 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 17:53:55 2020 - [warning] Connection failed 4 time(s)..
Fri Oct 9 17:53:55 2020 - [warning] Secondary network check script returned errors. Failover should not start so checking server status again. Check network settings for details.
Fri Oct 9 17:53:58 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 17:53:58 2020 - [warning] Connection failed 1 time(s)..
Fri Oct 9 17:53:58 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12 --user=root --master_host=172.16.120.10 --master_ip=172.16.120.10 --master_port=3358 --master_user=mha --master_password=xxx --ping_type=INSERT
Fri Oct 9 17:53:58 2020 - [info] Executing SSH check script: exit 0
Fri Oct 9 17:54:01 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 17:54:01 2020 - [warning] Connection failed 2 time(s)..
Fri Oct 9 17:54:03 2020 - [warning] HealthCheck: Got timeout on checking SSH connection to 172.16.120.10! at /usr/local/share/perl5/MHA/HealthCheck.pm line 344.
ssh: connect to host 172.16.120.11 port 22: Connection timed out
Monitoring server 172.16.120.11 is NOT reachable!
Fri Oct 9 17:54:03 2020 - [warning] At least one of monitoring servers is not reachable from this script. This is likely a network problem. Failover should not happen.
Fri Oct 9 17:54:04 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 17:54:04 2020 - [warning] Connection failed 3 time(s)..
Fri Oct 9 17:54:07 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 17:54:07 2020 - [warning] Connection failed 4 time(s)..
Fri Oct 9 17:54:07 2020 - [warning] Secondary network check script returned errors. Failover should not start so checking server status again. Check network settings for details.
Fri Oct 9 17:54:10 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 17:54:10 2020 - [warning] Connection failed 1 time(s)..
Fri Oct 9 17:54:10 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12 --user=root --master_host=172.16.120.10 --master_ip=172.16.120.10 --master_port=3358 --master_user=mha --master_password=xxx --ping_type=INSERT
Fri Oct 9 17:54:10 2020 - [info] Executing SSH check script: exit 0
Fri Oct 9 17:54:13 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 17:54:13 2020 - [warning] Connection failed 2 time(s)..
Fri Oct 9 17:54:15 2020 - [warning] HealthCheck: Got timeout on checking SSH connection to 172.16.120.10! at /usr/local/share/perl5/MHA/HealthCheck.pm line 344.
ssh: connect to host 172.16.120.11 port 22: Connection timed out
Monitoring server 172.16.120.11 is NOT reachable!
Fri Oct 9 17:54:15 2020 - [warning] At least one of monitoring servers is not reachable from this script. This is likely a network problem. Failover should not happen.
Fri Oct 9 17:54:16 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 17:54:16 2020 - [warning] Connection failed 3 time(s)..
Fri Oct 9 17:54:16 2020 - [info] Ping(INSERT) succeeded, waiting until MySQL doesn't respond..

[Test case] Network failure between master and MHA manager, case 4

Manager <-- unreachable --> Master
Manager <-- OK --> S1 <-- unreachable --> Master
Manager <-- OK --> S2 <-- OK --> Master
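
The three links above feed the verdict that masterha_secondary_check prints in the manager logs. As a rough sketch (Python is used only for illustration; the `Verdict` names and the `secondary_check` function are hypothetical and simply mirror the three messages that appear in the logs in this post, they are not MHA's Perl code), the logic can be modelled like this:

```python
from enum import Enum


class Verdict(Enum):
    # "At least one of monitoring servers is not reachable from this script."
    NO_FAILOVER_MONITOR_UNREACHABLE = "monitoring server unreachable"
    # "Master is reachable from at least one of other monitoring servers."
    NO_FAILOVER_MASTER_ALIVE = "master reachable from a monitoring server"
    # "Master is not reachable from all other monitoring servers."
    FAILOVER = "master unreachable from all monitoring servers"


def secondary_check(routes):
    """Model of the verdict, assuming servers are checked in order.

    routes maps each monitoring server to a pair:
    (manager can SSH to the server, the server can reach the master).
    """
    for ssh_ok, master_ok in routes.values():
        if not ssh_ok:
            return Verdict.NO_FAILOVER_MONITOR_UNREACHABLE
        if master_ok:
            return Verdict.NO_FAILOVER_MASTER_ALIVE
    return Verdict.FAILOVER


# This case: S1 cannot reach the master, S2 still can.
case4 = {"172.16.120.11": (True, False), "172.16.120.12": (True, True)}
print(secondary_check(case4).name)
```

With S1 cut off from the master but S2 still connected, the model returns `NO_FAILOVER_MASTER_ALIVE`, which corresponds to the "Master is reachable from at least one of other monitoring servers. Failover should not happen." warning in the logs.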

ping_type=CONNECT

master

IPTABLES="/sbin/iptables"
$IPTABLES -F
$IPTABLES -A INPUT -p icmp --icmp-type any -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.10 -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.12 -j ACCEPT
$IPTABLES -A INPUT -p tcp --syn -j DROP

Conclusion: failover is not triggered.
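
The manager log for this case repeats the same cycle: the failure counter climbs from 1 to 4 at 3-second intervals, the secondary check result is then consulted, and because it does not confirm that the master is dead the counter starts again from 1. The sketch below is a simplified model of that cycle only. The function name `monitor` and the threshold of 4 are read off the log, not taken from MHA's source or documentation, and in the real log the secondary check is launched on the first failure rather than on the fourth.

```python
def monitor(ping_results, secondary_says_master_dead, max_failures=4):
    """Simplified model of the ping/retry cycle seen in the manager log.

    ping_results: iterable of booleans, one per ping_interval tick.
    secondary_says_master_dead: callable consulted once per failure cycle.
    Returns "failover" or "still monitoring".
    """
    failures = 0
    for ok in ping_results:
        if ok:
            failures = 0  # "Ping(...) succeeded": back to normal monitoring
            continue
        failures += 1     # "Connection failed N time(s).."
        if failures == max_failures:
            if secondary_says_master_dead():
                return "failover"
            failures = 0  # "checking server status again": new cycle
    return "still monitoring"


# Eight failed pings while a slave can still reach the master: two full
# cycles, no failover.
print(monitor([True] + [False] * 8 + [True], lambda: False))
```

The point of the model is that a failed ping from the manager alone never starts a failover here; the secondary check has to agree at the end of a cycle.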

Fri Oct  9 16:05:55 2020 - [info] MHA::MasterMonitor version 0.58.
Fri Oct 9 16:05:56 2020 - [info] GTID failover mode = 1
Fri Oct 9 16:05:56 2020 - [info] Dead Servers:
Fri Oct 9 16:05:56 2020 - [info] Alive Servers:
Fri Oct 9 16:05:56 2020 - [info] 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 16:05:56 2020 - [info] 172.16.120.11(172.16.120.11:3358)
Fri Oct 9 16:05:56 2020 - [info] 172.16.120.12(172.16.120.12:3358)
Fri Oct 9 16:05:56 2020 - [info] Alive Slaves:
Fri Oct 9 16:05:56 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct 9 16:05:56 2020 - [info] GTID ON
Fri Oct 9 16:05:56 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 16:05:56 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Oct 9 16:05:56 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct 9 16:05:56 2020 - [info] GTID ON
Fri Oct 9 16:05:56 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 16:05:56 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Oct 9 16:05:56 2020 - [info] Current Alive Master: 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 16:05:56 2020 - [info] Checking slave configurations..
Fri Oct 9 16:05:56 2020 - [info] Checking replication filtering settings..
Fri Oct 9 16:05:56 2020 - [info] binlog_do_db= , binlog_ignore_db=
Fri Oct 9 16:05:56 2020 - [info] Replication filtering check ok.
Fri Oct 9 16:05:56 2020 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Fri Oct 9 16:05:56 2020 - [info] Checking SSH publickey authentication settings on the current master..
Fri Oct 9 16:05:56 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
Fri Oct 9 16:05:56 2020 - [info]
172.16.120.10(172.16.120.10:3358) (current master)
+--172.16.120.11(172.16.120.11:3358)
+--172.16.120.12(172.16.120.12:3358)

Fri Oct 9 16:05:56 2020 - [info] Checking master_ip_failover_script status:
Fri Oct 9 16:05:56 2020 - [info] /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --command=status --ssh_user=root --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358
Fri Oct 9 16:05:56 2020 - [info] OK.
Fri Oct 9 16:05:56 2020 - [warning] shutdown_script is not defined.
Fri Oct 9 16:05:56 2020 - [info] Set master ping interval 3 seconds.
Fri Oct 9 16:05:56 2020 - [info] Set secondary check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12
Fri Oct 9 16:05:56 2020 - [info] Starting ping health check on 172.16.120.10(172.16.120.10:3358)..
Fri Oct 9 16:05:56 2020 - [info] Ping(CONNECT) succeeded, waiting until MySQL doesn't respond..
Fri Oct 9 16:06:43 2020 - [warning] Got error on MySQL connect ping: DBI connect(';host=172.16.120.10;port=3358;mysql_connect_timeout=1','mha',...) failed: Can't connect to MySQL server on '172.16.120.10' (4) at /usr/local/share/perl5/MHA/HealthCheck.pm line 98.
2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 16:06:43 2020 - [info] Executing SSH check script: exit 0
Fri Oct 9 16:06:43 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12 --user=root --master_host=172.16.120.10 --master_ip=172.16.120.10 --master_port=3358 --master_user=mha --master_password=xxx --ping_type=CONNECT
Fri Oct 9 16:06:47 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 16:06:47 2020 - [warning] Connection failed 2 time(s)..
Fri Oct 9 16:06:48 2020 - [warning] HealthCheck: Got timeout on checking SSH connection to 172.16.120.10! at /usr/local/share/perl5/MHA/HealthCheck.pm line 344.
Monitoring server 172.16.120.11 is reachable, Master is not reachable from 172.16.120.11. OK.
Master is reachable from 172.16.120.12!
Fri Oct 9 16:06:48 2020 - [warning] Master is reachable from at least one of other monitoring servers. Failover should not happen.
Fri Oct 9 16:06:50 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 16:06:50 2020 - [warning] Connection failed 3 time(s)..
Fri Oct 9 16:06:53 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 16:06:53 2020 - [warning] Connection failed 4 time(s)..
Fri Oct 9 16:06:53 2020 - [warning] Secondary network check script returned errors. Failover should not start so checking server status again. Check network settings for details.
Fri Oct 9 16:06:56 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 16:06:56 2020 - [warning] Connection failed 1 time(s)..
Fri Oct 9 16:06:56 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12 --user=root --master_host=172.16.120.10 --master_ip=172.16.120.10 --master_port=3358 --master_user=mha --master_password=xxx --ping_type=CONNECT
Fri Oct 9 16:06:56 2020 - [info] Executing SSH check script: exit 0
Fri Oct 9 16:06:59 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 16:06:59 2020 - [warning] Connection failed 2 time(s)..
Fri Oct 9 16:07:01 2020 - [warning] HealthCheck: Got timeout on checking SSH connection to 172.16.120.10! at /usr/local/share/perl5/MHA/HealthCheck.pm line 344.
Monitoring server 172.16.120.11 is reachable, Master is not reachable from 172.16.120.11. OK.
Master is reachable from 172.16.120.12!
Fri Oct 9 16:07:01 2020 - [warning] Master is reachable from at least one of other monitoring servers. Failover should not happen.
Fri Oct 9 16:07:02 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 16:07:02 2020 - [warning] Connection failed 3 time(s)..
Fri Oct 9 16:07:05 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 16:07:05 2020 - [warning] Connection failed 4 time(s)..
Fri Oct 9 16:07:05 2020 - [warning] Secondary network check script returned errors. Failover should not start so checking server status again. Check network settings for details.
Fri Oct 9 16:07:05 2020 - [info] Ping(CONNECT) succeeded, waiting until MySQL doesn't respond..

ping_type=INSERT

master

IPTABLES="/sbin/iptables"
$IPTABLES -F
$IPTABLES -A INPUT -p icmp --icmp-type any -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.10 -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.12 -j ACCEPT
$IPTABLES -A INPUT -p tcp --syn -j DROP

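Because ping_type=INSERT keeps a persistent connection open, the manager notices nothing: the --syn rule only drops new connection attempts, and the already established connection keeps working.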
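
Why the persistent connection survives can be illustrated with a toy model of the INPUT chain above. This is only a sketch: `Packet`, `is_syn`, and `filter_input` are hypothetical names, and the model assumes the chain's default policy is ACCEPT (the script never sets one, and the observed behaviour is consistent with that). The `--syn` match is modelled as "SYN set, with ACK, RST, and FIN cleared".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    proto: str
    src: str
    flags: frozenset = frozenset()


def is_syn(pkt):
    # Models iptables --syn: SYN set, and ACK, RST, FIN all cleared.
    return (pkt.proto == "tcp" and "SYN" in pkt.flags
            and not (pkt.flags & {"ACK", "RST", "FIN"}))


# Rules in the same order as the script above (after the -F flush).
RULES = [
    (lambda p: p.proto == "icmp", "ACCEPT"),
    (lambda p: p.proto == "tcp" and p.src == "172.16.120.10", "ACCEPT"),
    (lambda p: p.proto == "tcp" and p.src == "172.16.120.12", "ACCEPT"),
    (is_syn, "DROP"),
]


def filter_input(pkt, default_policy="ACCEPT"):
    # First matching rule wins; otherwise the (assumed) default policy applies.
    for matches, target in RULES:
        if matches(pkt):
            return target
    return default_policy


manager = "172.16.120.13"
new_conn = Packet("tcp", manager, frozenset({"SYN"}))
established = Packet("tcp", manager, frozenset({"ACK", "PSH"}))
print(filter_input(new_conn), filter_input(established))
```

A new connection from the manager is dropped, while packets on its existing connection fall through to the default policy. That is why the connection has to be killed on the master before the INSERT ping starts failing.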

Kill the manager's connection:

root@localhost 15:55:23 [dbms_monitor]> show processlist;
+------+----------+---------------------+--------------------+------------------+-------+---------------------------------------------------------------+------------------+-----------+---------------+
| Id | User | Host | db | Command | Time | State | Info | Rows_sent | Rows_examined |
+------+----------+---------------------+--------------------+------------------+-------+---------------------------------------------------------------+------------------+-----------+---------------+
| 1248 | proxysql | 172.16.120.10:58672 | NULL | Sleep | 19 | | NULL | 1 | 0 |
| 1249 | proxysql | 172.16.120.11:59582 | NULL | Sleep | 7 | | NULL | 0 | 0 |
| 1250 | proxysql | 172.16.120.12:56530 | NULL | Sleep | 4 | | NULL | 0 | 0 |
| 1254 | proxysql | 172.16.120.11:59592 | NULL | Sleep | 27 | | NULL | 1 | 0 |
| 1264 | proxysql | 172.16.120.12:56552 | NULL | Sleep | 14 | | NULL | 1 | 0 |
| 1341 | mha | 172.16.120.12:56698 | information_schema | Sleep | 114 | | NULL | 0 | 0 |
| 1343 | mha | 172.16.120.11:59758 | information_schema | Sleep | 51 | | NULL | 0 | 0 |
| 1452 | proxysql | 172.16.120.10:59046 | NULL | Sleep | 9 | | NULL | 0 | 0 |
| 2409 | repler | 172.16.120.11:32918 | NULL | Binlog Dump GTID | 12703 | Master has sent all binlog to slave; waiting for more updates | NULL | 0 | 0 |
| 2411 | repler | 172.16.120.12:58084 | NULL | Binlog Dump GTID | 12678 | Master has sent all binlog to slave; waiting for more updates | NULL | 0 | 0 |
| 2627 | root | localhost | dbms_monitor | Query | 0 | starting | show processlist | 0 | 0 |
| 3022 | mha | 172.16.120.13:36466 | NULL | Sleep | 2 | | NULL | 0 | 0 |
+------+----------+---------------------+--------------------+------------------+-------+---------------------------------------------------------------+------------------+-----------+---------------+
12 rows in set (0.00 sec)

root@localhost 16:09:44 [dbms_monitor]> kill 3022;
Query OK, 0 rows affected (0.00 sec)

Conclusion: failover is not triggered.

Fri Oct  9 16:08:29 2020 - [info] MHA::MasterMonitor version 0.58.
Fri Oct 9 16:08:30 2020 - [info] GTID failover mode = 1
Fri Oct 9 16:08:30 2020 - [info] Dead Servers:
Fri Oct 9 16:08:30 2020 - [info] Alive Servers:
Fri Oct 9 16:08:30 2020 - [info] 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 16:08:30 2020 - [info] 172.16.120.11(172.16.120.11:3358)
Fri Oct 9 16:08:30 2020 - [info] 172.16.120.12(172.16.120.12:3358)
Fri Oct 9 16:08:30 2020 - [info] Alive Slaves:
Fri Oct 9 16:08:30 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct 9 16:08:30 2020 - [info] GTID ON
Fri Oct 9 16:08:30 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 16:08:30 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Oct 9 16:08:30 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct 9 16:08:30 2020 - [info] GTID ON
Fri Oct 9 16:08:30 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 16:08:30 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Oct 9 16:08:30 2020 - [info] Current Alive Master: 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 16:08:30 2020 - [info] Checking slave configurations..
Fri Oct 9 16:08:30 2020 - [info] Checking replication filtering settings..
Fri Oct 9 16:08:30 2020 - [info] binlog_do_db= , binlog_ignore_db=
Fri Oct 9 16:08:30 2020 - [info] Replication filtering check ok.
Fri Oct 9 16:08:30 2020 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Fri Oct 9 16:08:30 2020 - [info] Checking SSH publickey authentication settings on the current master..
Fri Oct 9 16:08:30 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
Fri Oct 9 16:08:30 2020 - [info]
172.16.120.10(172.16.120.10:3358) (current master)
+--172.16.120.11(172.16.120.11:3358)
+--172.16.120.12(172.16.120.12:3358)

Fri Oct 9 16:08:30 2020 - [info] Checking master_ip_failover_script status:
Fri Oct 9 16:08:30 2020 - [info] /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --command=status --ssh_user=root --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358
Fri Oct 9 16:08:30 2020 - [info] OK.
Fri Oct 9 16:08:30 2020 - [warning] shutdown_script is not defined.
Fri Oct 9 16:08:30 2020 - [info] Set master ping interval 3 seconds.
Fri Oct 9 16:08:30 2020 - [info] Set secondary check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12
Fri Oct 9 16:08:30 2020 - [info] Starting ping health check on 172.16.120.10(172.16.120.10:3358)..
Fri Oct 9 16:08:30 2020 - [info] Ping(INSERT) succeeded, waiting until MySQL doesn't respond..
Fri Oct 9 16:09:51 2020 - [warning] Got error on MySQL insert ping: 2006 (MySQL server has gone away)
Fri Oct 9 16:09:51 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12 --user=root --master_host=172.16.120.10 --master_ip=172.16.120.10 --master_port=3358 --master_user=mha --master_password=xxx --ping_type=INSERT
Fri Oct 9 16:09:51 2020 - [info] Executing SSH check script: exit 0
Fri Oct 9 16:09:56 2020 - [warning] HealthCheck: Got timeout on checking SSH connection to 172.16.120.10! at /usr/local/share/perl5/MHA/HealthCheck.pm line 344.
Monitoring server 172.16.120.11 is reachable, Master is not reachable from 172.16.120.11. OK.
Master is reachable from 172.16.120.12!
Fri Oct 9 16:09:57 2020 - [warning] Master is reachable from at least one of other monitoring servers. Failover should not happen.
Fri Oct 9 16:09:57 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 16:09:57 2020 - [warning] Connection failed 2 time(s)..
Fri Oct 9 16:10:00 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 16:10:00 2020 - [warning] Connection failed 3 time(s)..
Fri Oct 9 16:10:03 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 16:10:03 2020 - [warning] Connection failed 4 time(s)..
Fri Oct 9 16:10:03 2020 - [warning] Secondary network check script returned errors. Failover should not start so checking server status again. Check network settings for details.
Fri Oct 9 16:10:06 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 16:10:06 2020 - [warning] Connection failed 1 time(s)..
Fri Oct 9 16:10:06 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12 --user=root --master_host=172.16.120.10 --master_ip=172.16.120.10 --master_port=3358 --master_user=mha --master_password=xxx --ping_type=INSERT
Fri Oct 9 16:10:06 2020 - [info] Executing SSH check script: exit 0
Fri Oct 9 16:10:09 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 16:10:09 2020 - [warning] Connection failed 2 time(s)..
Fri Oct 9 16:10:11 2020 - [warning] HealthCheck: Got timeout on checking SSH connection to 172.16.120.10! at /usr/local/share/perl5/MHA/HealthCheck.pm line 344.
Monitoring server 172.16.120.11 is reachable, Master is not reachable from 172.16.120.11. OK.
Master is reachable from 172.16.120.12!
Fri Oct 9 16:10:12 2020 - [warning] Master is reachable from at least one of other monitoring servers. Failover should not happen.
Fri Oct 9 16:10:12 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 16:10:12 2020 - [warning] Connection failed 3 time(s)..
Fri Oct 9 16:10:15 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 16:10:15 2020 - [warning] Connection failed 4 time(s)..
Fri Oct 9 16:10:15 2020 - [warning] Secondary network check script returned errors. Failover should not start so checking server status again. Check network settings for details.
Fri Oct 9 16:10:18 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 16:10:18 2020 - [warning] Connection failed 1 time(s)..
Fri Oct 9 16:10:18 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12 --user=root --master_host=172.16.120.10 --master_ip=172.16.120.10 --master_port=3358 --master_user=mha --master_password=xxx --ping_type=INSERT
Fri Oct 9 16:10:18 2020 - [info] Executing SSH check script: exit 0
Fri Oct 9 16:10:21 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 16:10:21 2020 - [warning] Connection failed 2 time(s)..
Fri Oct 9 16:10:21 2020 - [info] Ping(INSERT) succeeded, waiting until MySQL doesn't respond..
Fri Oct 9 16:10:21 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
Master is reachable from 172.16.120.11!
Fri Oct 9 16:10:21 2020 - [warning] Master is reachable from at least one of other monitoring servers. Failover should not happen.

[Test case] Network failure between master and MHA manager, case 5

Manager <-- unreachable --> Master
Manager <-- OK --> S1 <-- unreachable --> Master
Manager <-- OK --> S2 <-- unreachable --> Master

ping_type=CONNECT

master

IPTABLES="/sbin/iptables"
$IPTABLES -F
$IPTABLES -A INPUT -p icmp --icmp-type any -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.10 -j ACCEPT
$IPTABLES -A INPUT -p tcp --syn -j DROP

Conclusion: failover is triggered.
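
How quickly the manager reached that decision can be read from the timestamps in the manager log for this case. The sketch below parses a few lines copied from that log and measures the time from the first ping error to the "Failover should start" message. The helper names (`parse`, `seconds_to_decision`) are made up for illustration, and the parser only assumes the `Day Mon D HH:MM:SS YYYY - [level] message` layout seen in this post.

```python
import re
from datetime import datetime

MONTHS = {m: i for i, m in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

LINE_RE = re.compile(
    r"^\w{3} (\w{3})\s+(\d{1,2}) (\d{2}):(\d{2}):(\d{2}) (\d{4})"
    r" - \[(\w+)\] (.*)$")


def parse(line):
    """Return (timestamp, level, message), or None for continuation lines."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    mon, day, hh, mm, ss, year, level, msg = m.groups()
    ts = datetime(int(year), MONTHS[mon], int(day), int(hh), int(mm), int(ss))
    return ts, level, msg


def seconds_to_decision(lines):
    """Seconds from the first MySQL ping error to the failover decision."""
    first_error = decision = None
    for line in lines:
        parsed = parse(line)
        if parsed is None:
            continue
        ts, _level, msg = parsed
        if first_error is None and msg.startswith("Got error on MySQL"):
            first_error = ts
        if msg.startswith("Master is not reachable from all other"):
            decision = ts
            break
    if first_error is None or decision is None:
        return None
    return int((decision - first_error).total_seconds())


SAMPLE = [
    "Fri Oct 9 18:22:07 2020 - [warning] Got error on MySQL connect ping: DBI connect(';host=172.16.120.10;port=3358;mysql_connect_timeout=1','mha',...) failed: Can't connect to MySQL server on '172.16.120.10' (4) at /usr/local/share/perl5/MHA/HealthCheck.pm line 98.",
    "2003 (Can't connect to MySQL server on '172.16.120.10' (4))",
    "Fri Oct 9 18:22:17 2020 - [warning] Connection failed 4 time(s)..",
    "Fri Oct 9 18:22:18 2020 - [info] Master is not reachable from all other monitoring servers. Failover should start.",
]
print(seconds_to_decision(SAMPLE))  # 11
```

In this run the decision came 11 seconds after the first failed ping, which fits four pings at ping_interval=3 plus the time the secondary check needed to finish.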

Fri Oct  9 18:21:13 2020 - [info] MHA::MasterMonitor version 0.58.
Fri Oct 9 18:21:14 2020 - [info] GTID failover mode = 1
Fri Oct 9 18:21:14 2020 - [info] Dead Servers:
Fri Oct 9 18:21:14 2020 - [info] Alive Servers:
Fri Oct 9 18:21:14 2020 - [info] 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 18:21:14 2020 - [info] 172.16.120.11(172.16.120.11:3358)
Fri Oct 9 18:21:14 2020 - [info] 172.16.120.12(172.16.120.12:3358)
Fri Oct 9 18:21:14 2020 - [info] Alive Slaves:
Fri Oct 9 18:21:14 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct 9 18:21:14 2020 - [info] GTID ON
Fri Oct 9 18:21:14 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 18:21:14 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Oct 9 18:21:14 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct 9 18:21:14 2020 - [info] GTID ON
Fri Oct 9 18:21:14 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 18:21:14 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Oct 9 18:21:14 2020 - [info] Current Alive Master: 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 18:21:14 2020 - [info] Checking slave configurations..
Fri Oct 9 18:21:14 2020 - [info] Checking replication filtering settings..
Fri Oct 9 18:21:14 2020 - [info] binlog_do_db= , binlog_ignore_db=
Fri Oct 9 18:21:14 2020 - [info] Replication filtering check ok.
Fri Oct 9 18:21:14 2020 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Fri Oct 9 18:21:14 2020 - [info] Checking SSH publickey authentication settings on the current master..
Fri Oct 9 18:21:14 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
Fri Oct 9 18:21:14 2020 - [info]
172.16.120.10(172.16.120.10:3358) (current master)
+--172.16.120.11(172.16.120.11:3358)
+--172.16.120.12(172.16.120.12:3358)

Fri Oct 9 18:21:14 2020 - [info] Checking master_ip_failover_script status:
Fri Oct 9 18:21:14 2020 - [info] /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --command=status --ssh_user=root --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358
Fri Oct 9 18:21:14 2020 - [info] OK.
Fri Oct 9 18:21:14 2020 - [warning] shutdown_script is not defined.
Fri Oct 9 18:21:14 2020 - [info] Set master ping interval 3 seconds.
Fri Oct 9 18:21:14 2020 - [info] Set secondary check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12
Fri Oct 9 18:21:14 2020 - [info] Starting ping health check on 172.16.120.10(172.16.120.10:3358)..
Fri Oct 9 18:21:14 2020 - [info] Ping(CONNECT) succeeded, waiting until MySQL doesn't respond..
Fri Oct 9 18:22:07 2020 - [warning] Got error on MySQL connect ping: DBI connect(';host=172.16.120.10;port=3358;mysql_connect_timeout=1','mha',...) failed: Can't connect to MySQL server on '172.16.120.10' (4) at /usr/local/share/perl5/MHA/HealthCheck.pm line 98.
2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 18:22:07 2020 - [info] Executing SSH check script: exit 0
Fri Oct 9 18:22:07 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12 --user=root --master_host=172.16.120.10 --master_ip=172.16.120.10 --master_port=3358 --master_user=mha --master_password=xxx --ping_type=CONNECT
Fri Oct 9 18:22:11 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 18:22:11 2020 - [warning] Connection failed 2 time(s)..
Fri Oct 9 18:22:12 2020 - [warning] HealthCheck: Got timeout on checking SSH connection to 172.16.120.10! at /usr/local/share/perl5/MHA/HealthCheck.pm line 344.
Monitoring server 172.16.120.11 is reachable, Master is not reachable from 172.16.120.11. OK.
Fri Oct 9 18:22:14 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 18:22:14 2020 - [warning] Connection failed 3 time(s)..
Fri Oct 9 18:22:17 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 18:22:17 2020 - [warning] Connection failed 4 time(s)..
Monitoring server 172.16.120.12 is reachable, Master is not reachable from 172.16.120.12. OK.
Fri Oct 9 18:22:18 2020 - [info] Master is not reachable from all other monitoring servers. Failover should start.
Fri Oct 9 18:22:18 2020 - [warning] Master is not reachable from health checker!
Fri Oct 9 18:22:18 2020 - [warning] Master 172.16.120.10(172.16.120.10:3358) is not reachable!
Fri Oct 9 18:22:18 2020 - [warning] SSH is NOT reachable.
Fri Oct 9 18:22:18 2020 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha/conf/masterha_default.cnf and /etc/masterha/conf/cls_new.cnf again, and trying to connect to all servers to check server status..
Fri Oct 9 18:22:18 2020 - [info] Reading default configuration from /etc/masterha/conf/masterha_default.cnf..
Fri Oct 9 18:22:18 2020 - [info] Reading application default configuration from /etc/masterha/conf/cls_new.cnf..
Fri Oct 9 18:22:18 2020 - [info] Reading server configuration from /etc/masterha/conf/cls_new.cnf..
Fri Oct 9 18:22:19 2020 - [info] GTID failover mode = 1
Fri Oct 9 18:22:19 2020 - [info] Dead Servers:
Fri Oct 9 18:22:19 2020 - [info] 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 18:22:19 2020 - [info] Alive Servers:
Fri Oct 9 18:22:19 2020 - [info] 172.16.120.11(172.16.120.11:3358)
Fri Oct 9 18:22:19 2020 - [info] 172.16.120.12(172.16.120.12:3358)
Fri Oct 9 18:22:19 2020 - [info] Alive Slaves:
Fri Oct 9 18:22:19 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct 9 18:22:19 2020 - [info] GTID ON
Fri Oct 9 18:22:19 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 18:22:19 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Oct 9 18:22:19 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct 9 18:22:19 2020 - [info] GTID ON
Fri Oct 9 18:22:19 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 18:22:19 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Oct 9 18:22:19 2020 - [info] Checking slave configurations..
Fri Oct 9 18:22:19 2020 - [info] Checking replication filtering settings..
Fri Oct 9 18:22:19 2020 - [info] Replication filtering check ok.
Fri Oct 9 18:22:19 2020 - [info] Master is down!
Fri Oct 9 18:22:19 2020 - [info] Terminating monitoring script.
Fri Oct 9 18:22:19 2020 - [info] Got exit code 20 (Master dead).
Fri Oct 9 18:22:19 2020 - [info] MHA::MasterFailover version 0.58.
Fri Oct 9 18:22:19 2020 - [info] Starting master failover.
Fri Oct 9 18:22:19 2020 - [info]
Fri Oct 9 18:22:19 2020 - [info] * Phase 1: Configuration Check Phase..
Fri Oct 9 18:22:19 2020 - [info]
Fri Oct 9 18:22:20 2020 - [info] GTID failover mode = 1
Fri Oct 9 18:22:20 2020 - [info] Dead Servers:
Fri Oct 9 18:22:20 2020 - [info] 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 18:22:20 2020 - [info] Checking master reachability via MySQL(double check)...
Fri Oct 9 18:22:21 2020 - [info] ok.
Fri Oct 9 18:22:21 2020 - [info] Alive Servers:
Fri Oct 9 18:22:21 2020 - [info] 172.16.120.11(172.16.120.11:3358)
Fri Oct 9 18:22:21 2020 - [info] 172.16.120.12(172.16.120.12:3358)
Fri Oct 9 18:22:21 2020 - [info] Alive Slaves:
Fri Oct 9 18:22:21 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct 9 18:22:21 2020 - [info] GTID ON
Fri Oct 9 18:22:21 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 18:22:21 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Oct 9 18:22:21 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct 9 18:22:21 2020 - [info] GTID ON
Fri Oct 9 18:22:21 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 18:22:21 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Oct 9 18:22:21 2020 - [info] Starting GTID based failover.
Fri Oct 9 18:22:21 2020 - [info]
Fri Oct 9 18:22:21 2020 - [info] ** Phase 1: Configuration Check Phase completed.
Fri Oct 9 18:22:21 2020 - [info]
Fri Oct 9 18:22:21 2020 - [info] * Phase 2: Dead Master Shutdown Phase..
Fri Oct 9 18:22:21 2020 - [info]
Fri Oct 9 18:22:21 2020 - [info] Forcing shutdown so that applications never connect to the current master..
Fri Oct 9 18:22:21 2020 - [info] Executing master IP deactivation script:
Fri Oct 9 18:22:21 2020 - [info] /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358 --command=stop
Disabling the VIP on old master: 172.16.120.10
Fake!!! old master rpl_semi_sync_master_enabled=0 rpl_semi_sync_slave_enabled=1
Fri Oct 9 18:22:21 2020 - [info] done.
Fri Oct 9 18:22:21 2020 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Fri Oct 9 18:22:21 2020 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Fri Oct 9 18:22:21 2020 - [info]
Fri Oct 9 18:22:21 2020 - [info] * Phase 3: Master Recovery Phase..
Fri Oct 9 18:22:21 2020 - [info]
Fri Oct 9 18:22:21 2020 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Fri Oct 9 18:22:21 2020 - [info]
Fri Oct 9 18:22:21 2020 - [info] The latest binary log file/position on all slaves is mysql-bin.000009:3084318
Fri Oct 9 18:22:21 2020 - [info] Retrieved Gtid Set: 44a4ea53-fcad-11ea-bd16-0050563b7b42:10391-19970
Fri Oct 9 18:22:21 2020 - [info] Latest slaves (Slaves that received relay log files to the latest):
Fri Oct 9 18:22:21 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct 9 18:22:21 2020 - [info] GTID ON
Fri Oct 9 18:22:21 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 18:22:21 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Oct 9 18:22:21 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct 9 18:22:21 2020 - [info] GTID ON
Fri Oct 9 18:22:21 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 18:22:21 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Oct 9 18:22:21 2020 - [info] The oldest binary log file/position on all slaves is mysql-bin.000009:3084318
Fri Oct 9 18:22:21 2020 - [info] Retrieved Gtid Set: 44a4ea53-fcad-11ea-bd16-0050563b7b42:10391-19970
Fri Oct 9 18:22:21 2020 - [info] Oldest slaves:
Fri Oct 9 18:22:21 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct 9 18:22:21 2020 - [info] GTID ON
Fri Oct 9 18:22:21 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 18:22:21 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Oct 9 18:22:21 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct 9 18:22:21 2020 - [info] GTID ON
Fri Oct 9 18:22:21 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 18:22:21 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Oct 9 18:22:21 2020 - [info]
Fri Oct 9 18:22:21 2020 - [info] * Phase 3.3: Determining New Master Phase..
Fri Oct 9 18:22:21 2020 - [info]
Fri Oct 9 18:22:21 2020 - [info] Searching new master from slaves..
Fri Oct 9 18:22:21 2020 - [info] Candidate masters from the configuration file:
Fri Oct 9 18:22:21 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct 9 18:22:21 2020 - [info] GTID ON
Fri Oct 9 18:22:21 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 18:22:21 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Oct 9 18:22:21 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct 9 18:22:21 2020 - [info] GTID ON
Fri Oct 9 18:22:21 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 18:22:21 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Oct 9 18:22:21 2020 - [info] Non-candidate masters:
Fri Oct 9 18:22:21 2020 - [info] Searching from candidate_master slaves which have received the latest relay log events..
Fri Oct 9 18:22:21 2020 - [info] New master is 172.16.120.11(172.16.120.11:3358)
Fri Oct 9 18:22:21 2020 - [info] Starting master failover..
Fri Oct 9 18:22:21 2020 - [info]
From:
172.16.120.10(172.16.120.10:3358) (current master)
+--172.16.120.11(172.16.120.11:3358)
+--172.16.120.12(172.16.120.12:3358)

To:
172.16.120.11(172.16.120.11:3358) (new master)
+--172.16.120.12(172.16.120.12:3358)
Fri Oct 9 18:22:21 2020 - [info]
Fri Oct 9 18:22:21 2020 - [info] * Phase 3.3: New Master Recovery Phase..
Fri Oct 9 18:22:21 2020 - [info]
Fri Oct 9 18:22:21 2020 - [info] Waiting all logs to be applied..
Fri Oct 9 18:22:21 2020 - [info] done.
Fri Oct 9 18:22:21 2020 - [info] Getting new master's binlog name and position..
Fri Oct 9 18:22:21 2020 - [info] mysql-bin.000008:3052407
Fri Oct 9 18:22:21 2020 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='172.16.120.11', MASTER_PORT=3358, MASTER_AUTO_POSITION=1, MASTER_USER='repler', MASTER_PASSWORD='xxx';
Fri Oct 9 18:22:21 2020 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: mysql-bin.000008, 3052407, 44a4ea53-fcad-11ea-bd16-0050563b7b42:1-19970,
45d1f02a-fcad-11ea-8a44-0050562f2198:1-27
Fri Oct 9 18:22:21 2020 - [info] Executing master IP activate script:
Fri Oct 9 18:22:21 2020 - [info] /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --command=start --ssh_user=root --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358 --new_master_host=172.16.120.11 --new_master_ip=172.16.120.11 --new_master_port=3358 --new_master_user='mha' --new_master_password=xxx
Enabling the VIP - 172.16.120.128 on the new master - 172.16.120.11
Fake!!! new master rpl_semi_sync_master_enabled=1 rpl_semi_sync_slave_enabled=0
Set read_only=0 on the new master.
Creating app user on the new master..
Fri Oct 9 18:22:21 2020 - [info] OK.
Fri Oct 9 18:22:21 2020 - [info] ** Finished master recovery successfully.
Fri Oct 9 18:22:21 2020 - [info] * Phase 3: Master Recovery Phase completed.
Fri Oct 9 18:22:21 2020 - [info]
Fri Oct 9 18:22:21 2020 - [info] * Phase 4: Slaves Recovery Phase..
Fri Oct 9 18:22:21 2020 - [info]
Fri Oct 9 18:22:21 2020 - [info]
Fri Oct 9 18:22:21 2020 - [info] * Phase 4.1: Starting Slaves in parallel..
Fri Oct 9 18:22:21 2020 - [info]
Fri Oct 9 18:22:21 2020 - [info] -- Slave recovery on host 172.16.120.12(172.16.120.12:3358) started, pid: 68999. Check tmp log /masterha/cls_new//172.16.120.12_3358_20201009182219.log if it takes time..
Fri Oct 9 18:22:22 2020 - [info]
Fri Oct 9 18:22:22 2020 - [info] Log messages from 172.16.120.12 ...
Fri Oct 9 18:22:22 2020 - [info]
Fri Oct 9 18:22:21 2020 - [info] Resetting slave 172.16.120.12(172.16.120.12:3358) and starting replication from the new master 172.16.120.11(172.16.120.11:3358)..
Fri Oct 9 18:22:21 2020 - [info] Executed CHANGE MASTER.
Fri Oct 9 18:22:21 2020 - [info] Slave started.
Fri Oct 9 18:22:21 2020 - [info] gtid_wait(44a4ea53-fcad-11ea-bd16-0050563b7b42:1-19970,
45d1f02a-fcad-11ea-8a44-0050562f2198:1-27) completed on 172.16.120.12(172.16.120.12:3358). Executed 0 events.
Fri Oct 9 18:22:22 2020 - [info] End of log messages from 172.16.120.12.
Fri Oct 9 18:22:22 2020 - [info] -- Slave on host 172.16.120.12(172.16.120.12:3358) started.
Fri Oct 9 18:22:22 2020 - [info] All new slave servers recovered successfully.
Fri Oct 9 18:22:22 2020 - [info]
Fri Oct 9 18:22:22 2020 - [info] * Phase 5: New master cleanup phase..
Fri Oct 9 18:22:22 2020 - [info]
Fri Oct 9 18:22:22 2020 - [info] Resetting slave info on the new master..
Fri Oct 9 18:22:22 2020 - [info] 172.16.120.11: Resetting slave info succeeded.
Fri Oct 9 18:22:22 2020 - [info] Master failover to 172.16.120.11(172.16.120.11:3358) completed successfully.
Fri Oct 9 18:22:22 2020 - [info]

----- Failover Report -----

cls_new: MySQL Master failover 172.16.120.10(172.16.120.10:3358) to 172.16.120.11(172.16.120.11:3358) succeeded

Master 172.16.120.10(172.16.120.10:3358) is down!

Check MHA Manager logs at centos-4:/masterha/cls_new/manager.log for details.

Started automated(non-interactive) failover.
Invalidated master IP address on 172.16.120.10(172.16.120.10:3358)
Selected 172.16.120.11(172.16.120.11:3358) as a new master.
172.16.120.11(172.16.120.11:3358): OK: Applying all logs succeeded.
172.16.120.11(172.16.120.11:3358): OK: Activated master IP address.
172.16.120.12(172.16.120.12:3358): OK: Slave started, replicating from 172.16.120.11(172.16.120.11:3358)
172.16.120.11(172.16.120.11:3358): Resetting slave info succeeded.
Master failover to 172.16.120.11(172.16.120.11:3358) completed successfully.
Fri Oct 9 18:22:22 2020 - [info] Sending mail..
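
The log above shows failover starting only after both monitoring servers (172.16.120.11 and 172.16.120.12) were reachable over SSH and both reported that the master was unreachable. A minimal sketch of that decision rule follows, written to match the outcomes observed in these test cases. It is not taken from MHA's source code, and the function name `should_start_failover` and its argument layout are hypothetical.

```python
def should_start_failover(manager_reaches_master, secondary_checks):
    """Decision rule as observed in these tests (not MHA's actual code).

    secondary_checks maps each monitoring host to a pair
    (ssh_ok, master_reachable_from_host). When SSH fails, the second
    value is unknown and can be passed as None.
    """
    if manager_reaches_master:
        # The manager's own ping still works, so nothing is escalated.
        return False
    if not secondary_checks:
        raise ValueError("this sketch assumes at least one monitoring server")
    # Failover starts only if every monitoring server is reachable over SSH
    # and every one of them also fails to reach the master.
    return all(
        ssh_ok and not master_reachable
        for ssh_ok, master_reachable in secondary_checks.values()
    )


# Case from the log above: both slaves confirm the master is unreachable.
both_fail = {
    "172.16.120.11": (True, False),
    "172.16.120.12": (True, False),
}
# Case where slave1 cannot reach the master but slave2 still can.
one_fails = {
    "172.16.120.11": (True, False),
    "172.16.120.12": (True, True),
}
# Case where the manager cannot SSH to slave1 at all.
ssh_broken = {
    "172.16.120.11": (False, None),
    "172.16.120.12": (True, False),
}

print(should_start_failover(False, both_fail))
print(should_start_failover(False, one_fails))
print(should_start_failover(False, ssh_broken))
```

With ping_type=INSERT the first argument stays true for as long as the manager's existing connection keeps working, which is why that mode behaves differently in the next test.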

ping_type=INSERT

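Because ping_type=INSERT pings over a persistent connection, the manager reports no error while that connection stays open.

Kill the manager's connection on the master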

root@localhost 18:24:52 [dbms_monitor]> show processlist;
+------+----------+---------------------+--------------------+------------------+------+---------------------------------------------------------------+------------------+-----------+---------------+
| Id | User | Host | db | Command | Time | State | Info | Rows_sent | Rows_examined |
+------+----------+---------------------+--------------------+------------------+------+---------------------------------------------------------------+------------------+-----------+---------------+
| 1248 | proxysql | 172.16.120.10:58672 | NULL | Sleep | 1 | | NULL | 0 | 0 |
| 1264 | proxysql | 172.16.120.12:56552 | NULL | Sleep | 2 | | NULL | 0 | 0 |
| 1343 | mha | 172.16.120.11:59758 | information_schema | Sleep | 74 | | NULL | 0 | 0 |
| 1452 | proxysql | 172.16.120.10:59046 | NULL | Sleep | 2 | | NULL | 1 | 0 |
| 3192 | proxysql | 172.16.120.11:33786 | NULL | Sleep | 3 | | NULL | 1 | 0 |
| 3238 | proxysql | 172.16.120.12:59006 | NULL | Sleep | 1 | | NULL | 1 | 0 |
| 3245 | proxysql | 172.16.120.11:33888 | NULL | Sleep | 2 | | NULL | 0 | 0 |
| 3262 | root | localhost | dbms_monitor | Query | 0 | starting | show processlist | 0 | 0 |
| 3357 | repler | 172.16.120.11:34036 | NULL | Binlog Dump GTID | 142 | Master has sent all binlog to slave; waiting for more updates | NULL | 0 | 0 |
| 3359 | repler | 172.16.120.12:59166 | NULL | Binlog Dump GTID | 123 | Master has sent all binlog to slave; waiting for more updates | NULL | 0 | 0 |
| 3364 | mha | 172.16.120.13:37512 | NULL | Sleep | 2 | | NULL | 0 | 0 |
+------+----------+---------------------+--------------------+------------------+------+---------------------------------------------------------------+------------------+-----------+---------------+
11 rows in set (0.00 sec)

root@localhost 18:26:25 [dbms_monitor]> kill 3364;
Query OK, 0 rows affected (0.01 sec)
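
The processlist above contains two `mha` sessions, and only the one coming from the manager host (172.16.120.13, Id 3364) is the persistent ping connection that was killed. A small sketch of how that session could be picked out programmatically is shown below. It assumes the processlist rows have already been fetched into dictionaries keyed by column name; the helper name `manager_connection_ids` is hypothetical, and the sample rows are copied from the output above.

```python
def manager_connection_ids(rows, manager_ip, mha_user="mha"):
    """Return the Ids of processlist rows opened by the MHA manager host."""
    ids = []
    for row in rows:
        # Host looks like "ip:port"; "localhost" has no port part.
        host_ip = row["Host"].split(":")[0]
        if row["User"] == mha_user and host_ip == manager_ip:
            ids.append(row["Id"])
    return ids


# A subset of the rows shown in the processlist output above.
rows = [
    {"Id": 1343, "User": "mha", "Host": "172.16.120.11:59758"},
    {"Id": 3262, "User": "root", "Host": "localhost"},
    {"Id": 3357, "User": "repler", "Host": "172.16.120.11:34036"},
    {"Id": 3364, "User": "mha", "Host": "172.16.120.13:37512"},
]

# Build the KILL statements instead of executing them, so they can be reviewed.
statements = [f"KILL {conn_id};" for conn_id in manager_connection_ids(rows, "172.16.120.13")]
print(statements)
```

Filtering on both user and source IP matters here because the `mha` session from 172.16.120.11 (Id 1343) is not the manager's ping connection.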

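Conclusion: with ping_type=INSERT, failover is triggered only after the persistent connection is broken; otherwise no failover happens.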

Fri Oct  9 18:25:33 2020 - [info] MHA::MasterMonitor version 0.58.
Fri Oct 9 18:25:34 2020 - [info] GTID failover mode = 1
Fri Oct 9 18:25:34 2020 - [info] Dead Servers:
Fri Oct 9 18:25:34 2020 - [info] Alive Servers:
Fri Oct 9 18:25:34 2020 - [info] 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 18:25:34 2020 - [info] 172.16.120.11(172.16.120.11:3358)
Fri Oct 9 18:25:34 2020 - [info] 172.16.120.12(172.16.120.12:3358)
Fri Oct 9 18:25:34 2020 - [info] Alive Slaves:
Fri Oct 9 18:25:34 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct 9 18:25:34 2020 - [info] GTID ON
Fri Oct 9 18:25:34 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 18:25:34 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Oct 9 18:25:34 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct 9 18:25:34 2020 - [info] GTID ON
Fri Oct 9 18:25:34 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 18:25:34 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Oct 9 18:25:34 2020 - [info] Current Alive Master: 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 18:25:34 2020 - [info] Checking slave configurations..
Fri Oct 9 18:25:34 2020 - [info] Checking replication filtering settings..
Fri Oct 9 18:25:34 2020 - [info] binlog_do_db= , binlog_ignore_db=
Fri Oct 9 18:25:34 2020 - [info] Replication filtering check ok.
Fri Oct 9 18:25:34 2020 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Fri Oct 9 18:25:34 2020 - [info] Checking SSH publickey authentication settings on the current master..
Fri Oct 9 18:25:35 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
Fri Oct 9 18:25:35 2020 - [info]
172.16.120.10(172.16.120.10:3358) (current master)
+--172.16.120.11(172.16.120.11:3358)
+--172.16.120.12(172.16.120.12:3358)

Fri Oct 9 18:25:35 2020 - [info] Checking master_ip_failover_script status:
Fri Oct 9 18:25:35 2020 - [info] /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --command=status --ssh_user=root --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358
Fri Oct 9 18:25:35 2020 - [info] OK.
Fri Oct 9 18:25:35 2020 - [warning] shutdown_script is not defined.
Fri Oct 9 18:25:35 2020 - [info] Set master ping interval 3 seconds.
Fri Oct 9 18:25:35 2020 - [info] Set secondary check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12
Fri Oct 9 18:25:35 2020 - [info] Starting ping health check on 172.16.120.10(172.16.120.10:3358)..
Fri Oct 9 18:25:35 2020 - [info] Ping(INSERT) succeeded, waiting until MySQL doesn't respond..
Fri Oct 9 18:26:44 2020 - [warning] Got error on MySQL insert ping: 2006 (MySQL server has gone away)
Fri Oct 9 18:26:44 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12 --user=root --master_host=172.16.120.10 --master_ip=172.16.120.10 --master_port=3358 --master_user=mha --master_password=xxx --ping_type=INSERT
Fri Oct 9 18:26:44 2020 - [info] Executing SSH check script: exit 0
Fri Oct 9 18:26:49 2020 - [warning] HealthCheck: Got timeout on checking SSH connection to 172.16.120.10! at /usr/local/share/perl5/MHA/HealthCheck.pm line 344.
Monitoring server 172.16.120.11 is reachable, Master is not reachable from 172.16.120.11. OK.
Fri Oct 9 18:26:50 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 18:26:50 2020 - [warning] Connection failed 2 time(s)..
Fri Oct 9 18:26:53 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 18:26:53 2020 - [warning] Connection failed 3 time(s)..
Monitoring server 172.16.120.12 is reachable, Master is not reachable from 172.16.120.12. OK.
Fri Oct 9 18:26:54 2020 - [info] Master is not reachable from all other monitoring servers. Failover should start.
Fri Oct 9 18:26:56 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (4))
Fri Oct 9 18:26:56 2020 - [warning] Connection failed 4 time(s)..
Fri Oct 9 18:26:56 2020 - [warning] Master is not reachable from health checker!
Fri Oct 9 18:26:56 2020 - [warning] Master 172.16.120.10(172.16.120.10:3358) is not reachable!
Fri Oct 9 18:26:56 2020 - [warning] SSH is NOT reachable.
Fri Oct 9 18:26:56 2020 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha/conf/masterha_default.cnf and /etc/masterha/conf/cls_new.cnf again, and trying to connect to all servers to check server status..
Fri Oct 9 18:26:56 2020 - [info] Reading default configuration from /etc/masterha/conf/masterha_default.cnf..
Fri Oct 9 18:26:56 2020 - [info] Reading application default configuration from /etc/masterha/conf/cls_new.cnf..
Fri Oct 9 18:26:56 2020 - [info] Reading server configuration from /etc/masterha/conf/cls_new.cnf..
Fri Oct 9 18:26:57 2020 - [info] GTID failover mode = 1
Fri Oct 9 18:26:57 2020 - [info] Dead Servers:
Fri Oct 9 18:26:57 2020 - [info] 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 18:26:57 2020 - [info] Alive Servers:
Fri Oct 9 18:26:57 2020 - [info] 172.16.120.11(172.16.120.11:3358)
Fri Oct 9 18:26:57 2020 - [info] 172.16.120.12(172.16.120.12:3358)
Fri Oct 9 18:26:57 2020 - [info] Alive Slaves:
Fri Oct 9 18:26:57 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct 9 18:26:57 2020 - [info] GTID ON
Fri Oct 9 18:26:57 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 18:26:57 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Oct 9 18:26:57 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct 9 18:26:57 2020 - [info] GTID ON
Fri Oct 9 18:26:57 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 18:26:57 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Oct 9 18:26:57 2020 - [info] Checking slave configurations..
Fri Oct 9 18:26:57 2020 - [info] Checking replication filtering settings..
Fri Oct 9 18:26:57 2020 - [info] Replication filtering check ok.
Fri Oct 9 18:26:57 2020 - [info] Master is down!
Fri Oct 9 18:26:57 2020 - [info] Terminating monitoring script.
Fri Oct 9 18:26:57 2020 - [info] Got exit code 20 (Master dead).
Fri Oct 9 18:26:57 2020 - [info] MHA::MasterFailover version 0.58.
Fri Oct 9 18:26:57 2020 - [info] Starting master failover.
Fri Oct 9 18:26:57 2020 - [info]
Fri Oct 9 18:26:57 2020 - [info] * Phase 1: Configuration Check Phase..
Fri Oct 9 18:26:57 2020 - [info]
Fri Oct 9 18:26:58 2020 - [info] GTID failover mode = 1
Fri Oct 9 18:26:58 2020 - [info] Dead Servers:
Fri Oct 9 18:26:58 2020 - [info] 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 18:26:58 2020 - [info] Alive Servers:
Fri Oct 9 18:26:58 2020 - [info] 172.16.120.11(172.16.120.11:3358)
Fri Oct 9 18:26:58 2020 - [info] 172.16.120.12(172.16.120.12:3358)
Fri Oct 9 18:26:58 2020 - [info] Alive Slaves:
Fri Oct 9 18:26:58 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct 9 18:26:58 2020 - [info] GTID ON
Fri Oct 9 18:26:58 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 18:26:58 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Oct 9 18:26:58 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct 9 18:26:58 2020 - [info] GTID ON
Fri Oct 9 18:26:58 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 18:26:58 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Oct 9 18:26:58 2020 - [info] Starting GTID based failover.
Fri Oct 9 18:26:58 2020 - [info]
Fri Oct 9 18:26:58 2020 - [info] ** Phase 1: Configuration Check Phase completed.
Fri Oct 9 18:26:58 2020 - [info]
Fri Oct 9 18:26:58 2020 - [info] * Phase 2: Dead Master Shutdown Phase..
Fri Oct 9 18:26:58 2020 - [info]
Fri Oct 9 18:26:58 2020 - [info] Forcing shutdown so that applications never connect to the current master..
Fri Oct 9 18:26:58 2020 - [info] Executing master IP deactivation script:
Fri Oct 9 18:26:58 2020 - [info] /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358 --command=stop
Disabling the VIP on old master: 172.16.120.10
Fake!!! old master rpl_semi_sync_master_enabled=0 rpl_semi_sync_slave_enabled=1
Fri Oct 9 18:26:58 2020 - [info] done.
Fri Oct 9 18:26:58 2020 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Fri Oct 9 18:26:58 2020 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Fri Oct 9 18:26:58 2020 - [info]
Fri Oct 9 18:26:58 2020 - [info] * Phase 3: Master Recovery Phase..
Fri Oct 9 18:26:58 2020 - [info]
Fri Oct 9 18:26:58 2020 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Fri Oct 9 18:26:58 2020 - [info]
Fri Oct 9 18:26:58 2020 - [info] The latest binary log file/position on all slaves is mysql-bin.000009:3101017
Fri Oct 9 18:26:58 2020 - [info] Retrieved Gtid Set: 44a4ea53-fcad-11ea-bd16-0050563b7b42:19971-20041
Fri Oct 9 18:26:58 2020 - [info] Latest slaves (Slaves that received relay log files to the latest):
Fri Oct 9 18:26:58 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct 9 18:26:58 2020 - [info] GTID ON
Fri Oct 9 18:26:58 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 18:26:58 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Oct 9 18:26:58 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct 9 18:26:58 2020 - [info] GTID ON
Fri Oct 9 18:26:58 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 18:26:58 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Oct 9 18:26:58 2020 - [info] The oldest binary log file/position on all slaves is mysql-bin.000009:3101017
Fri Oct 9 18:26:58 2020 - [info] Retrieved Gtid Set: 44a4ea53-fcad-11ea-bd16-0050563b7b42:19971-20041
Fri Oct 9 18:26:58 2020 - [info] Oldest slaves:
Fri Oct 9 18:26:58 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct 9 18:26:58 2020 - [info] GTID ON
Fri Oct 9 18:26:58 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 18:26:58 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Oct 9 18:26:58 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct 9 18:26:58 2020 - [info] GTID ON
Fri Oct 9 18:26:58 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 18:26:58 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Oct 9 18:26:58 2020 - [info]
Fri Oct 9 18:26:58 2020 - [info] * Phase 3.3: Determining New Master Phase..
Fri Oct 9 18:26:58 2020 - [info]
Fri Oct 9 18:26:58 2020 - [info] Searching new master from slaves..
Fri Oct 9 18:26:58 2020 - [info] Candidate masters from the configuration file:
Fri Oct 9 18:26:58 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct 9 18:26:58 2020 - [info] GTID ON
Fri Oct 9 18:26:58 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 18:26:58 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Oct 9 18:26:58 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Fri Oct 9 18:26:58 2020 - [info] GTID ON
Fri Oct 9 18:26:58 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Fri Oct 9 18:26:58 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Fri Oct 9 18:26:58 2020 - [info] Non-candidate masters:
Fri Oct 9 18:26:58 2020 - [info] Searching from candidate_master slaves which have received the latest relay log events..
Fri Oct 9 18:26:58 2020 - [info] New master is 172.16.120.11(172.16.120.11:3358)
Fri Oct 9 18:26:58 2020 - [info] Starting master failover..
Fri Oct 9 18:26:58 2020 - [info]
From:
172.16.120.10(172.16.120.10:3358) (current master)
+--172.16.120.11(172.16.120.11:3358)
+--172.16.120.12(172.16.120.12:3358)

To:
172.16.120.11(172.16.120.11:3358) (new master)
+--172.16.120.12(172.16.120.12:3358)
Fri Oct 9 18:26:58 2020 - [info]
Fri Oct 9 18:26:58 2020 - [info] * Phase 3.3: New Master Recovery Phase..
Fri Oct 9 18:26:58 2020 - [info]
Fri Oct 9 18:26:58 2020 - [info] Waiting all logs to be applied..
Fri Oct 9 18:26:58 2020 - [info] done.
Fri Oct 9 18:26:58 2020 - [info] Getting new master's binlog name and position..
Fri Oct 9 18:26:58 2020 - [info] mysql-bin.000008:3068991
Fri Oct 9 18:26:58 2020 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='172.16.120.11', MASTER_PORT=3358, MASTER_AUTO_POSITION=1, MASTER_USER='repler', MASTER_PASSWORD='xxx';
Fri Oct 9 18:26:58 2020 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: mysql-bin.000008, 3068991, 44a4ea53-fcad-11ea-bd16-0050563b7b42:1-20041,
45d1f02a-fcad-11ea-8a44-0050562f2198:1-27
Fri Oct 9 18:26:58 2020 - [info] Executing master IP activate script:
Fri Oct 9 18:26:58 2020 - [info] /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --command=start --ssh_user=root --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358 --new_master_host=172.16.120.11 --new_master_ip=172.16.120.11 --new_master_port=3358 --new_master_user='mha' --new_master_password=xxx
Enabling the VIP - 172.16.120.128 on the new master - 172.16.120.11
Fake!!! new master rpl_semi_sync_master_enabled=1 rpl_semi_sync_slave_enabled=0
Set read_only=0 on the new master.
Creating app user on the new master..
Fri Oct 9 18:26:58 2020 - [info] OK.
Fri Oct 9 18:26:58 2020 - [info] ** Finished master recovery successfully.
Fri Oct 9 18:26:58 2020 - [info] * Phase 3: Master Recovery Phase completed.
Fri Oct 9 18:26:58 2020 - [info]
Fri Oct 9 18:26:58 2020 - [info] * Phase 4: Slaves Recovery Phase..
Fri Oct 9 18:26:58 2020 - [info]
Fri Oct 9 18:26:58 2020 - [info]
Fri Oct 9 18:26:58 2020 - [info] * Phase 4.1: Starting Slaves in parallel..
Fri Oct 9 18:26:58 2020 - [info]
Fri Oct 9 18:26:58 2020 - [info] -- Slave recovery on host 172.16.120.12(172.16.120.12:3358) started, pid: 69850. Check tmp log /masterha/cls_new//172.16.120.12_3358_20201009182657.log if it takes time..
Fri Oct 9 18:26:59 2020 - [info]
Fri Oct 9 18:26:59 2020 - [info] Log messages from 172.16.120.12 ...
Fri Oct 9 18:26:59 2020 - [info]
Fri Oct 9 18:26:58 2020 - [info] Resetting slave 172.16.120.12(172.16.120.12:3358) and starting replication from the new master 172.16.120.11(172.16.120.11:3358)..
Fri Oct 9 18:26:58 2020 - [info] Executed CHANGE MASTER.
Fri Oct 9 18:26:58 2020 - [info] Slave started.
Fri Oct 9 18:26:58 2020 - [info] gtid_wait(44a4ea53-fcad-11ea-bd16-0050563b7b42:1-20041,
45d1f02a-fcad-11ea-8a44-0050562f2198:1-27) completed on 172.16.120.12(172.16.120.12:3358). Executed 0 events.
Fri Oct 9 18:26:59 2020 - [info] End of log messages from 172.16.120.12.
Fri Oct 9 18:26:59 2020 - [info] -- Slave on host 172.16.120.12(172.16.120.12:3358) started.
Fri Oct 9 18:26:59 2020 - [info] All new slave servers recovered successfully.
Fri Oct 9 18:26:59 2020 - [info]
Fri Oct 9 18:26:59 2020 - [info] * Phase 5: New master cleanup phase..
Fri Oct 9 18:26:59 2020 - [info]
Fri Oct 9 18:26:59 2020 - [info] Resetting slave info on the new master..
Fri Oct 9 18:26:59 2020 - [info] 172.16.120.11: Resetting slave info succeeded.
Fri Oct 9 18:26:59 2020 - [info] Master failover to 172.16.120.11(172.16.120.11:3358) completed successfully.
Fri Oct 9 18:26:59 2020 - [info]

----- Failover Report -----

cls_new: MySQL Master failover 172.16.120.10(172.16.120.10:3358) to 172.16.120.11(172.16.120.11:3358) succeeded

Master 172.16.120.10(172.16.120.10:3358) is down!

Check MHA Manager logs at centos-4:/masterha/cls_new/manager.log for details.

Started automated(non-interactive) failover.
Invalidated master IP address on 172.16.120.10(172.16.120.10:3358)
Selected 172.16.120.11(172.16.120.11:3358) as a new master.
172.16.120.11(172.16.120.11:3358): OK: Applying all logs succeeded.
172.16.120.11(172.16.120.11:3358): OK: Activated master IP address.
172.16.120.12(172.16.120.12:3358): OK: Slave started, replicating from 172.16.120.11(172.16.120.11:3358)
172.16.120.11(172.16.120.11:3358): Resetting slave info succeeded.
Master failover to 172.16.120.11(172.16.120.11:3358) completed successfully.
Fri Oct 9 18:26:59 2020 - [info] Sending mail..
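
To read timings out of a manager log like the one above, the timestamps can be compared directly. The sketch below is only an illustration: the helper names `parse_ts` and `seconds_between` are hypothetical, the three sample lines are copied from the log above, and the timestamp format assumes an English locale for the day and month names.

```python
from datetime import datetime

# Format of the timestamp prefix used in manager.log, e.g. "Fri Oct 9 18:26:44 2020".
TS_FORMAT = "%a %b %d %H:%M:%S %Y"


def parse_ts(line):
    """Return the timestamp of a manager.log line, or None if it has none."""
    head, sep, _ = line.partition(" - [")
    if not sep:
        return None
    try:
        return datetime.strptime(head.strip(), TS_FORMAT)
    except ValueError:
        return None


def seconds_between(lines, start_marker, end_marker):
    """Seconds from the first line containing start_marker to the next
    timestamped line containing end_marker, or None if either is missing."""
    start = None
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue  # report lines and script output carry no timestamp
        if start is None:
            if start_marker in line:
                start = ts
        elif end_marker in line:
            return (ts - start).total_seconds()
    return None


log = [
    "Fri Oct 9 18:26:44 2020 - [warning] Got error on MySQL insert ping: 2006 (MySQL server has gone away)",
    "Fri Oct 9 18:26:56 2020 - [warning] Master is not reachable from health checker!",
    "Fri Oct 9 18:26:59 2020 - [info] Master failover to 172.16.120.11(172.16.120.11:3358) completed successfully.",
]

detect = seconds_between(log, "Got error on MySQL", "Master is not reachable from health checker")
total = seconds_between(log, "Got error on MySQL", "completed successfully")
print(detect, total)  # 12.0 15.0
```

For this run, that gives 12 seconds from the first failed ping to the master being declared unreachable, and 15 seconds until the failover completed.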

[Test case] Master is down and the slaves also have problems, part 1 (some slaves are down)

Master goes down after slave-1 has already gone down

ping_type=CONNECT

After starting the manager, shut down slave-1

Sat Oct 10 10:28:35 2020 - [info] MHA::MasterMonitor version 0.58.
Sat Oct 10 10:28:37 2020 - [info] GTID failover mode = 1
Sat Oct 10 10:28:37 2020 - [info] Dead Servers:
Sat Oct 10 10:28:37 2020 - [info] Alive Servers:
Sat Oct 10 10:28:37 2020 - [info] 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 10:28:37 2020 - [info] 172.16.120.11(172.16.120.11:3358)
Sat Oct 10 10:28:37 2020 - [info] 172.16.120.12(172.16.120.12:3358)
Sat Oct 10 10:28:37 2020 - [info] Alive Slaves:
Sat Oct 10 10:28:37 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 10:28:37 2020 - [info] GTID ON
Sat Oct 10 10:28:37 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 10:28:37 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 10:28:37 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 10:28:37 2020 - [info] GTID ON
Sat Oct 10 10:28:37 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 10:28:37 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 10:28:37 2020 - [info] Current Alive Master: 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 10:28:37 2020 - [info] Checking slave configurations..
Sat Oct 10 10:28:37 2020 - [info] Checking replication filtering settings..
Sat Oct 10 10:28:37 2020 - [info] binlog_do_db= , binlog_ignore_db=
Sat Oct 10 10:28:37 2020 - [info] Replication filtering check ok.
Sat Oct 10 10:28:37 2020 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Sat Oct 10 10:28:37 2020 - [info] Checking SSH publickey authentication settings on the current master..
Sat Oct 10 10:28:37 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
Sat Oct 10 10:28:37 2020 - [info]
172.16.120.10(172.16.120.10:3358) (current master)
+--172.16.120.11(172.16.120.11:3358)
+--172.16.120.12(172.16.120.12:3358)

Sat Oct 10 10:28:37 2020 - [info] Checking master_ip_failover_script status:
Sat Oct 10 10:28:37 2020 - [info] /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --command=status --ssh_user=root --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358
Sat Oct 10 10:28:37 2020 - [info] OK.
Sat Oct 10 10:28:37 2020 - [warning] shutdown_script is not defined.
Sat Oct 10 10:28:37 2020 - [info] Set master ping interval 3 seconds.
Sat Oct 10 10:28:37 2020 - [info] Set secondary check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12
Sat Oct 10 10:28:37 2020 - [info] Starting ping health check on 172.16.120.10(172.16.120.10:3358)..
Sat Oct 10 10:28:37 2020 - [info] Ping(CONNECT) succeeded, waiting until MySQL doesn't respond..

Shut down slave-1

[root@centos-2 10:19:38 ~]
#mysql -uroot -p -S /data/mysql_3358/run/mysql.sock dbms_monitor
mysql: [Warning] Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 2118
Server version: 5.7.31-34-log Percona Server (GPL), Release 34, Revision 2e68637

Copyright (c) 2009-2020 Percona LLC and/or its affiliates
Copyright (c) 2000, 2020, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

root@localhost 10:20:52 [dbms_monitor]> shutdown;
Query OK, 0 rows affected (0.03 sec)

After the shutdown, the MHA manager is still running normally

Shut down the master

[root@centos-1 10:20:35 ~]
#mysql -uroot -p -S /data/mysql_3358/run/mysql.sock dbms_monitor
mysql: [Warning] Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 3413
Server version: 5.7.31-34-log Percona Server (GPL), Release 34, Revision 2e68637

Copyright (c) 2009-2020 Percona LLC and/or its affiliates
Copyright (c) 2000, 2020, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

root@localhost 10:20:47 [dbms_monitor]> shutdown;
Query OK, 0 rows affected (0.03 sec)

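Conclusion: failover is triggered, but it fails. It aborts in Phase 1 because slave-1 (172.16.120.11) is dead and MHA requires it to be alive.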

Sat Oct 10 10:50:38 2020 - [warning] Got error on MySQL connect ping: DBI connect(';host=172.16.120.10;port=3358;mysql_connect_timeout=1','mha',...) failed: Can't connect to MySQL server on '172.16.120.10' (111) at /usr/local/share/perl5/MHA/HealthCheck.pm line 98.
2003 (Can't connect to MySQL server on '172.16.120.10' (111))
Sat Oct 10 10:50:38 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12 --user=root --master_host=172.16.120.10 --master_ip=172.16.120.10 --master_port=3358 --master_user=mha --master_password=xxx --ping_type=CONNECT
Sat Oct 10 10:50:38 2020 - [info] Executing SSH check script: exit 0
Monitoring server 172.16.120.11 is reachable, Master is not reachable from 172.16.120.11. OK.
Sat Oct 10 10:50:38 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
Monitoring server 172.16.120.12 is reachable, Master is not reachable from 172.16.120.12. OK.
Sat Oct 10 10:50:38 2020 - [info] Master is not reachable from all other monitoring servers. Failover should start.
Sat Oct 10 10:50:41 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (111))
Sat Oct 10 10:50:41 2020 - [warning] Connection failed 2 time(s)..
Sat Oct 10 10:50:44 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (111))
Sat Oct 10 10:50:44 2020 - [warning] Connection failed 3 time(s)..
Sat Oct 10 10:50:47 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (111))
Sat Oct 10 10:50:47 2020 - [warning] Connection failed 4 time(s)..
Sat Oct 10 10:50:47 2020 - [warning] Master is not reachable from health checker!
Sat Oct 10 10:50:47 2020 - [warning] Master 172.16.120.10(172.16.120.10:3358) is not reachable!
Sat Oct 10 10:50:47 2020 - [warning] SSH is reachable.
Sat Oct 10 10:50:47 2020 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha/conf/masterha_default.cnf and /etc/masterha/conf/cls_new.cnf again, and trying to connect to all servers to check server status..
Sat Oct 10 10:50:47 2020 - [info] Reading default configuration from /etc/masterha/conf/masterha_default.cnf..
Sat Oct 10 10:50:47 2020 - [info] Reading application default configuration from /etc/masterha/conf/cls_new.cnf..
Sat Oct 10 10:50:47 2020 - [info] Reading server configuration from /etc/masterha/conf/cls_new.cnf..
Sat Oct 10 10:50:48 2020 - [info] GTID failover mode = 1
Sat Oct 10 10:50:48 2020 - [info] Dead Servers:
Sat Oct 10 10:50:48 2020 - [info] 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 10:50:48 2020 - [info] 172.16.120.11(172.16.120.11:3358)
Sat Oct 10 10:50:48 2020 - [info] Alive Servers:
Sat Oct 10 10:50:48 2020 - [info] 172.16.120.12(172.16.120.12:3358)
Sat Oct 10 10:50:48 2020 - [info] Alive Slaves:
Sat Oct 10 10:50:48 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 10:50:48 2020 - [info] GTID ON
Sat Oct 10 10:50:48 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 10:50:48 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 10:50:48 2020 - [info] Checking slave configurations..
Sat Oct 10 10:50:48 2020 - [info] Checking replication filtering settings..
Sat Oct 10 10:50:48 2020 - [info] Replication filtering check ok.
Sat Oct 10 10:50:48 2020 - [info] Master is down!
Sat Oct 10 10:50:48 2020 - [info] Terminating monitoring script.
Sat Oct 10 10:50:48 2020 - [info] Got exit code 20 (Master dead).
Sat Oct 10 10:50:48 2020 - [info] MHA::MasterFailover version 0.58.
Sat Oct 10 10:50:48 2020 - [info] Starting master failover.
Sat Oct 10 10:50:48 2020 - [info]
Sat Oct 10 10:50:48 2020 - [info] * Phase 1: Configuration Check Phase..
Sat Oct 10 10:50:48 2020 - [info]
Sat Oct 10 10:50:49 2020 - [info] GTID failover mode = 1
Sat Oct 10 10:50:49 2020 - [info] Dead Servers:
Sat Oct 10 10:50:49 2020 - [info] 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 10:50:49 2020 - [info] 172.16.120.11(172.16.120.11:3358)
Sat Oct 10 10:50:49 2020 - [info] Checking master reachability via MySQL(double check)...
Sat Oct 10 10:50:49 2020 - [info] ok.
Sat Oct 10 10:50:49 2020 - [info] Alive Servers:
Sat Oct 10 10:50:49 2020 - [info] 172.16.120.12(172.16.120.12:3358)
Sat Oct 10 10:50:49 2020 - [info] Alive Slaves:
Sat Oct 10 10:50:49 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 10:50:49 2020 - [info] GTID ON
Sat Oct 10 10:50:49 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 10:50:49 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 10:50:49 2020 - [error][/usr/local/share/perl5/MHA/ServerManager.pm, ln492] Server 172.16.120.11(172.16.120.11:3358) is dead, but must be alive! Check server settings.
Sat Oct 10 10:50:49 2020 - [error][/usr/local/share/perl5/MHA/ManagerUtil.pm, ln177] Got ERROR: at /usr/local/share/perl5/MHA/MasterFailover.pm line 269.
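The reason for the aborted failover is easy to miss in a long manager.log. The following is a minimal Python sketch, not part of MHA, that pulls out every server reported as "dead, but must be alive". The function name `find_blocking_servers` and the regular expression are my own; the sample lines are copied from the log above.

```python
import re

# Matches the ServerManager error line that makes Phase 1 abort.
DEAD_BUT_REQUIRED = re.compile(
    r"\[error\].*Server [^\s(]+\((?P<addr>[^)]+)\) is dead, but must be alive!"
)


def find_blocking_servers(log_text):
    """Return host:port for each server whose failure aborted the failover."""
    return [m.group("addr") for m in DEAD_BUT_REQUIRED.finditer(log_text)]


sample_log = """\
Sat Oct 10 10:50:49 2020 - [info] Alive Servers:
Sat Oct 10 10:50:49 2020 - [info] 172.16.120.12(172.16.120.12:3358)
Sat Oct 10 10:50:49 2020 - [error][/usr/local/share/perl5/MHA/ServerManager.pm, ln492] Server 172.16.120.11(172.16.120.11:3358) is dead, but must be alive! Check server settings.
Sat Oct 10 10:50:49 2020 - [error][/usr/local/share/perl5/MHA/ManagerUtil.pm, ln177] Got ERROR: at /usr/local/share/perl5/MHA/MasterFailover.pm line 269.
"""

print(find_blocking_servers(sample_log))  # ['172.16.120.11:3358']
```

In practice the text would be read from /masterha/cls_new/manager.log on the manager host instead of an inline string.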

ping_type=INSERT

This should behave the same as CONNECT, but we run through the procedure anyway

Start the manager, then shut down slave-1

Sat Oct 10 10:59:07 2020 - [info] MHA::MasterMonitor version 0.58.
Sat Oct 10 10:59:09 2020 - [info] GTID failover mode = 1
Sat Oct 10 10:59:09 2020 - [info] Dead Servers:
Sat Oct 10 10:59:09 2020 - [info] Alive Servers:
Sat Oct 10 10:59:09 2020 - [info] 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 10:59:09 2020 - [info] 172.16.120.11(172.16.120.11:3358)
Sat Oct 10 10:59:09 2020 - [info] 172.16.120.12(172.16.120.12:3358)
Sat Oct 10 10:59:09 2020 - [info] Alive Slaves:
Sat Oct 10 10:59:09 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 10:59:09 2020 - [info] GTID ON
Sat Oct 10 10:59:09 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 10:59:09 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 10:59:09 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 10:59:09 2020 - [info] GTID ON
Sat Oct 10 10:59:09 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 10:59:09 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 10:59:09 2020 - [info] Current Alive Master: 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 10:59:09 2020 - [info] Checking slave configurations..
Sat Oct 10 10:59:09 2020 - [info] Checking replication filtering settings..
Sat Oct 10 10:59:09 2020 - [info] binlog_do_db= , binlog_ignore_db=
Sat Oct 10 10:59:09 2020 - [info] Replication filtering check ok.
Sat Oct 10 10:59:09 2020 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Sat Oct 10 10:59:09 2020 - [info] Checking SSH publickey authentication settings on the current master..
Sat Oct 10 10:59:09 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
Sat Oct 10 10:59:09 2020 - [info]
172.16.120.10(172.16.120.10:3358) (current master)
+--172.16.120.11(172.16.120.11:3358)
+--172.16.120.12(172.16.120.12:3358)

Sat Oct 10 10:59:09 2020 - [info] Checking master_ip_failover_script status:
Sat Oct 10 10:59:09 2020 - [info] /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --command=status --ssh_user=root --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358
Sat Oct 10 10:59:09 2020 - [info] OK.
Sat Oct 10 10:59:09 2020 - [warning] shutdown_script is not defined.
Sat Oct 10 10:59:09 2020 - [info] Set master ping interval 3 seconds.
Sat Oct 10 10:59:09 2020 - [info] Set secondary check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12
Sat Oct 10 10:59:09 2020 - [info] Starting ping health check on 172.16.120.10(172.16.120.10:3358)..
Sat Oct 10 10:59:09 2020 - [info] Ping(INSERT) succeeded, waiting until MySQL doesn't respond..

Shut down slave-1

[root@centos-2 10:56:24 /usr/local/Percona-Server-5.7.29-32-Linux.x86_64.ssl101]
#2020-10-10T03:00:49.502943Z mysqld_safe mysqld from pid file /data/mysql_3358/run/mysql.pid ended

Shut down the master

[root@centos-1 11:07:08 ~]
#mysql -uroot -p -S /data/mysql_3358/run/mysql.sock dbms_monitor
mysql: [Warning] Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 35
Server version: 5.7.29-32-log Percona Server (GPL), Release 32, Revision 56bce88

Copyright (c) 2009-2020 Percona LLC and/or its affiliates
Copyright (c) 2000, 2020, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

root@localhost 11:07:09 [dbms_monitor]> shutdown;
Query OK, 0 rows affected (0.00 sec)

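Conclusion: failover is triggered, but it fails, for the same reason as with CONNECT: slave-1 (172.16.120.11) is dead and MHA requires it to be alive.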

Sat Oct 10 11:07:15 2020 - [warning] Got error on MySQL insert ping: 2006 (MySQL server has gone away)
Sat Oct 10 11:07:15 2020 - [info] Executing SSH check script: exit 0
Sat Oct 10 11:07:15 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12 --user=root --master_host=172.16.120.10 --master_ip=172.16.120.10 --master_port=3358 --master_user=mha --master_password=xxx --ping_type=INSERT
Monitoring server 172.16.120.11 is reachable, Master is not reachable from 172.16.120.11. OK.
Sat Oct 10 11:07:16 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
Monitoring server 172.16.120.12 is reachable, Master is not reachable from 172.16.120.12. OK.
Sat Oct 10 11:07:16 2020 - [info] Master is not reachable from all other monitoring servers. Failover should start.
Sat Oct 10 11:07:18 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (111))
Sat Oct 10 11:07:18 2020 - [warning] Connection failed 2 time(s)..
Sat Oct 10 11:07:21 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (111))
Sat Oct 10 11:07:21 2020 - [warning] Connection failed 3 time(s)..
Sat Oct 10 11:07:24 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (111))
Sat Oct 10 11:07:24 2020 - [warning] Connection failed 4 time(s)..
Sat Oct 10 11:07:24 2020 - [warning] Master is not reachable from health checker!
Sat Oct 10 11:07:24 2020 - [warning] Master 172.16.120.10(172.16.120.10:3358) is not reachable!
Sat Oct 10 11:07:24 2020 - [warning] SSH is reachable.
Sat Oct 10 11:07:24 2020 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha/conf/masterha_default.cnf and /etc/masterha/conf/cls_new.cnf again, and trying to connect to all servers to check server status..
Sat Oct 10 11:07:24 2020 - [info] Reading default configuration from /etc/masterha/conf/masterha_default.cnf..
Sat Oct 10 11:07:24 2020 - [info] Reading application default configuration from /etc/masterha/conf/cls_new.cnf..
Sat Oct 10 11:07:24 2020 - [info] Reading server configuration from /etc/masterha/conf/cls_new.cnf..
Sat Oct 10 11:07:25 2020 - [info] GTID failover mode = 1
Sat Oct 10 11:07:25 2020 - [info] Dead Servers:
Sat Oct 10 11:07:25 2020 - [info] 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 11:07:25 2020 - [info] 172.16.120.11(172.16.120.11:3358)
Sat Oct 10 11:07:25 2020 - [info] Alive Servers:
Sat Oct 10 11:07:25 2020 - [info] 172.16.120.12(172.16.120.12:3358)
Sat Oct 10 11:07:25 2020 - [info] Alive Slaves:
Sat Oct 10 11:07:25 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 11:07:25 2020 - [info] GTID ON
Sat Oct 10 11:07:25 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 11:07:25 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 11:07:25 2020 - [info] Checking slave configurations..
Sat Oct 10 11:07:25 2020 - [info] Checking replication filtering settings..
Sat Oct 10 11:07:25 2020 - [info] Replication filtering check ok.
Sat Oct 10 11:07:25 2020 - [info] Master is down!
Sat Oct 10 11:07:25 2020 - [info] Terminating monitoring script.
Sat Oct 10 11:07:25 2020 - [info] Got exit code 20 (Master dead).
Sat Oct 10 11:07:25 2020 - [info] MHA::MasterFailover version 0.58.
Sat Oct 10 11:07:25 2020 - [info] Starting master failover.
Sat Oct 10 11:07:25 2020 - [info]
Sat Oct 10 11:07:25 2020 - [info] * Phase 1: Configuration Check Phase..
Sat Oct 10 11:07:25 2020 - [info]
Sat Oct 10 11:07:26 2020 - [info] GTID failover mode = 1
Sat Oct 10 11:07:26 2020 - [info] Dead Servers:
Sat Oct 10 11:07:26 2020 - [info] 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 11:07:26 2020 - [info] 172.16.120.11(172.16.120.11:3358)
Sat Oct 10 11:07:26 2020 - [info] Alive Servers:
Sat Oct 10 11:07:26 2020 - [info] 172.16.120.12(172.16.120.12:3358)
Sat Oct 10 11:07:26 2020 - [info] Alive Slaves:
Sat Oct 10 11:07:26 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 11:07:26 2020 - [info] GTID ON
Sat Oct 10 11:07:26 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 11:07:26 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 11:07:26 2020 - [error][/usr/local/share/perl5/MHA/ServerManager.pm, ln492] Server 172.16.120.11(172.16.120.11:3358) is dead, but must be alive! Check server settings.
Sat Oct 10 11:07:26 2020 - [error][/usr/local/share/perl5/MHA/ManagerUtil.pm, ln177] Got ERROR: at /usr/local/share/perl5/MHA/MasterFailover.pm line 269.
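The log above also shows how long detection takes with ping_interval=3: the first failed ping, three more connection attempts 3 seconds apart, and then "Master is down!". As a rough sketch (the helper names `parse_ts` and `seconds_between` are made up for this example, and the sample is an excerpt of the log above), the elapsed time can be measured like this:

```python
from datetime import datetime

# Timestamp layout used by the manager log, e.g. "Sat Oct 10 11:07:15 2020".
TS_FORMAT = "%a %b %d %H:%M:%S %Y"


def parse_ts(line):
    """Parse the leading timestamp of a manager log line, or return None."""
    stamp, sep, _ = line.partition(" - [")
    if not sep:
        return None
    try:
        return datetime.strptime(stamp.strip(), TS_FORMAT)
    except ValueError:
        return None


def seconds_between(log_text, start_marker, end_marker):
    """Seconds from the first line containing start_marker to the first
    later line containing end_marker; None if either one is missing."""
    start = None
    for line in log_text.splitlines():
        ts = parse_ts(line)
        if ts is None:
            continue  # continuation lines without a timestamp
        if start is None and start_marker in line:
            start = ts
        elif start is not None and end_marker in line:
            return (ts - start).total_seconds()
    return None


sample_log = """\
Sat Oct 10 11:07:15 2020 - [warning] Got error on MySQL insert ping: 2006 (MySQL server has gone away)
Sat Oct 10 11:07:18 2020 - [warning] Connection failed 2 time(s)..
Sat Oct 10 11:07:21 2020 - [warning] Connection failed 3 time(s)..
Sat Oct 10 11:07:24 2020 - [warning] Connection failed 4 time(s)..
Sat Oct 10 11:07:25 2020 - [info] Master is down!
Sat Oct 10 11:07:26 2020 - [error][/usr/local/share/perl5/MHA/ManagerUtil.pm, ln177] Got ERROR: at /usr/local/share/perl5/MHA/MasterFailover.pm line 269.
"""

print(seconds_between(sample_log, "insert ping", "Master is down!"))  # 10.0
print(seconds_between(sample_log, "insert ping", "Got ERROR"))        # 11.0
```

So in this run the master was declared down about 10 seconds after the first failed ping, and the failover attempt was abandoned one second later.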

Summary: if you do not want a slave-1 outage to block failover, set ignore_fail=1 for slave-1 in the configuration file

[server2]
hostname=172.16.120.11
port=3358
candidate_master=1
ignore_fail=1
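To make the effect of ignore_fail concrete, here is a small Python model of the rule observed in this test: a dead slave blocks failover unless its section sets ignore_fail=1. This is only an illustration of the behaviour seen above, not MHA's actual code. The function `failover_blockers` is hypothetical, and the [server3] section is an assumption, since the original configuration for slave-2 is not shown.

```python
import configparser

CONF_WITHOUT = """\
[server2]
hostname=172.16.120.11
port=3358
candidate_master=1

[server3]
hostname=172.16.120.12
port=3358
candidate_master=1
"""

# Same file, with ignore_fail=1 added to slave-1 as in the fragment above.
CONF_WITH = CONF_WITHOUT.replace(
    "hostname=172.16.120.11\n", "hostname=172.16.120.11\nignore_fail=1\n"
)


def failover_blockers(conf_text, dead_slaves):
    """Return the dead slaves that would still block failover,
    i.e. those whose section does not set ignore_fail=1."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    blockers = []
    for section in parser.sections():
        if section == "server default" or not section.startswith("server"):
            continue
        host = parser.get(section, "hostname", fallback=None)
        ignored = parser.get(section, "ignore_fail", fallback="0").strip() == "1"
        if host in dead_slaves and not ignored:
            blockers.append(host)
    return blockers


dead = {"172.16.120.11"}
print(failover_blockers(CONF_WITHOUT, dead))  # ['172.16.120.11']
print(failover_blockers(CONF_WITH, dead))     # []
```

With the first configuration slave-1 is reported as a blocker, which matches the "is dead, but must be alive" error. With ignore_fail=1 the list is empty.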

[Test case] Master down while a slave also has a problem, part 2 (io_thread stopped on a slave)

The master goes down after the io_thread on slave-1 has already been stopped.

ping_type=CONNECT

Start the manager, then stop the io_thread on slave-1

Sat Oct 10 11:13:58 2020 - [info] MHA::MasterMonitor version 0.58.
Sat Oct 10 11:13:59 2020 - [info] GTID failover mode = 1
Sat Oct 10 11:13:59 2020 - [info] Dead Servers:
Sat Oct 10 11:13:59 2020 - [info] Alive Servers:
Sat Oct 10 11:13:59 2020 - [info] 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 11:13:59 2020 - [info] 172.16.120.11(172.16.120.11:3358)
Sat Oct 10 11:13:59 2020 - [info] 172.16.120.12(172.16.120.12:3358)
Sat Oct 10 11:13:59 2020 - [info] Alive Slaves:
Sat Oct 10 11:13:59 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 11:13:59 2020 - [info] GTID ON
Sat Oct 10 11:13:59 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 11:13:59 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 11:13:59 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 11:13:59 2020 - [info] GTID ON
Sat Oct 10 11:13:59 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 11:13:59 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 11:13:59 2020 - [info] Current Alive Master: 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 11:13:59 2020 - [info] Checking slave configurations..
Sat Oct 10 11:13:59 2020 - [info] Checking replication filtering settings..
Sat Oct 10 11:13:59 2020 - [info] binlog_do_db= , binlog_ignore_db=
Sat Oct 10 11:13:59 2020 - [info] Replication filtering check ok.
Sat Oct 10 11:13:59 2020 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Sat Oct 10 11:13:59 2020 - [info] Checking SSH publickey authentication settings on the current master..
Sat Oct 10 11:13:59 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
Sat Oct 10 11:13:59 2020 - [info]
172.16.120.10(172.16.120.10:3358) (current master)
+--172.16.120.11(172.16.120.11:3358)
+--172.16.120.12(172.16.120.12:3358)

Sat Oct 10 11:13:59 2020 - [info] Checking master_ip_failover_script status:
Sat Oct 10 11:13:59 2020 - [info] /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --command=status --ssh_user=root --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358
Sat Oct 10 11:14:00 2020 - [info] OK.
Sat Oct 10 11:14:00 2020 - [warning] shutdown_script is not defined.
Sat Oct 10 11:14:00 2020 - [info] Set master ping interval 3 seconds.
Sat Oct 10 11:14:00 2020 - [info] Set secondary check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12
Sat Oct 10 11:14:00 2020 - [info] Starting ping health check on 172.16.120.10(172.16.120.10:3358)..
Sat Oct 10 11:14:00 2020 - [info] Ping(CONNECT) succeeded, waiting until MySQL doesn't respond..

Stop the io_thread on slave-1

root@localhost 11:17:29 [dbms_monitor]> stop slave io_thread;
Query OK, 0 rows affected (0.01 sec)

root@localhost 11:17:33 [dbms_monitor]> pager cat - | grep -E 'Master_Log_File|Relay_Master_Log_File|Read_Master_Log_Pos|Exec_Master_Log_Pos|Slave_IO_Running|Slave_SQL_Running|Slave_SQL_Running_State|Last|Relay_Log_File|Relay_Log_Pos'
PAGER set to 'cat - | grep -E 'Master_Log_File|Relay_Master_Log_File|Read_Master_Log_Pos|Exec_Master_Log_Pos|Slave_IO_Running|Slave_SQL_Running|Slave_SQL_Running_State|Last|Relay_Log_File|Relay_Log_Pos''
root@localhost 11:17:49 [dbms_monitor]> show slave status\G
Master_Log_File: mysql-bin.000011
Read_Master_Log_Pos: 194
Relay_Log_File: mysql-relay-bin.000009
Relay_Log_Pos: 407
Relay_Master_Log_File: mysql-bin.000011
Slave_IO_Running: No
Slave_SQL_Running: Yes
Last_Errno: 0
Last_Error:
Exec_Master_Log_Pos: 194
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
1 row in set (0.00 sec)

After the io_thread is stopped, the manager is still running normally
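The test cases in this post distinguish a replication thread that was stopped cleanly from one that stopped with an error. A minimal Python sketch of that distinction, based only on the fields shown in the SHOW SLAVE STATUS output above, might look like the following. The helper names `parse_slave_status` and `classify_thread` are invented for this example.

```python
def parse_slave_status(text):
    """Turn 'Key: value' lines of vertical SHOW SLAVE STATUS output into a dict."""
    status = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            status[key.strip()] = value.strip()
    return status


def classify_thread(status, thread):
    """thread is 'IO' or 'SQL'; returns 'running', 'stopped' or 'error'."""
    if status.get(f"Slave_{thread}_Running") == "Yes":
        return "running"
    errno = status.get(f"Last_{thread}_Errno", "0")
    # Not running with a non-zero error number means the thread failed;
    # not running with errno 0 means it was stopped deliberately.
    return "error" if errno not in ("", "0") else "stopped"


sample = """\
Slave_IO_Running: No
Slave_SQL_Running: Yes
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
"""

status = parse_slave_status(sample)
print(classify_thread(status, "IO"))   # stopped
print(classify_thread(status, "SQL"))  # running
```

For slave-1 in this test the IO thread is classified as stopped (no error) and the SQL thread as running, which is the state the manager tolerated.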

Insert a row on the master, then shut it down

root@localhost 11:21:18 [dbms_monitor]> insert into monitor_delay values(1,now());
Query OK, 1 row affected (0.01 sec)

root@localhost 11:21:39 [dbms_monitor]> select * from monitor_delay;
+----+---------------------+
| id | ctime |
+----+---------------------+
| 1 | 2020-10-10 11:21:39 |
+----+---------------------+
1 row in set (0.00 sec)

root@localhost 11:21:54 [dbms_monitor]> shutdown;
Query OK, 0 rows affected (0.00 sec)

slave-1

root@localhost 11:17:49 [dbms_monitor]> select * from monitor_delay;
Empty set (0.01 sec)

slave-2

root@localhost 10:20:54 [dbms_monitor]> select * from monitor_delay;
+----+---------------------+
| id | ctime |
+----+---------------------+
| 1 | 2020-10-10 11:21:39 |
+----+---------------------+
1 row in set (0.01 sec)
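The two queries show that slave-1 never received the row while slave-2 did. In the failover log for this run, MHA reports the same gap as binary log coordinates: mysql-bin.000011:194 for 172.16.120.11 and mysql-bin.000011:486 for 172.16.120.12. The sketch below compares such coordinates in Python. It is a simplification under the assumption that all coordinates come from the same master, and it does not reproduce MHA's full selection logic (which also considers candidate_master and other settings). `binlog_key` is a hypothetical helper.

```python
def binlog_key(coord):
    """Sort key for a 'file:pos' coordinate such as 'mysql-bin.000011:486'."""
    file_name, _, pos = coord.rpartition(":")
    sequence = int(file_name.rsplit(".", 1)[1])  # numeric suffix of the binlog file
    return (sequence, int(pos))


# Coordinates each slave had received, taken from the failover log of this run.
received = {
    "172.16.120.11:3358": "mysql-bin.000011:194",
    "172.16.120.12:3358": "mysql-bin.000011:486",
}

latest = max(received, key=lambda host: binlog_key(received[host]))
print(latest)  # 172.16.120.12:3358
```

The slave with the highest coordinate is the one that has received the most events from the dead master, which is why 172.16.120.12 ends up as the new master here even though both slaves are marked candidate_master.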

Conclusion: failover is triggered and succeeds

Sat Oct 10 11:22:12 2020 - [warning] Got error on MySQL connect ping: DBI connect(';host=172.16.120.10;port=3358;mysql_connect_timeout=1','mha',...) failed: Can't connect to MySQL server on '172.16.120.10' (111) at /usr/local/share/perl5/MHA/HealthCheck.pm line 98.
2003 (Can't connect to MySQL server on '172.16.120.10' (111))
Sat Oct 10 11:22:12 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12 --user=root --master_host=172.16.120.10 --master_ip=172.16.120.10 --master_port=3358 --master_user=mha --master_password=xxx --ping_type=CONNECT
Sat Oct 10 11:22:12 2020 - [info] Executing SSH check script: exit 0
Sat Oct 10 11:22:12 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
Monitoring server 172.16.120.11 is reachable, Master is not reachable from 172.16.120.11. OK.
Monitoring server 172.16.120.12 is reachable, Master is not reachable from 172.16.120.12. OK.
Sat Oct 10 11:22:12 2020 - [info] Master is not reachable from all other monitoring servers. Failover should start.
Sat Oct 10 11:22:15 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (111))
Sat Oct 10 11:22:15 2020 - [warning] Connection failed 2 time(s)..
Sat Oct 10 11:22:18 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (111))
Sat Oct 10 11:22:18 2020 - [warning] Connection failed 3 time(s)..
Sat Oct 10 11:22:21 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (111))
Sat Oct 10 11:22:21 2020 - [warning] Connection failed 4 time(s)..
Sat Oct 10 11:22:21 2020 - [warning] Master is not reachable from health checker!
Sat Oct 10 11:22:21 2020 - [warning] Master 172.16.120.10(172.16.120.10:3358) is not reachable!
Sat Oct 10 11:22:21 2020 - [warning] SSH is reachable.
Sat Oct 10 11:22:21 2020 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha/conf/masterha_default.cnf and /etc/masterha/conf/cls_new.cnf again, and trying to connect to all servers to check server status..
Sat Oct 10 11:22:21 2020 - [info] Reading default configuration from /etc/masterha/conf/masterha_default.cnf..
Sat Oct 10 11:22:21 2020 - [info] Reading application default configuration from /etc/masterha/conf/cls_new.cnf..
Sat Oct 10 11:22:21 2020 - [info] Reading server configuration from /etc/masterha/conf/cls_new.cnf..
Sat Oct 10 11:22:22 2020 - [info] GTID failover mode = 1
Sat Oct 10 11:22:22 2020 - [info] Dead Servers:
Sat Oct 10 11:22:22 2020 - [info] 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 11:22:22 2020 - [info] Alive Servers:
Sat Oct 10 11:22:22 2020 - [info] 172.16.120.11(172.16.120.11:3358)
Sat Oct 10 11:22:22 2020 - [info] 172.16.120.12(172.16.120.12:3358)
Sat Oct 10 11:22:22 2020 - [info] Alive Slaves:
Sat Oct 10 11:22:22 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 11:22:22 2020 - [info] GTID ON
Sat Oct 10 11:22:22 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 11:22:22 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 11:22:22 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 11:22:22 2020 - [info] GTID ON
Sat Oct 10 11:22:22 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 11:22:22 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 11:22:22 2020 - [info] Checking slave configurations..
Sat Oct 10 11:22:22 2020 - [info] Checking replication filtering settings..
Sat Oct 10 11:22:22 2020 - [info] Replication filtering check ok.
Sat Oct 10 11:22:22 2020 - [info] Master is down!
Sat Oct 10 11:22:22 2020 - [info] Terminating monitoring script.
Sat Oct 10 11:22:22 2020 - [info] Got exit code 20 (Master dead).
Sat Oct 10 11:22:22 2020 - [info] MHA::MasterFailover version 0.58.
Sat Oct 10 11:22:22 2020 - [info] Starting master failover.
Sat Oct 10 11:22:22 2020 - [info]
Sat Oct 10 11:22:22 2020 - [info] * Phase 1: Configuration Check Phase..
Sat Oct 10 11:22:22 2020 - [info]
Sat Oct 10 11:22:23 2020 - [info] GTID failover mode = 1
Sat Oct 10 11:22:23 2020 - [info] Dead Servers:
Sat Oct 10 11:22:23 2020 - [info] 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 11:22:23 2020 - [info] Checking master reachability via MySQL(double check)...
Sat Oct 10 11:22:23 2020 - [info] ok.
Sat Oct 10 11:22:23 2020 - [info] Alive Servers:
Sat Oct 10 11:22:23 2020 - [info] 172.16.120.11(172.16.120.11:3358)
Sat Oct 10 11:22:23 2020 - [info] 172.16.120.12(172.16.120.12:3358)
Sat Oct 10 11:22:23 2020 - [info] Alive Slaves:
Sat Oct 10 11:22:23 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 11:22:23 2020 - [info] GTID ON
Sat Oct 10 11:22:23 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 11:22:23 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 11:22:23 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 11:22:23 2020 - [info] GTID ON
Sat Oct 10 11:22:23 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 11:22:23 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 11:22:23 2020 - [info] Starting GTID based failover.
Sat Oct 10 11:22:23 2020 - [info]
Sat Oct 10 11:22:23 2020 - [info] ** Phase 1: Configuration Check Phase completed.
Sat Oct 10 11:22:23 2020 - [info]
Sat Oct 10 11:22:23 2020 - [info] * Phase 2: Dead Master Shutdown Phase..
Sat Oct 10 11:22:23 2020 - [info]
Sat Oct 10 11:22:23 2020 - [info] Forcing shutdown so that applications never connect to the current master..
Sat Oct 10 11:22:23 2020 - [info] Executing master IP deactivation script:
Sat Oct 10 11:22:23 2020 - [info] /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358 --command=stopssh --ssh_user=root
Disabling the VIP on old master: 172.16.120.10
Fake!!! 原主库 rpl_semi_sync_master_enabled=0 rpl_semi_sync_slave_enabled=1
Sat Oct 10 11:22:23 2020 - [info] done.
Sat Oct 10 11:22:23 2020 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Sat Oct 10 11:22:23 2020 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Sat Oct 10 11:22:23 2020 - [info]
Sat Oct 10 11:22:23 2020 - [info] * Phase 3: Master Recovery Phase..
Sat Oct 10 11:22:23 2020 - [info]
Sat Oct 10 11:22:23 2020 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Sat Oct 10 11:22:23 2020 - [info]
Sat Oct 10 11:22:23 2020 - [info] The latest binary log file/position on all slaves is mysql-bin.000011:486
Sat Oct 10 11:22:23 2020 - [info] Retrieved Gtid Set: 44a4ea53-fcad-11ea-bd16-0050563b7b42:20042-20531
Sat Oct 10 11:22:23 2020 - [info] Latest slaves (Slaves that received relay log files to the latest):
Sat Oct 10 11:22:23 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 11:22:23 2020 - [info] GTID ON
Sat Oct 10 11:22:23 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 11:22:23 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 11:22:23 2020 - [info] The oldest binary log file/position on all slaves is mysql-bin.000011:194
Sat Oct 10 11:22:23 2020 - [info] Retrieved Gtid Set: 44a4ea53-fcad-11ea-bd16-0050563b7b42:20143-20530
Sat Oct 10 11:22:23 2020 - [info] Oldest slaves:
Sat Oct 10 11:22:23 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 11:22:23 2020 - [info] GTID ON
Sat Oct 10 11:22:23 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 11:22:23 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 11:22:23 2020 - [info]
Sat Oct 10 11:22:23 2020 - [info] * Phase 3.3: Determining New Master Phase..
Sat Oct 10 11:22:23 2020 - [info]
Sat Oct 10 11:22:23 2020 - [info] Searching new master from slaves..
Sat Oct 10 11:22:23 2020 - [info] Candidate masters from the configuration file:
Sat Oct 10 11:22:23 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 11:22:23 2020 - [info] GTID ON
Sat Oct 10 11:22:23 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 11:22:23 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 11:22:23 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 11:22:23 2020 - [info] GTID ON
Sat Oct 10 11:22:23 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 11:22:23 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 11:22:23 2020 - [info] Non-candidate masters:
Sat Oct 10 11:22:23 2020 - [info] Searching from candidate_master slaves which have received the latest relay log events..
Sat Oct 10 11:22:23 2020 - [info] New master is 172.16.120.12(172.16.120.12:3358)
Sat Oct 10 11:22:23 2020 - [info] Starting master failover..
Sat Oct 10 11:22:23 2020 - [info]
From:
172.16.120.10(172.16.120.10:3358) (current master)
+--172.16.120.11(172.16.120.11:3358)
+--172.16.120.12(172.16.120.12:3358)

To:
172.16.120.12(172.16.120.12:3358) (new master)
+--172.16.120.11(172.16.120.11:3358)
Sat Oct 10 11:22:23 2020 - [info]
Sat Oct 10 11:22:23 2020 - [info] * Phase 3.3: New Master Recovery Phase..
Sat Oct 10 11:22:23 2020 - [info]
Sat Oct 10 11:22:23 2020 - [info] Waiting all logs to be applied..
Sat Oct 10 11:22:23 2020 - [info] done.
Sat Oct 10 11:22:23 2020 - [info] Getting new master's binlog name and position..
Sat Oct 10 11:22:23 2020 - [info] mysql-bin.000007:3182161
Sat Oct 10 11:22:23 2020 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='172.16.120.12', MASTER_PORT=3358, MASTER_AUTO_POSITION=1, MASTER_USER='repler', MASTER_PASSWORD='xxx';
Sat Oct 10 11:22:23 2020 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: mysql-bin.000007, 3182161, 44a4ea53-fcad-11ea-bd16-0050563b7b42:1-20531,
45d1f02a-fcad-11ea-8a44-0050562f2198:1-27
Sat Oct 10 11:22:23 2020 - [info] Executing master IP activate script:
Sat Oct 10 11:22:23 2020 - [info] /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --command=start --ssh_user=root --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358 --new_master_host=172.16.120.12 --new_master_ip=172.16.120.12 --new_master_port=3358 --new_master_user='mha' --new_master_password=xxx
Enabling the VIP - 172.16.120.128 on the new master - 172.16.120.12
Fake!!! 新主库 rpl_semi_sync_master_enabled=1 rpl_semi_sync_slave_enabled=0
Set read_only=0 on the new master.
Creating app user on the new master..
Sat Oct 10 11:22:24 2020 - [info] OK.
Sat Oct 10 11:22:24 2020 - [info] ** Finished master recovery successfully.
Sat Oct 10 11:22:24 2020 - [info] * Phase 3: Master Recovery Phase completed.
Sat Oct 10 11:22:24 2020 - [info]
Sat Oct 10 11:22:24 2020 - [info] * Phase 4: Slaves Recovery Phase..
Sat Oct 10 11:22:24 2020 - [info]
Sat Oct 10 11:22:24 2020 - [info]
Sat Oct 10 11:22:24 2020 - [info] * Phase 4.1: Starting Slaves in parallel..
Sat Oct 10 11:22:24 2020 - [info]
Sat Oct 10 11:22:24 2020 - [info] -- Slave recovery on host 172.16.120.11(172.16.120.11:3358) started, pid: 77208. Check tmp log /masterha/cls_new//172.16.120.11_3358_20201010112222.log if it takes time..
Sat Oct 10 11:22:25 2020 - [info]
Sat Oct 10 11:22:25 2020 - [info] Log messages from 172.16.120.11 ...
Sat Oct 10 11:22:25 2020 - [info]
Sat Oct 10 11:22:24 2020 - [info] Resetting slave 172.16.120.11(172.16.120.11:3358) and starting replication from the new master 172.16.120.12(172.16.120.12:3358)..
Sat Oct 10 11:22:24 2020 - [info] Executed CHANGE MASTER.
Sat Oct 10 11:22:24 2020 - [info] Slave started.
Sat Oct 10 11:22:24 2020 - [info] gtid_wait(44a4ea53-fcad-11ea-bd16-0050563b7b42:1-20531,
45d1f02a-fcad-11ea-8a44-0050562f2198:1-27) completed on 172.16.120.11(172.16.120.11:3358). Executed 2 events.
Sat Oct 10 11:22:25 2020 - [info] End of log messages from 172.16.120.11.
Sat Oct 10 11:22:25 2020 - [info] -- Slave on host 172.16.120.11(172.16.120.11:3358) started.
Sat Oct 10 11:22:25 2020 - [info] All new slave servers recovered successfully.
Sat Oct 10 11:22:25 2020 - [info]
Sat Oct 10 11:22:25 2020 - [info] * Phase 5: New master cleanup phase..
Sat Oct 10 11:22:25 2020 - [info]
Sat Oct 10 11:22:25 2020 - [info] Resetting slave info on the new master..
Sat Oct 10 11:22:25 2020 - [info] 172.16.120.12: Resetting slave info succeeded.
Sat Oct 10 11:22:25 2020 - [info] Master failover to 172.16.120.12(172.16.120.12:3358) completed successfully.
Sat Oct 10 11:22:25 2020 - [info]

----- Failover Report -----

cls_new: MySQL Master failover 172.16.120.10(172.16.120.10:3358) to 172.16.120.12(172.16.120.12:3358) succeeded

Master 172.16.120.10(172.16.120.10:3358) is down!

Check MHA Manager logs at centos-4:/masterha/cls_new/manager.log for details.

Started automated(non-interactive) failover.
Invalidated master IP address on 172.16.120.10(172.16.120.10:3358)
Selected 172.16.120.12(172.16.120.12:3358) as a new master.
172.16.120.12(172.16.120.12:3358): OK: Applying all logs succeeded.
172.16.120.12(172.16.120.12:3358): OK: Activated master IP address.
172.16.120.11(172.16.120.11:3358): OK: Slave started, replicating from 172.16.120.12(172.16.120.12:3358)
172.16.120.12(172.16.120.12:3358): Resetting slave info succeeded.
Master failover to 172.16.120.12(172.16.120.12:3358) completed successfully.
Sat Oct 10 11:22:25 2020 - [info] Sending mail..
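
The manager log above is long, so a small parser can help when comparing runs. The following is a minimal sketch, not part of MHA: it pulls the phase start markers, the elected new master, and the final success line out of a log. The function name `summarize` and the regular expressions are assumptions based only on the log lines shown here.

```python
import re

# Sample lines copied from the manager log above.
LOG = """\
Sat Oct 10 11:22:23 2020 - [info] * Phase 3: Master Recovery Phase..
Sat Oct 10 11:22:23 2020 - [info] New master is 172.16.120.12(172.16.120.12:3358)
Sat Oct 10 11:22:24 2020 - [info] * Phase 4: Slaves Recovery Phase..
Sat Oct 10 11:22:25 2020 - [info] * Phase 5: New master cleanup phase..
Sat Oct 10 11:22:25 2020 - [info] Master failover to 172.16.120.12(172.16.120.12:3358) completed successfully.
"""

# Patterns inferred from the log text above (assumptions, not an MHA API).
PHASE_RE = re.compile(r"\* Phase (\d+(?:\.\d+)?): (.+?)\.\.$")
NEW_MASTER_RE = re.compile(r"New master is (\S+?)\(")
DONE_RE = re.compile(r"Master failover to \S+ completed successfully\.$")


def summarize(log_text):
    """Return (phases, new_master, succeeded) for one failover run."""
    phases, new_master, succeeded = [], None, False
    for line in log_text.splitlines():
        m = PHASE_RE.search(line)
        if m:
            # Only phase *start* lines end with "..", so completions are skipped.
            phases.append((m.group(1), m.group(2)))
            continue
        m = NEW_MASTER_RE.search(line)
        if m:
            new_master = m.group(1)
        if DONE_RE.search(line):
            succeeded = True
    return phases, new_master, succeeded


phases, new_master, succeeded = summarize(LOG)
print(phases, new_master, succeeded)
```

Running it against the full manager.log instead of the sample should work the same way, as long as the log format matches the lines shown here.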

slave-1 was repointed (CHANGE MASTER) to the new master, slave-2, as expected

root@localhost 11:21:44 [dbms_monitor]> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 172.16.120.12
Master_User: repler
Master_Port: 3358
Connect_Retry: 1
Master_Log_File: mysql-bin.000007
Read_Master_Log_Pos: 3182161
Relay_Log_File: mysql-relay-bin.000002
Relay_Log_Pos: 681
Relay_Master_Log_File: mysql-bin.000007
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 3182161
Relay_Log_Space: 888
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 120123358
Master_UUID: 45e70f96-fcad-11ea-a2f0-0050563108d2
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: 44a4ea53-fcad-11ea-bd16-0050563b7b42:20531
Executed_Gtid_Set: 44a4ea53-fcad-11ea-bd16-0050563b7b42:1-20531,
45d1f02a-fcad-11ea-8a44-0050562f2198:1-27
Auto_Position: 1
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
1 row in set (0.00 sec)

root@localhost 11:24:39 [dbms_monitor]> select * from monitor_delay;
+----+---------------------+
| id | ctime |
+----+---------------------+
| 1 | 2020-10-10 11:21:39 |
+----+---------------------+
1 row in set (0.00 sec)

ping_type=INSERT

After starting the manager, stop the io_thread on slave-1

Sat Oct 10 11:29:28 2020 - [info] MHA::MasterMonitor version 0.58.
Sat Oct 10 11:29:30 2020 - [info] GTID failover mode = 1
Sat Oct 10 11:29:30 2020 - [info] Dead Servers:
Sat Oct 10 11:29:30 2020 - [info] Alive Servers:
Sat Oct 10 11:29:30 2020 - [info] 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 11:29:30 2020 - [info] 172.16.120.11(172.16.120.11:3358)
Sat Oct 10 11:29:30 2020 - [info] 172.16.120.12(172.16.120.12:3358)
Sat Oct 10 11:29:30 2020 - [info] Alive Slaves:
Sat Oct 10 11:29:30 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 11:29:30 2020 - [info] GTID ON
Sat Oct 10 11:29:30 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 11:29:30 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 11:29:30 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 11:29:30 2020 - [info] GTID ON
Sat Oct 10 11:29:30 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 11:29:30 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 11:29:30 2020 - [info] Current Alive Master: 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 11:29:30 2020 - [info] Checking slave configurations..
Sat Oct 10 11:29:30 2020 - [info] Checking replication filtering settings..
Sat Oct 10 11:29:30 2020 - [info] binlog_do_db= , binlog_ignore_db=
Sat Oct 10 11:29:30 2020 - [info] Replication filtering check ok.
Sat Oct 10 11:29:30 2020 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Sat Oct 10 11:29:30 2020 - [info] Checking SSH publickey authentication settings on the current master..
Sat Oct 10 11:29:30 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
Sat Oct 10 11:29:30 2020 - [info]
172.16.120.10(172.16.120.10:3358) (current master)
+--172.16.120.11(172.16.120.11:3358)
+--172.16.120.12(172.16.120.12:3358)

Sat Oct 10 11:29:30 2020 - [info] Checking master_ip_failover_script status:
Sat Oct 10 11:29:30 2020 - [info] /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --command=status --ssh_user=root --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358
Sat Oct 10 11:29:30 2020 - [info] OK.
Sat Oct 10 11:29:30 2020 - [warning] shutdown_script is not defined.
Sat Oct 10 11:29:30 2020 - [info] Set master ping interval 3 seconds.
Sat Oct 10 11:29:30 2020 - [info] Set secondary check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12
Sat Oct 10 11:29:30 2020 - [info] Starting ping health check on 172.16.120.10(172.16.120.10:3358)..
Sat Oct 10 11:29:30 2020 - [info] Ping(INSERT) succeeded, waiting until MySQL doesn't respond..

Stop the io_thread on slave-1

root@localhost 11:30:30 [dbms_monitor]> stop slave io_thread;
Query OK, 0 rows affected (0.00 sec)

root@localhost 11:30:43 [dbms_monitor]> pager cat - | grep -E 'Master_Log_File|Relay_Master_Log_File|Read_Master_Log_Pos|Exec_Master_Log_Pos|Slave_IO_Running|Slave_SQL_Running|Slave_SQL_Running_State|Last|Relay_Log_File|Relay_Log_Pos'
PAGER set to 'cat - | grep -E 'Master_Log_File|Relay_Master_Log_File|Read_Master_Log_Pos|Exec_Master_Log_Pos|Slave_IO_Running|Slave_SQL_Running|Slave_SQL_Running_State|Last|Relay_Log_File|Relay_Log_Pos''
root@localhost 11:30:45 [dbms_monitor]> show slave status\G
Master_Log_File: mysql-bin.000012
Read_Master_Log_Pos: 18307
Relay_Log_File: mysql-relay-bin.000002
Relay_Log_Pos: 18480
Relay_Master_Log_File: mysql-bin.000012
Slave_IO_Running: No
Slave_SQL_Running: Yes
Last_Errno: 0
Last_Error:
Exec_Master_Log_Pos: 18307
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
1 row in set (0.00 sec)
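
The filtered output above shows a cleanly stopped io_thread: `Slave_IO_Running: No` with `Last_IO_Errno: 0`. A minimal sketch of how that state could be told apart from an io_thread error, assuming the same `Key: value` text format. The helper names `parse_status` and `classify_io_thread` and the returned labels are hypothetical.

```python
# Filtered `show slave status\G` output copied from the session above.
STATUS = """\
Master_Log_File: mysql-bin.000012
Read_Master_Log_Pos: 18307
Relay_Master_Log_File: mysql-bin.000012
Slave_IO_Running: No
Slave_SQL_Running: Yes
Exec_Master_Log_Pos: 18307
Last_IO_Errno: 0
Last_SQL_Errno: 0
"""


def parse_status(text):
    """Turn `Key: value` lines into a dict; empty values become ''."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def classify_io_thread(fields):
    """Hypothetical labels: 'running', 'error', 'connecting' or 'stopped'."""
    state = fields.get("Slave_IO_Running")
    if state == "Yes":
        return "running"
    if fields.get("Last_IO_Errno", "0") != "0":
        return "error"
    if state == "Connecting":
        return "connecting"
    return "stopped"


fields = parse_status(STATUS)
print(classify_io_thread(fields))
# True when the SQL thread has applied everything the IO thread fetched.
print(fields["Read_Master_Log_Pos"] == fields["Exec_Master_Log_Pos"])
```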

After the io_thread is stopped, the manager keeps running normally

Shut down the master

root@localhost 11:31:40 [dbms_monitor]> insert into monitor_delay values(2,now());
Query OK, 1 row affected (0.00 sec)

root@localhost 11:31:45 [dbms_monitor]> select * from monitor_delay;
+----+---------------------+
| id | ctime |
+----+---------------------+
| 1 | 2020-10-10 11:21:39 |
| 2 | 2020-10-10 11:31:45 |
+----+---------------------+
2 rows in set (0.00 sec)

root@localhost 11:31:51 [dbms_monitor]> shutdown;
Query OK, 0 rows affected (0.00 sec)

slave-1

root@localhost 11:30:45 [dbms_monitor]> select * from monitor_delay;
+----+---------------------+
| id | ctime |
+----+---------------------+
| 1 | 2020-10-10 11:21:39 |
+----+---------------------+
1 row in set (0.00 sec)

slave-2

root@localhost 11:27:07 [dbms_monitor]> select * from monitor_delay;
+----+---------------------+
| id | ctime |
+----+---------------------+
| 1 | 2020-10-10 11:21:39 |
| 2 | 2020-10-10 11:31:45 |
+----+---------------------+
2 rows in set (0.00 sec)

Conclusion: failover is triggered and succeeds

Sat Oct 10 11:33:09 2020 - [warning] SSH is reachable.
Sat Oct 10 11:33:09 2020 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha/conf/masterha_default.cnf and /etc/masterha/conf/cls_new.cnf again, and trying to connect to all servers to check server status..
Sat Oct 10 11:33:09 2020 - [info] Reading default configuration from /etc/masterha/conf/masterha_default.cnf..
Sat Oct 10 11:33:09 2020 - [info] Reading application default configuration from /etc/masterha/conf/cls_new.cnf..
Sat Oct 10 11:33:09 2020 - [info] Reading server configuration from /etc/masterha/conf/cls_new.cnf..
Sat Oct 10 11:33:10 2020 - [info] GTID failover mode = 1
Sat Oct 10 11:33:10 2020 - [info] Dead Servers:
Sat Oct 10 11:33:10 2020 - [info] 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 11:33:10 2020 - [info] Alive Servers:
Sat Oct 10 11:33:10 2020 - [info] 172.16.120.11(172.16.120.11:3358)
Sat Oct 10 11:33:10 2020 - [info] 172.16.120.12(172.16.120.12:3358)
Sat Oct 10 11:33:10 2020 - [info] Alive Slaves:
Sat Oct 10 11:33:10 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 11:33:10 2020 - [info] GTID ON
Sat Oct 10 11:33:10 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 11:33:10 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 11:33:10 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 11:33:10 2020 - [info] GTID ON
Sat Oct 10 11:33:10 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 11:33:10 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 11:33:10 2020 - [info] Checking slave configurations..
Sat Oct 10 11:33:10 2020 - [info] Checking replication filtering settings..
Sat Oct 10 11:33:10 2020 - [info] Replication filtering check ok.
Sat Oct 10 11:33:10 2020 - [info] Master is down!
Sat Oct 10 11:33:10 2020 - [info] Terminating monitoring script.
Sat Oct 10 11:33:10 2020 - [info] Got exit code 20 (Master dead).
Sat Oct 10 11:33:10 2020 - [info] MHA::MasterFailover version 0.58.
Sat Oct 10 11:33:10 2020 - [info] Starting master failover.
Sat Oct 10 11:33:10 2020 - [info]
Sat Oct 10 11:33:10 2020 - [info] * Phase 1: Configuration Check Phase..
Sat Oct 10 11:33:10 2020 - [info]
Sat Oct 10 11:33:11 2020 - [info] GTID failover mode = 1
Sat Oct 10 11:33:11 2020 - [info] Dead Servers:
Sat Oct 10 11:33:11 2020 - [info] 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 11:33:11 2020 - [info] Alive Servers:
Sat Oct 10 11:33:11 2020 - [info] 172.16.120.11(172.16.120.11:3358)
Sat Oct 10 11:33:11 2020 - [info] 172.16.120.12(172.16.120.12:3358)
Sat Oct 10 11:33:11 2020 - [info] Alive Slaves:
Sat Oct 10 11:33:11 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 11:33:11 2020 - [info] GTID ON
Sat Oct 10 11:33:11 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 11:33:11 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 11:33:11 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 11:33:11 2020 - [info] GTID ON
Sat Oct 10 11:33:11 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 11:33:11 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 11:33:11 2020 - [info] Starting GTID based failover.
Sat Oct 10 11:33:11 2020 - [info]
Sat Oct 10 11:33:11 2020 - [info] ** Phase 1: Configuration Check Phase completed.
Sat Oct 10 11:33:11 2020 - [info]
Sat Oct 10 11:33:11 2020 - [info] * Phase 2: Dead Master Shutdown Phase..
Sat Oct 10 11:33:11 2020 - [info]
Sat Oct 10 11:33:11 2020 - [info] Forcing shutdown so that applications never connect to the current master..
Sat Oct 10 11:33:11 2020 - [info] Executing master IP deactivation script:
Sat Oct 10 11:33:11 2020 - [info] /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358 --command=stopssh --ssh_user=root
Disabling the VIP on old master: 172.16.120.10
RTNETLINK answers: Cannot assign requested address
Fake!!! 原主库 rpl_semi_sync_master_enabled=0 rpl_semi_sync_slave_enabled=1
Sat Oct 10 11:33:11 2020 - [info] done.
Sat Oct 10 11:33:11 2020 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Sat Oct 10 11:33:11 2020 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Sat Oct 10 11:33:11 2020 - [info]
Sat Oct 10 11:33:11 2020 - [info] * Phase 3: Master Recovery Phase..
Sat Oct 10 11:33:11 2020 - [info]
Sat Oct 10 11:33:11 2020 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Sat Oct 10 11:33:11 2020 - [info]
Sat Oct 10 11:33:11 2020 - [info] The latest binary log file/position on all slaves is mysql-bin.000012:50590
Sat Oct 10 11:33:11 2020 - [info] Retrieved Gtid Set: 44a4ea53-fcad-11ea-bd16-0050563b7b42:20532-20745
Sat Oct 10 11:33:11 2020 - [info] Latest slaves (Slaves that received relay log files to the latest):
Sat Oct 10 11:33:11 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 11:33:11 2020 - [info] GTID ON
Sat Oct 10 11:33:11 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 11:33:11 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 11:33:11 2020 - [info] The oldest binary log file/position on all slaves is mysql-bin.000012:18307
Sat Oct 10 11:33:11 2020 - [info] Retrieved Gtid Set: 44a4ea53-fcad-11ea-bd16-0050563b7b42:20532-20608
Sat Oct 10 11:33:11 2020 - [info] Oldest slaves:
Sat Oct 10 11:33:11 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 11:33:11 2020 - [info] GTID ON
Sat Oct 10 11:33:11 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 11:33:11 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 11:33:11 2020 - [info]
Sat Oct 10 11:33:11 2020 - [info] * Phase 3.3: Determining New Master Phase..
Sat Oct 10 11:33:11 2020 - [info]
Sat Oct 10 11:33:11 2020 - [info] Searching new master from slaves..
Sat Oct 10 11:33:11 2020 - [info] Candidate masters from the configuration file:
Sat Oct 10 11:33:11 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 11:33:11 2020 - [info] GTID ON
Sat Oct 10 11:33:11 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 11:33:11 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 11:33:11 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 11:33:11 2020 - [info] GTID ON
Sat Oct 10 11:33:11 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 11:33:11 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 11:33:11 2020 - [info] Non-candidate masters:
Sat Oct 10 11:33:11 2020 - [info] Searching from candidate_master slaves which have received the latest relay log events..
Sat Oct 10 11:33:11 2020 - [info] New master is 172.16.120.12(172.16.120.12:3358)
Sat Oct 10 11:33:11 2020 - [info] Starting master failover..
Sat Oct 10 11:33:11 2020 - [info]
From:
172.16.120.10(172.16.120.10:3358) (current master)
+--172.16.120.11(172.16.120.11:3358)
+--172.16.120.12(172.16.120.12:3358)

To:
172.16.120.12(172.16.120.12:3358) (new master)
+--172.16.120.11(172.16.120.11:3358)
Sat Oct 10 11:33:11 2020 - [info]
Sat Oct 10 11:33:11 2020 - [info] * Phase 3.3: New Master Recovery Phase..
Sat Oct 10 11:33:11 2020 - [info]
Sat Oct 10 11:33:11 2020 - [info] Waiting all logs to be applied..
Sat Oct 10 11:33:11 2020 - [info] done.
Sat Oct 10 11:33:11 2020 - [info] Getting new master's binlog name and position..
Sat Oct 10 11:33:11 2020 - [info] mysql-bin.000007:3232182
Sat Oct 10 11:33:11 2020 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='172.16.120.12', MASTER_PORT=3358, MASTER_AUTO_POSITION=1, MASTER_USER='repler', MASTER_PASSWORD='xxx';
Sat Oct 10 11:33:11 2020 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: mysql-bin.000007, 3232182, 44a4ea53-fcad-11ea-bd16-0050563b7b42:1-20745,
45d1f02a-fcad-11ea-8a44-0050562f2198:1-27
Sat Oct 10 11:33:11 2020 - [info] Executing master IP activate script:
Sat Oct 10 11:33:11 2020 - [info] /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --command=start --ssh_user=root --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358 --new_master_host=172.16.120.12 --new_master_ip=172.16.120.12 --new_master_port=3358 --new_master_user='mha' --new_master_password=xxx
Enabling the VIP - 172.16.120.128 on the new master - 172.16.120.12
RTNETLINK answers: File exists
Fake!!! 新主库 rpl_semi_sync_master_enabled=1 rpl_semi_sync_slave_enabled=0
Set read_only=0 on the new master.
Creating app user on the new master..
Sat Oct 10 11:33:11 2020 - [info] OK.
Sat Oct 10 11:33:11 2020 - [info] ** Finished master recovery successfully.
Sat Oct 10 11:33:11 2020 - [info] * Phase 3: Master Recovery Phase completed.
Sat Oct 10 11:33:11 2020 - [info]
Sat Oct 10 11:33:11 2020 - [info] * Phase 4: Slaves Recovery Phase..
Sat Oct 10 11:33:11 2020 - [info]
Sat Oct 10 11:33:11 2020 - [info]
Sat Oct 10 11:33:11 2020 - [info] * Phase 4.1: Starting Slaves in parallel..
Sat Oct 10 11:33:11 2020 - [info]
Sat Oct 10 11:33:11 2020 - [info] -- Slave recovery on host 172.16.120.11(172.16.120.11:3358) started, pid: 78319. Check tmp log /masterha/cls_new//172.16.120.11_3358_20201010113310.log if it takes time..
Sat Oct 10 11:33:12 2020 - [info]
Sat Oct 10 11:33:12 2020 - [info] Log messages from 172.16.120.11 ...
Sat Oct 10 11:33:12 2020 - [info]
Sat Oct 10 11:33:11 2020 - [info] Resetting slave 172.16.120.11(172.16.120.11:3358) and starting replication from the new master 172.16.120.12(172.16.120.12:3358)..
Sat Oct 10 11:33:11 2020 - [info] Executed CHANGE MASTER.
Sat Oct 10 11:33:11 2020 - [info] Slave started.
Sat Oct 10 11:33:12 2020 - [info] gtid_wait(44a4ea53-fcad-11ea-bd16-0050563b7b42:1-20745,
45d1f02a-fcad-11ea-8a44-0050562f2198:1-27) completed on 172.16.120.11(172.16.120.11:3358). Executed 2 events.
Sat Oct 10 11:33:12 2020 - [info] End of log messages from 172.16.120.11.
Sat Oct 10 11:33:12 2020 - [info] -- Slave on host 172.16.120.11(172.16.120.11:3358) started.
Sat Oct 10 11:33:12 2020 - [info] All new slave servers recovered successfully.
Sat Oct 10 11:33:12 2020 - [info]
Sat Oct 10 11:33:12 2020 - [info] * Phase 5: New master cleanup phase..
Sat Oct 10 11:33:12 2020 - [info]
Sat Oct 10 11:33:12 2020 - [info] Resetting slave info on the new master..
Sat Oct 10 11:33:13 2020 - [info] 172.16.120.12: Resetting slave info succeeded.
Sat Oct 10 11:33:13 2020 - [info] Master failover to 172.16.120.12(172.16.120.12:3358) completed successfully.
Sat Oct 10 11:33:13 2020 - [info]

----- Failover Report -----

cls_new: MySQL Master failover 172.16.120.10(172.16.120.10:3358) to 172.16.120.12(172.16.120.12:3358) succeeded

Master 172.16.120.10(172.16.120.10:3358) is down!

Check MHA Manager logs at centos-4:/masterha/cls_new/manager.log for details.

Started automated(non-interactive) failover.
Invalidated master IP address on 172.16.120.10(172.16.120.10:3358)
Selected 172.16.120.12(172.16.120.12:3358) as a new master.
172.16.120.12(172.16.120.12:3358): OK: Applying all logs succeeded.
172.16.120.12(172.16.120.12:3358): OK: Activated master IP address.
172.16.120.11(172.16.120.11:3358): OK: Slave started, replicating from 172.16.120.12(172.16.120.12:3358)
172.16.120.12(172.16.120.12:3358): Resetting slave info succeeded.
Master failover to 172.16.120.12(172.16.120.12:3358) completed successfully.
Sat Oct 10 11:33:13 2020 - [info] Sending mail..

slave-1 has been repointed (CHANGE MASTER) to slave-2

root@localhost 11:32:04 [dbms_monitor]> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 172.16.120.12
Master_User: repler
Master_Port: 3358
Connect_Retry: 1
Master_Log_File: mysql-bin.000007
Read_Master_Log_Pos: 3232182
Relay_Log_File: mysql-relay-bin.000002
Relay_Log_Pos: 32447
Relay_Master_Log_File: mysql-bin.000007
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 3232182
Relay_Log_Space: 32654
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 120123358
Master_UUID: 45e70f96-fcad-11ea-a2f0-0050563108d2
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: 44a4ea53-fcad-11ea-bd16-0050563b7b42:20609-20745
Executed_Gtid_Set: 44a4ea53-fcad-11ea-bd16-0050563b7b42:1-20745,
45d1f02a-fcad-11ea-8a44-0050562f2198:1-27
Auto_Position: 1
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
1 row in set (0.00 sec)

root@localhost 11:34:29 [dbms_monitor]> select * from monitor_delay;
+----+---------------------+
| id | ctime |
+----+---------------------+
| 1 | 2020-10-10 11:21:39 |
| 2 | 2020-10-10 11:31:45 |
+----+---------------------+
2 rows in set (0.00 sec)
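
In the failover log for this run, the latest slave (172.16.120.12) had retrieved GTIDs 20532-20745 and the oldest slave (172.16.120.11) only 20532-20608. After the failover, slave-1 shows `Retrieved_Gtid_Set: ...:20609-20745`, which is exactly the difference. A minimal sketch of that arithmetic, assuming single-interval sets from one server UUID (real GTID sets can hold several UUIDs and intervals). The helper names are hypothetical.

```python
UUID = "44a4ea53-fcad-11ea-bd16-0050563b7b42"
latest = f"{UUID}:20532-20745"   # Retrieved Gtid Set of 172.16.120.12 (latest slave)
oldest = f"{UUID}:20532-20608"   # Retrieved Gtid Set of 172.16.120.11 (oldest slave)


def parse_single_range(gtid_set):
    """Parse 'uuid:first-last' or 'uuid:n' (single-interval sets only)."""
    uuid, _, interval = gtid_set.partition(":")
    first, _, last = interval.partition("-")
    return uuid, int(first), int(last or first)


def missing_range(behind, ahead):
    """GTIDs present in `ahead` but not in `behind`, assuming a shared start."""
    uuid_b, first_b, last_b = parse_single_range(behind)
    uuid_a, first_a, last_a = parse_single_range(ahead)
    if uuid_b != uuid_a or first_b != first_a:
        raise ValueError("sketch only handles one UUID with a common start")
    if last_b >= last_a:
        return None  # nothing missing
    if last_b + 1 == last_a:
        return f"{uuid_a}:{last_a}"
    return f"{uuid_a}:{last_b + 1}-{last_a}"


gap = missing_range(oldest, latest)
_, gap_first, gap_last = parse_single_range(gap)
print(gap)
print(gap_last - gap_first + 1, "transactions fetched from the new master")
```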

[Test case] Master is down and a slave also has a problem, part 3 (io_thread error on some slaves)

The master goes down; before that, slave-1's io_thread had already hit an error

ping_type=CONNECT

After starting the manager, adjust the master's firewall to block access from slave-1

Sat Oct 10 11:58:40 2020 - [info] MHA::MasterMonitor version 0.58.
Sat Oct 10 11:58:41 2020 - [info] GTID failover mode = 1
Sat Oct 10 11:58:41 2020 - [info] Dead Servers:
Sat Oct 10 11:58:41 2020 - [info] Alive Servers:
Sat Oct 10 11:58:41 2020 - [info] 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 11:58:41 2020 - [info] 172.16.120.11(172.16.120.11:3358)
Sat Oct 10 11:58:41 2020 - [info] 172.16.120.12(172.16.120.12:3358)
Sat Oct 10 11:58:41 2020 - [info] Alive Slaves:
Sat Oct 10 11:58:41 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 11:58:41 2020 - [info] GTID ON
Sat Oct 10 11:58:41 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 11:58:41 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 11:58:41 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 11:58:41 2020 - [info] GTID ON
Sat Oct 10 11:58:41 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 11:58:41 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 11:58:41 2020 - [info] Current Alive Master: 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 11:58:41 2020 - [info] Checking slave configurations..
Sat Oct 10 11:58:41 2020 - [info] Checking replication filtering settings..
Sat Oct 10 11:58:41 2020 - [info] binlog_do_db= , binlog_ignore_db=
Sat Oct 10 11:58:41 2020 - [info] Replication filtering check ok.
Sat Oct 10 11:58:41 2020 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Sat Oct 10 11:58:41 2020 - [info] Checking SSH publickey authentication settings on the current master..
Sat Oct 10 11:58:41 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
Sat Oct 10 11:58:41 2020 - [info]
172.16.120.10(172.16.120.10:3358) (current master)
+--172.16.120.11(172.16.120.11:3358)
+--172.16.120.12(172.16.120.12:3358)

Sat Oct 10 11:58:41 2020 - [info] Checking master_ip_failover_script status:
Sat Oct 10 11:58:41 2020 - [info] /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --command=status --ssh_user=root --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358
Sat Oct 10 11:58:41 2020 - [info] OK.
Sat Oct 10 11:58:41 2020 - [warning] shutdown_script is not defined.
Sat Oct 10 11:58:41 2020 - [info] Set master ping interval 3 seconds.
Sat Oct 10 11:58:41 2020 - [info] Set secondary check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12
Sat Oct 10 11:58:41 2020 - [info] Starting ping health check on 172.16.120.10(172.16.120.10:3358)..
Sat Oct 10 11:58:41 2020 - [info] Ping(CONNECT) succeeded, waiting until MySQL doesn't respond..

Adjust the master's firewall to block access from slave-1
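# Run on the master (172.16.120.10): block new TCP connections from slave-1 (172.16.120.11)
IPTABLES="/sbin/iptables"
# Flush all existing rules in the filter table
$IPTABLES -F
# Keep ICMP (ping) working
$IPTABLES -A INPUT -p icmp --icmp-type any -j ACCEPT
# Allow TCP from the master itself, the MHA manager, and slave-2
$IPTABLES -A INPUT -p tcp -s 172.16.120.10 -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.13 -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.12 -j ACCEPT
# Drop connection-opening (SYN) packets from everyone else, i.e. slave-1
$IPTABLES -A INPUT -p tcp --syn -j DROP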
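
Because the last rule uses `--syn`, it only matches connection-opening packets. slave-1's already established replication connection therefore keeps working, and only a reconnect attempt is blocked. The sketch below models the first-match evaluation of the rules above to illustrate this. It is a simplified model, not iptables itself: the `Rule` class and `verdict` function are hypothetical, and the default chain policy of ACCEPT is an assumption, since the script does not set it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    proto: str               # "icmp" or "tcp"
    source: Optional[str]    # None matches any source address
    syn_only: bool           # True mirrors `--syn`: match only connection-opening packets
    target: str              # "ACCEPT" or "DROP"


# Mirrors the INPUT rules above, in order.
RULES = [
    Rule("icmp", None, False, "ACCEPT"),
    Rule("tcp", "172.16.120.10", False, "ACCEPT"),
    Rule("tcp", "172.16.120.13", False, "ACCEPT"),
    Rule("tcp", "172.16.120.12", False, "ACCEPT"),
    Rule("tcp", None, True, "DROP"),
]


def verdict(proto, source, is_syn, default_policy="ACCEPT"):
    """First matching rule wins; default_policy is an assumption (not set in the script)."""
    for rule in RULES:
        if rule.proto != proto:
            continue
        if rule.source is not None and rule.source != source:
            continue
        if rule.syn_only and not is_syn:
            continue
        return rule.target
    return default_policy


# New connection from slave-1: dropped by the final --syn rule.
print(verdict("tcp", "172.16.120.11", is_syn=True))
# Packet on slave-1's existing connection: no rule matches, falls to the policy.
print(verdict("tcp", "172.16.120.11", is_syn=False))
# New connection from slave-2: accepted by its source rule.
print(verdict("tcp", "172.16.120.12", is_syn=True))
```

Under these assumptions the firewall change alone does not break slave-1's replication; the existing connection has to be cut on the master before the io_thread fails to reconnect.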

On the master, kill slave-1's Binlog Dump thread so that slave-1's io_thread loses its connection
root@localhost 12:04:35 [dbms_monitor]> show processlist;
+-----+----------+---------------------+--------------+------------------+------+---------------------------------------------------------------+------------------+-----------+---------------+
| Id | User | Host | db | Command | Time | State | Info | Rows_sent | Rows_examined |
+-----+----------+---------------------+--------------+------------------+------+---------------------------------------------------------------+------------------+-----------+---------------+
| 2 | proxysql | 172.16.120.12:33384 | NULL | Sleep | 9 | | NULL | 0 | 0 |
| 3 | root | localhost | dbms_monitor | Query | 0 | starting | show processlist | 0 | 0 |
| 4 | proxysql | 172.16.120.10:34072 | NULL | Sleep | 4 | | NULL | 0 | 0 |
| 6 | proxysql | 172.16.120.11:35090 | NULL | Sleep | 2 | | NULL | 0 | 0 |
| 7 | repler | 172.16.120.11:35092 | NULL | Binlog Dump GTID | 485 | Master has sent all binlog to slave; waiting for more updates | NULL | 0 | 0 |
| 8 | repler | 172.16.120.12:33392 | NULL | Binlog Dump GTID | 472 | Master has sent all binlog to slave; waiting for more updates | NULL | 0 | 0 |
| 110 | proxysql | 172.16.120.10:34094 | NULL | Sleep | 4 | | NULL | 1 | 0 |
+-----+----------+---------------------+--------------+------------------+------+---------------------------------------------------------------+------------------+-----------+---------------+
7 rows in set (0.00 sec)

root@localhost 12:04:46 [dbms_monitor]> kill 7;
Query OK, 0 rows affected (0.00 sec)

root@localhost 12:05:20 [dbms_monitor]> insert into monitor_delay values(6,now());
Query OK, 1 row affected (0.00 sec)

slave-1

root@localhost 12:05:02 [dbms_monitor]> select * from monitor_delay;
+----+---------------------+
| id | ctime |
+----+---------------------+
| 1 | 2020-10-10 11:21:39 |
| 2 | 2020-10-10 11:31:45 |
| 3 | 2020-10-10 11:51:01 |
| 4 | 2020-10-10 11:59:15 |
| 5 | 2020-10-10 12:04:35 |
+----+---------------------+
5 rows in set (0.00 sec)


root@localhost 12:07:32 [dbms_monitor]> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Reconnecting after a failed master event read
Master_Host: 172.16.120.10
Master_User: repler
Master_Port: 3358
Connect_Retry: 1
Master_Log_File: mysql-bin.000014
Read_Master_Log_Pos: 778
Relay_Log_File: mysql-relay-bin.000003
Relay_Log_Pos: 605
Relay_Master_Log_File: mysql-bin.000014
Slave_IO_Running: Connecting
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 778
Relay_Log_Space: 1317
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 2003
Last_IO_Error: error reconnecting to master 'repler@172.16.120.10:3358' - retry-time: 1 retries: 2
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 120103358
Master_UUID: 44a4ea53-fcad-11ea-bd16-0050563b7b42
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp: 201010 12:06:57
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: 44a4ea53-fcad-11ea-bd16-0050563b7b42:20747-20748
Executed_Gtid_Set: 44a4ea53-fcad-11ea-bd16-0050563b7b42:1-20748,
45d1f02a-fcad-11ea-8a44-0050562f2198:1-27
Auto_Position: 1
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
1 row in set (0.00 sec)
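The state to look for in the output above is an IO thread that is no longer `Yes` (here `Connecting`, with `Last_IO_Errno: 2003`) while the SQL thread keeps running. The following is a minimal sketch for checking that condition from the vertical `SHOW SLAVE STATUS` text. The helper names `parse_slave_status` and `io_thread_broken` are hypothetical, and the sample is an excerpt of the output shown above.

```python
import re

# Field names in the vertical output consist of letters and underscores only.
FIELD = re.compile(r"[A-Za-z_]+")


def parse_slave_status(text):
    """Parse vertical SHOW SLAVE STATUS text into a dict of field -> value.

    Continuation lines of multi-line values (e.g. the second line of
    Executed_Gtid_Set) are ignored in this sketch.
    """
    status = {}
    for raw in text.splitlines():
        line = raw.strip()
        key, sep, value = line.partition(":")
        if sep and FIELD.fullmatch(key):
            status[key] = value.strip()
    return status


def io_thread_broken(status):
    """True when the IO thread is not running but the SQL thread still is."""
    return (status.get("Slave_IO_Running") != "Yes"
            and status.get("Slave_SQL_Running") == "Yes")


SAMPLE = """\
Slave_IO_State: Reconnecting after a failed master event read
Slave_IO_Running: Connecting
Slave_SQL_Running: Yes
Last_IO_Errno: 2003
Executed_Gtid_Set: 44a4ea53-fcad-11ea-bd16-0050563b7b42:1-20748,
45d1f02a-fcad-11ea-8a44-0050562f2198:1-27
"""

status = parse_slave_status(SAMPLE)
print(status["Slave_IO_Running"], status["Last_IO_Errno"])  # Connecting 2003
print(io_thread_broken(status))                             # True
```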

slave-2

root@localhost 12:04:42 [dbms_monitor]> select * from monitor_delay;
+----+---------------------+
| id | ctime |
+----+---------------------+
| 1 | 2020-10-10 11:21:39 |
| 2 | 2020-10-10 11:31:45 |
| 3 | 2020-10-10 11:51:01 |
| 4 | 2020-10-10 11:59:15 |
| 5 | 2020-10-10 12:04:35 |
| 6 | 2020-10-10 12:05:30 |
+----+---------------------+
6 rows in set (0.00 sec)
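At this point slave-2 has received row 6 and slave-1 has not: slave-1 stopped at `mysql-bin.000014:778`, while the manager log for this run reports the latest position on all slaves as `mysql-bin.000014:1070`. The sketch below shows, in simplified form, how the most advanced slave can be picked by received binlog file and position. It is not MHA's actual selection code, which also considers settings such as `candidate_master`; the function name `latest_slave` is hypothetical.

```python
def latest_slave(positions):
    """Return the host that has received the furthest master binlog position.

    positions maps host -> (Master_Log_File, Read_Master_Log_Pos).
    Files are compared by their numeric suffix, then by offset.
    """
    def sort_key(item):
        _host, (log_file, pos) = item
        return (int(log_file.rsplit(".", 1)[1]), pos)

    return max(positions.items(), key=sort_key)[0]


# Positions taken from this test run (slave status and manager log).
received = {
    "172.16.120.11": ("mysql-bin.000014", 778),
    "172.16.120.12": ("mysql-bin.000014", 1070),
}

print(latest_slave(received))  # 172.16.120.12
```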

Shut down the master

root@localhost 12:21:24 [dbms_monitor]> shutdown;
Query OK, 0 rows affected (0.00 sec)

Conclusion: failover is triggered and succeeds

Sat Oct 10 12:11:18 2020 - [warning] Got error on MySQL connect ping: DBI connect(';host=172.16.120.10;port=3358;mysql_connect_timeout=1','mha',...) failed: Can't connect to MySQL server on '172.16.120.10' (111) at /usr/local/share/perl5/MHA/HealthCheck.pm line 98.
2003 (Can't connect to MySQL server on '172.16.120.10' (111))
Sat Oct 10 12:11:18 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12 --user=root --master_host=172.16.120.10 --master_ip=172.16.120.10 --master_port=3358 --master_user=mha --master_password=xxx --ping_type=CONNECT
Sat Oct 10 12:11:18 2020 - [info] Executing SSH check script: exit 0
Sat Oct 10 12:11:19 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
Sat Oct 10 12:11:21 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (111))
Sat Oct 10 12:11:21 2020 - [warning] Connection failed 2 time(s)..
Monitoring server 172.16.120.11 is reachable, Master is not reachable from 172.16.120.11. OK.
Monitoring server 172.16.120.12 is reachable, Master is not reachable from 172.16.120.12. OK.
Sat Oct 10 12:11:24 2020 - [info] Master is not reachable from all other monitoring servers. Failover should start.
Sat Oct 10 12:11:24 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (111))
Sat Oct 10 12:11:24 2020 - [warning] Connection failed 3 time(s)..
Sat Oct 10 12:11:27 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (111))
Sat Oct 10 12:11:27 2020 - [warning] Connection failed 4 time(s)..
Sat Oct 10 12:11:27 2020 - [warning] Master is not reachable from health checker!
Sat Oct 10 12:11:27 2020 - [warning] Master 172.16.120.10(172.16.120.10:3358) is not reachable!
Sat Oct 10 12:11:27 2020 - [warning] SSH is reachable.
Sat Oct 10 12:11:27 2020 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha/conf/masterha_default.cnf and /etc/masterha/conf/cls_new.cnf again, and trying to connect to all servers to check server status..
Sat Oct 10 12:11:27 2020 - [info] Reading default configuration from /etc/masterha/conf/masterha_default.cnf..
Sat Oct 10 12:11:27 2020 - [info] Reading application default configuration from /etc/masterha/conf/cls_new.cnf..
Sat Oct 10 12:11:27 2020 - [info] Reading server configuration from /etc/masterha/conf/cls_new.cnf..
Sat Oct 10 12:11:28 2020 - [info] GTID failover mode = 1
Sat Oct 10 12:11:28 2020 - [info] Dead Servers:
Sat Oct 10 12:11:28 2020 - [info] 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 12:11:28 2020 - [info] Alive Servers:
Sat Oct 10 12:11:28 2020 - [info] 172.16.120.11(172.16.120.11:3358)
Sat Oct 10 12:11:28 2020 - [info] 172.16.120.12(172.16.120.12:3358)
Sat Oct 10 12:11:28 2020 - [info] Alive Slaves:
Sat Oct 10 12:11:28 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 12:11:28 2020 - [info] GTID ON
Sat Oct 10 12:11:28 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 12:11:28 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 12:11:28 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 12:11:28 2020 - [info] GTID ON
Sat Oct 10 12:11:28 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 12:11:28 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 12:11:28 2020 - [info] Checking slave configurations..
Sat Oct 10 12:11:28 2020 - [info] Checking replication filtering settings..
Sat Oct 10 12:11:28 2020 - [info] Replication filtering check ok.
Sat Oct 10 12:11:28 2020 - [info] Master is down!
Sat Oct 10 12:11:28 2020 - [info] Terminating monitoring script.
Sat Oct 10 12:11:28 2020 - [info] Got exit code 20 (Master dead).
Sat Oct 10 12:11:28 2020 - [info] MHA::MasterFailover version 0.58.
Sat Oct 10 12:11:28 2020 - [info] Starting master failover.
Sat Oct 10 12:11:28 2020 - [info]
Sat Oct 10 12:11:28 2020 - [info] * Phase 1: Configuration Check Phase..
Sat Oct 10 12:11:28 2020 - [info]
Sat Oct 10 12:11:29 2020 - [info] GTID failover mode = 1
Sat Oct 10 12:11:29 2020 - [info] Dead Servers:
Sat Oct 10 12:11:29 2020 - [info] 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 12:11:29 2020 - [info] Checking master reachability via MySQL(double check)...
Sat Oct 10 12:11:29 2020 - [info] ok.
Sat Oct 10 12:11:29 2020 - [info] Alive Servers:
Sat Oct 10 12:11:29 2020 - [info] 172.16.120.11(172.16.120.11:3358)
Sat Oct 10 12:11:29 2020 - [info] 172.16.120.12(172.16.120.12:3358)
Sat Oct 10 12:11:29 2020 - [info] Alive Slaves:
Sat Oct 10 12:11:29 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 12:11:29 2020 - [info] GTID ON
Sat Oct 10 12:11:29 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 12:11:29 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 12:11:29 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 12:11:29 2020 - [info] GTID ON
Sat Oct 10 12:11:29 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 12:11:29 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 12:11:29 2020 - [info] Starting GTID based failover.
Sat Oct 10 12:11:29 2020 - [info]
Sat Oct 10 12:11:29 2020 - [info] ** Phase 1: Configuration Check Phase completed.
Sat Oct 10 12:11:29 2020 - [info]
Sat Oct 10 12:11:29 2020 - [info] * Phase 2: Dead Master Shutdown Phase..
Sat Oct 10 12:11:29 2020 - [info]
Sat Oct 10 12:11:29 2020 - [info] Forcing shutdown so that applications never connect to the current master..
Sat Oct 10 12:11:29 2020 - [info] Executing master IP deactivation script:
Sat Oct 10 12:11:29 2020 - [info] /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358 --command=stopssh --ssh_user=root
Disabling the VIP on old master: 172.16.120.10
Fake!!! old master rpl_semi_sync_master_enabled=0 rpl_semi_sync_slave_enabled=1
Sat Oct 10 12:11:29 2020 - [info] done.
Sat Oct 10 12:11:29 2020 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Sat Oct 10 12:11:29 2020 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Sat Oct 10 12:11:29 2020 - [info]
Sat Oct 10 12:11:29 2020 - [info] * Phase 3: Master Recovery Phase..
Sat Oct 10 12:11:29 2020 - [info]
Sat Oct 10 12:11:29 2020 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Sat Oct 10 12:11:29 2020 - [info]
Sat Oct 10 12:11:29 2020 - [info] The latest binary log file/position on all slaves is mysql-bin.000014:1070
Sat Oct 10 12:11:29 2020 - [info] Retrieved Gtid Set: 44a4ea53-fcad-11ea-bd16-0050563b7b42:20747-20749
Sat Oct 10 12:11:29 2020 - [info] Latest slaves (Slaves that received relay log files to the latest):
Sat Oct 10 12:11:29 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 12:11:29 2020 - [info] GTID ON
Sat Oct 10 12:11:29 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 12:11:29 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 12:11:29 2020 - [info] The oldest binary log file/position on all slaves is mysql-bin.000014:778
Sat Oct 10 12:11:29 2020 - [info] Retrieved Gtid Set: 44a4ea53-fcad-11ea-bd16-0050563b7b42:20747-20748
Sat Oct 10 12:11:29 2020 - [info] Oldest slaves:
Sat Oct 10 12:11:29 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 12:11:29 2020 - [info] GTID ON
Sat Oct 10 12:11:29 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 12:11:29 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 12:11:29 2020 - [info]
Sat Oct 10 12:11:29 2020 - [info] * Phase 3.3: Determining New Master Phase..
Sat Oct 10 12:11:29 2020 - [info]
Sat Oct 10 12:11:29 2020 - [info] Searching new master from slaves..
Sat Oct 10 12:11:29 2020 - [info] Candidate masters from the configuration file:
Sat Oct 10 12:11:29 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 12:11:29 2020 - [info] GTID ON
Sat Oct 10 12:11:29 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 12:11:29 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 12:11:29 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 12:11:29 2020 - [info] GTID ON
Sat Oct 10 12:11:29 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 12:11:29 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 12:11:29 2020 - [info] Non-candidate masters:
Sat Oct 10 12:11:29 2020 - [info] Searching from candidate_master slaves which have received the latest relay log events..
Sat Oct 10 12:11:29 2020 - [info] New master is 172.16.120.12(172.16.120.12:3358)
Sat Oct 10 12:11:29 2020 - [info] Starting master failover..
Sat Oct 10 12:11:29 2020 - [info]
From:
172.16.120.10(172.16.120.10:3358) (current master)
+--172.16.120.11(172.16.120.11:3358)
+--172.16.120.12(172.16.120.12:3358)

To:
172.16.120.12(172.16.120.12:3358) (new master)
+--172.16.120.11(172.16.120.11:3358)
Sat Oct 10 12:11:29 2020 - [info]
Sat Oct 10 12:11:29 2020 - [info] * Phase 3.3: New Master Recovery Phase..
Sat Oct 10 12:11:29 2020 - [info]
Sat Oct 10 12:11:29 2020 - [info] Waiting all logs to be applied..
Sat Oct 10 12:11:29 2020 - [info] done.
Sat Oct 10 12:11:29 2020 - [info] Getting new master's binlog name and position..
Sat Oct 10 12:11:29 2020 - [info] mysql-bin.000007:3233250
Sat Oct 10 12:11:29 2020 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='172.16.120.12', MASTER_PORT=3358, MASTER_AUTO_POSITION=1, MASTER_USER='repler', MASTER_PASSWORD='xxx';
Sat Oct 10 12:11:29 2020 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: mysql-bin.000007, 3233250, 44a4ea53-fcad-11ea-bd16-0050563b7b42:1-20749,
45d1f02a-fcad-11ea-8a44-0050562f2198:1-27
Sat Oct 10 12:11:29 2020 - [info] Executing master IP activate script:
Sat Oct 10 12:11:29 2020 - [info] /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --command=start --ssh_user=root --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358 --new_master_host=172.16.120.12 --new_master_ip=172.16.120.12 --new_master_port=3358 --new_master_user='mha' --new_master_password=xxx
Enabling the VIP - 172.16.120.128 on the new master - 172.16.120.12
RTNETLINK answers: File exists
Fake!!! new master rpl_semi_sync_master_enabled=1 rpl_semi_sync_slave_enabled=0
Set read_only=0 on the new master.
Creating app user on the new master..
Sat Oct 10 12:11:30 2020 - [info] OK.
Sat Oct 10 12:11:30 2020 - [info] ** Finished master recovery successfully.
Sat Oct 10 12:11:30 2020 - [info] * Phase 3: Master Recovery Phase completed.
Sat Oct 10 12:11:30 2020 - [info]
Sat Oct 10 12:11:30 2020 - [info] * Phase 4: Slaves Recovery Phase..
Sat Oct 10 12:11:30 2020 - [info]
Sat Oct 10 12:11:30 2020 - [info]
Sat Oct 10 12:11:30 2020 - [info] * Phase 4.1: Starting Slaves in parallel..
Sat Oct 10 12:11:30 2020 - [info]
Sat Oct 10 12:11:30 2020 - [info] -- Slave recovery on host 172.16.120.11(172.16.120.11:3358) started, pid: 81557. Check tmp log /masterha/cls_new//172.16.120.11_3358_20201010121128.log if it takes time..
Sat Oct 10 12:11:31 2020 - [info]
Sat Oct 10 12:11:31 2020 - [info] Log messages from 172.16.120.11 ...
Sat Oct 10 12:11:31 2020 - [info]
Sat Oct 10 12:11:30 2020 - [info] Resetting slave 172.16.120.11(172.16.120.11:3358) and starting replication from the new master 172.16.120.12(172.16.120.12:3358)..
Sat Oct 10 12:11:30 2020 - [info] Executed CHANGE MASTER.
Sat Oct 10 12:11:30 2020 - [info] Slave started.
Sat Oct 10 12:11:30 2020 - [info] gtid_wait(44a4ea53-fcad-11ea-bd16-0050563b7b42:1-20749,
45d1f02a-fcad-11ea-8a44-0050562f2198:1-27) completed on 172.16.120.11(172.16.120.11:3358). Executed 2 events.
Sat Oct 10 12:11:31 2020 - [info] End of log messages from 172.16.120.11.
Sat Oct 10 12:11:31 2020 - [info] -- Slave on host 172.16.120.11(172.16.120.11:3358) started.
Sat Oct 10 12:11:31 2020 - [info] All new slave servers recovered successfully.
Sat Oct 10 12:11:31 2020 - [info]
Sat Oct 10 12:11:31 2020 - [info] * Phase 5: New master cleanup phase..
Sat Oct 10 12:11:31 2020 - [info]
Sat Oct 10 12:11:31 2020 - [info] Resetting slave info on the new master..
Sat Oct 10 12:11:31 2020 - [info] 172.16.120.12: Resetting slave info succeeded.
Sat Oct 10 12:11:31 2020 - [info] Master failover to 172.16.120.12(172.16.120.12:3358) completed successfully.
Sat Oct 10 12:11:31 2020 - [info]

----- Failover Report -----

cls_new: MySQL Master failover 172.16.120.10(172.16.120.10:3358) to 172.16.120.12(172.16.120.12:3358) succeeded

Master 172.16.120.10(172.16.120.10:3358) is down!

Check MHA Manager logs at centos-4:/masterha/cls_new/manager.log for details.

Started automated(non-interactive) failover.
Invalidated master IP address on 172.16.120.10(172.16.120.10:3358)
Selected 172.16.120.12(172.16.120.12:3358) as a new master.
172.16.120.12(172.16.120.12:3358): OK: Applying all logs succeeded.
172.16.120.12(172.16.120.12:3358): OK: Activated master IP address.
172.16.120.11(172.16.120.11:3358): OK: Slave started, replicating from 172.16.120.12(172.16.120.12:3358)
172.16.120.12(172.16.120.12:3358): Resetting slave info succeeded.
Master failover to 172.16.120.12(172.16.120.12:3358) completed successfully.
Sat Oct 10 12:11:31 2020 - [info] Sending mail..
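The log above also shows how long detection and failover took. With `ping_interval=3`, the manager logged four failed attempts between 12:11:18 and 12:11:27 before declaring the master down. Here is a minimal sketch for measuring those gaps from manager.log lines. The helper names are hypothetical, the sample lines are copied from the log above, and the `%a %b` parsing assumes an English/C locale.

```python
from datetime import datetime

LOG = """\
Sat Oct 10 12:11:18 2020 - [warning] Got error on MySQL connect ping: DBI connect(';host=172.16.120.10;port=3358;mysql_connect_timeout=1','mha',...) failed: Can't connect to MySQL server on '172.16.120.10' (111) at /usr/local/share/perl5/MHA/HealthCheck.pm line 98.
Sat Oct 10 12:11:27 2020 - [warning] Connection failed 4 time(s)..
Sat Oct 10 12:11:28 2020 - [info] Master is down!
Sat Oct 10 12:11:31 2020 - [info] Master failover to 172.16.120.12(172.16.120.12:3358) completed successfully.
"""


def stamp(line):
    """Parse the timestamp that precedes ' - [level]' in a manager.log line."""
    return datetime.strptime(line.split(" - ", 1)[0], "%a %b %d %H:%M:%S %Y")


def first_match(lines, needle):
    """Timestamp of the first line containing needle."""
    for line in lines:
        if needle in line:
            return stamp(line)
    raise ValueError(f"no log line contains {needle!r}")


lines = LOG.splitlines()
first_error = first_match(lines, "Got error on MySQL connect")
declared_down = first_match(lines, "Master is down!")
completed = first_match(lines, "completed successfully")

print((declared_down - first_error).total_seconds())  # 10.0
print((completed - first_error).total_seconds())      # 13.0
```

In this run the master was declared down about 10 seconds after the first failed ping, and the failover finished about 3 seconds after that.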

ping_type=INSERT

After starting the manager, adjust the master firewall to block access from slave-1

Sat Oct 10 12:14:59 2020 - [info] MHA::MasterMonitor version 0.58.
Sat Oct 10 12:15:00 2020 - [info] GTID failover mode = 1
Sat Oct 10 12:15:00 2020 - [info] Dead Servers:
Sat Oct 10 12:15:00 2020 - [info] Alive Servers:
Sat Oct 10 12:15:00 2020 - [info] 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 12:15:00 2020 - [info] 172.16.120.11(172.16.120.11:3358)
Sat Oct 10 12:15:00 2020 - [info] 172.16.120.12(172.16.120.12:3358)
Sat Oct 10 12:15:00 2020 - [info] Alive Slaves:
Sat Oct 10 12:15:00 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 12:15:00 2020 - [info] GTID ON
Sat Oct 10 12:15:00 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 12:15:00 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 12:15:00 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 12:15:00 2020 - [info] GTID ON
Sat Oct 10 12:15:00 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 12:15:00 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 12:15:00 2020 - [info] Current Alive Master: 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 12:15:00 2020 - [info] Checking slave configurations..
Sat Oct 10 12:15:00 2020 - [info] Checking replication filtering settings..
Sat Oct 10 12:15:00 2020 - [info] binlog_do_db= , binlog_ignore_db=
Sat Oct 10 12:15:00 2020 - [info] Replication filtering check ok.
Sat Oct 10 12:15:00 2020 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Sat Oct 10 12:15:00 2020 - [info] Checking SSH publickey authentication settings on the current master..
Sat Oct 10 12:15:00 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
Sat Oct 10 12:15:00 2020 - [info]
172.16.120.10(172.16.120.10:3358) (current master)
+--172.16.120.11(172.16.120.11:3358)
+--172.16.120.12(172.16.120.12:3358)

Sat Oct 10 12:15:00 2020 - [info] Checking master_ip_failover_script status:
Sat Oct 10 12:15:00 2020 - [info] /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --command=status --ssh_user=root --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358
Sat Oct 10 12:15:00 2020 - [info] OK.
Sat Oct 10 12:15:00 2020 - [warning] shutdown_script is not defined.
Sat Oct 10 12:15:00 2020 - [info] Set master ping interval 3 seconds.
Sat Oct 10 12:15:00 2020 - [info] Set secondary check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12
Sat Oct 10 12:15:00 2020 - [info] Starting ping health check on 172.16.120.10(172.16.120.10:3358)..
Sat Oct 10 12:15:01 2020 - [info] Ping(INSERT) succeeded, waiting until MySQL doesn't respond..

Adjust the master firewall to block access from slave-1

IPTABLES="/sbin/iptables"
$IPTABLES -F
$IPTABLES -A INPUT -p icmp --icmp-type any -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.10 -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.13 -j ACCEPT
$IPTABLES -A INPUT -p tcp -s 172.16.120.12 -j ACCEPT
$IPTABLES -A INPUT -p tcp --syn -j DROP

On the master, kill slave-1's io_thread connection

root@localhost 12:13:11 [dbms_monitor]> show processlist;
+----+----------+---------------------+--------------+------------------+------+---------------------------------------------------------------+------------------+-----------+---------------+
| Id | User | Host | db | Command | Time | State | Info | Rows_sent | Rows_examined |
+----+----------+---------------------+--------------+------------------+------+---------------------------------------------------------------+------------------+-----------+---------------+
| 2 | proxysql | 172.16.120.12:33486 | NULL | Sleep | 9 | | NULL | 0 | 0 |
| 3 | root | localhost | dbms_monitor | Query | 0 | starting | show processlist | 0 | 0 |
| 4 | proxysql | 172.16.120.10:34148 | NULL | Sleep | 4 | | NULL | 0 | 0 |
| 5 | proxysql | 172.16.120.11:35190 | NULL | Sleep | 2 | | NULL | 0 | 0 |
| 8 | repler | 172.16.120.11:35200 | NULL | Binlog Dump GTID | 428 | Master has sent all binlog to slave; waiting for more updates | NULL | 0 | 0 |
| 9 | repler | 172.16.120.12:33490 | NULL | Binlog Dump GTID | 379 | Master has sent all binlog to slave; waiting for more updates | NULL | 0 | 0 |
| 13 | mha | 172.16.120.13:40526 | NULL | Sleep | 0 | | NULL | 0 | 0 |
| 17 | proxysql | 172.16.120.10:34164 | NULL | Sleep | 14 | | NULL | 1 | 0 |
+----+----------+---------------------+--------------+------------------+------+---------------------------------------------------------------+------------------+-----------+---------------+
8 rows in set (0.00 sec)

root@localhost 12:20:46 [dbms_monitor]> kill 8;
Query OK, 0 rows affected (0.00 sec)

root@localhost 12:20:59 [dbms_monitor]> truncate table monitor_delay;
Query OK, 0 rows affected (0.01 sec)

root@localhost 12:21:10 [dbms_monitor]> insert into monitor_delay values(88,now());
Query OK, 1 row affected (0.00 sec)

root@localhost 12:21:17 [dbms_monitor]> select * from monitor_delay;
+----+---------------------+
| id | ctime |
+----+---------------------+
| 88 | 2020-10-10 12:21:17 |
+----+---------------------+
1 row in set (0.00 sec)

slave-1

root@localhost 12:21:29 [dbms_monitor]> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Reconnecting after a failed master event read
Master_Host: 172.16.120.10
Master_User: repler
Master_Port: 3358
Connect_Retry: 1
Master_Log_File: mysql-bin.000015
Read_Master_Log_Pos: 85472
Relay_Log_File: mysql-relay-bin.000002
Relay_Log_Pos: 85645
Relay_Master_Log_File: mysql-bin.000015
Slave_IO_Running: Connecting
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 85472
Relay_Log_Space: 85852
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 2003
Last_IO_Error: error reconnecting to master 'repler@172.16.120.10:3358' - retry-time: 1 retries: 1
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 120103358
Master_UUID: 44a4ea53-fcad-11ea-bd16-0050563b7b42
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp: 201010 12:21:59
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: 44a4ea53-fcad-11ea-bd16-0050563b7b42:20750-21111
Executed_Gtid_Set: 44a4ea53-fcad-11ea-bd16-0050563b7b42:1-21111,
45d1f02a-fcad-11ea-8a44-0050562f2198:1-27
Auto_Position: 1
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
1 row in set (0.00 sec)

root@localhost 12:22:03 [dbms_monitor]> select * from monitor_delay;
+----+---------------------+
| id | ctime |
+----+---------------------+
| 1 | 2020-10-10 11:21:39 |
| 2 | 2020-10-10 11:31:45 |
| 3 | 2020-10-10 11:51:01 |
| 4 | 2020-10-10 11:59:15 |
| 5 | 2020-10-10 12:04:35 |
| 6 | 2020-10-10 12:05:30 |
+----+---------------------+
6 rows in set (0.00 sec)

slave-2

root@localhost 12:21:32 [dbms_monitor]> select * from monitor_delay;
+----+---------------------+
| id | ctime |
+----+---------------------+
| 88 | 2020-10-10 12:21:17 |
+----+---------------------+
1 row in set (0.00 sec)

Shut down the master

root@localhost 12:21:24 [dbms_monitor]> shutdown;
Query OK, 0 rows affected (0.00 sec)

Conclusion: failover is triggered and succeeds

Sat Oct 10 12:22:43 2020 - [warning] Got error on MySQL insert ping: 2006 (MySQL server has gone away)
Sat Oct 10 12:22:43 2020 - [info] Executing SSH check script: exit 0
Sat Oct 10 12:22:43 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12 --user=root --master_host=172.16.120.10 --master_ip=172.16.120.10 --master_port=3358 --master_user=mha --master_password=xxx --ping_type=INSERT
Sat Oct 10 12:22:43 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
Sat Oct 10 12:22:46 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (111))
Sat Oct 10 12:22:46 2020 - [warning] Connection failed 2 time(s)..
Monitoring server 172.16.120.11 is reachable, Master is not reachable from 172.16.120.11. OK.
Monitoring server 172.16.120.12 is reachable, Master is not reachable from 172.16.120.12. OK.
Sat Oct 10 12:22:48 2020 - [info] Master is not reachable from all other monitoring servers. Failover should start.
Sat Oct 10 12:22:49 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (111))
Sat Oct 10 12:22:49 2020 - [warning] Connection failed 3 time(s)..
Sat Oct 10 12:22:52 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (111))
Sat Oct 10 12:22:52 2020 - [warning] Connection failed 4 time(s)..
Sat Oct 10 12:22:52 2020 - [warning] Master is not reachable from health checker!
Sat Oct 10 12:22:52 2020 - [warning] Master 172.16.120.10(172.16.120.10:3358) is not reachable!
Sat Oct 10 12:22:52 2020 - [warning] SSH is reachable.
Sat Oct 10 12:22:52 2020 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha/conf/masterha_default.cnf and /etc/masterha/conf/cls_new.cnf again, and trying to connect to all servers to check server status..
Sat Oct 10 12:22:52 2020 - [info] Reading default configuration from /etc/masterha/conf/masterha_default.cnf..
Sat Oct 10 12:22:52 2020 - [info] Reading application default configuration from /etc/masterha/conf/cls_new.cnf..
Sat Oct 10 12:22:52 2020 - [info] Reading server configuration from /etc/masterha/conf/cls_new.cnf..
Sat Oct 10 12:22:53 2020 - [info] GTID failover mode = 1
Sat Oct 10 12:22:53 2020 - [info] Dead Servers:
Sat Oct 10 12:22:53 2020 - [info] 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 12:22:53 2020 - [info] Alive Servers:
Sat Oct 10 12:22:53 2020 - [info] 172.16.120.11(172.16.120.11:3358)
Sat Oct 10 12:22:53 2020 - [info] 172.16.120.12(172.16.120.12:3358)
Sat Oct 10 12:22:53 2020 - [info] Alive Slaves:
Sat Oct 10 12:22:53 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 12:22:53 2020 - [info] GTID ON
Sat Oct 10 12:22:53 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 12:22:53 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 12:22:53 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 12:22:53 2020 - [info] GTID ON
Sat Oct 10 12:22:53 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 12:22:53 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 12:22:53 2020 - [info] Checking slave configurations..
Sat Oct 10 12:22:53 2020 - [info] Checking replication filtering settings..
Sat Oct 10 12:22:53 2020 - [info] Replication filtering check ok.
Sat Oct 10 12:22:53 2020 - [info] Master is down!
Sat Oct 10 12:22:53 2020 - [info] Terminating monitoring script.
Sat Oct 10 12:22:53 2020 - [info] Got exit code 20 (Master dead).
Sat Oct 10 12:22:53 2020 - [info] MHA::MasterFailover version 0.58.
Sat Oct 10 12:22:53 2020 - [info] Starting master failover.
Sat Oct 10 12:22:53 2020 - [info]
Sat Oct 10 12:22:53 2020 - [info] * Phase 1: Configuration Check Phase..
Sat Oct 10 12:22:53 2020 - [info]
Sat Oct 10 12:22:54 2020 - [info] GTID failover mode = 1
Sat Oct 10 12:22:54 2020 - [info] Dead Servers:
Sat Oct 10 12:22:54 2020 - [info] 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 12:22:54 2020 - [info] Alive Servers:
Sat Oct 10 12:22:54 2020 - [info] 172.16.120.11(172.16.120.11:3358)
Sat Oct 10 12:22:54 2020 - [info] 172.16.120.12(172.16.120.12:3358)
Sat Oct 10 12:22:54 2020 - [info] Alive Slaves:
Sat Oct 10 12:22:54 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 12:22:54 2020 - [info] GTID ON
Sat Oct 10 12:22:54 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 12:22:54 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 12:22:54 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 12:22:54 2020 - [info] GTID ON
Sat Oct 10 12:22:54 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 12:22:54 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 12:22:54 2020 - [info] Starting GTID based failover.
Sat Oct 10 12:22:54 2020 - [info]
Sat Oct 10 12:22:54 2020 - [info] ** Phase 1: Configuration Check Phase completed.
Sat Oct 10 12:22:54 2020 - [info]
Sat Oct 10 12:22:54 2020 - [info] * Phase 2: Dead Master Shutdown Phase..
Sat Oct 10 12:22:54 2020 - [info]
Sat Oct 10 12:22:54 2020 - [info] Forcing shutdown so that applications never connect to the current master..
Sat Oct 10 12:22:54 2020 - [info] Executing master IP deactivation script:
Sat Oct 10 12:22:54 2020 - [info] /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358 --command=stopssh --ssh_user=root
Disabling the VIP on old master: 172.16.120.10
RTNETLINK answers: Cannot assign requested address
Fake!!! original master rpl_semi_sync_master_enabled=0 rpl_semi_sync_slave_enabled=1
Sat Oct 10 12:22:54 2020 - [info] done.
Sat Oct 10 12:22:54 2020 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Sat Oct 10 12:22:54 2020 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Sat Oct 10 12:22:54 2020 - [info]
Sat Oct 10 12:22:54 2020 - [info] * Phase 3: Master Recovery Phase..
Sat Oct 10 12:22:54 2020 - [info]
Sat Oct 10 12:22:54 2020 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Sat Oct 10 12:22:54 2020 - [info]
Sat Oct 10 12:22:54 2020 - [info] The latest binary log file/position on all slaves is mysql-bin.000015:110146
Sat Oct 10 12:22:54 2020 - [info] Retrieved Gtid Set: 44a4ea53-fcad-11ea-bd16-0050563b7b42:20750-21216
Sat Oct 10 12:22:54 2020 - [info] Latest slaves (Slaves that received relay log files to the latest):
Sat Oct 10 12:22:54 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 12:22:54 2020 - [info] GTID ON
Sat Oct 10 12:22:54 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 12:22:54 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 12:22:54 2020 - [info] The oldest binary log file/position on all slaves is mysql-bin.000015:85472
Sat Oct 10 12:22:54 2020 - [info] Retrieved Gtid Set: 44a4ea53-fcad-11ea-bd16-0050563b7b42:20750-21111
Sat Oct 10 12:22:54 2020 - [info] Oldest slaves:
Sat Oct 10 12:22:54 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 12:22:54 2020 - [info] GTID ON
Sat Oct 10 12:22:54 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 12:22:54 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 12:22:54 2020 - [info]
Sat Oct 10 12:22:54 2020 - [info] * Phase 3.3: Determining New Master Phase..
Sat Oct 10 12:22:54 2020 - [info]
Sat Oct 10 12:22:54 2020 - [info] Searching new master from slaves..
Sat Oct 10 12:22:54 2020 - [info] Candidate masters from the configuration file:
Sat Oct 10 12:22:54 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 12:22:54 2020 - [info] GTID ON
Sat Oct 10 12:22:54 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 12:22:54 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 12:22:54 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 12:22:54 2020 - [info] GTID ON
Sat Oct 10 12:22:54 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 12:22:54 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 12:22:54 2020 - [info] Non-candidate masters:
Sat Oct 10 12:22:54 2020 - [info] Searching from candidate_master slaves which have received the latest relay log events..
Sat Oct 10 12:22:54 2020 - [info] New master is 172.16.120.12(172.16.120.12:3358)
Sat Oct 10 12:22:54 2020 - [info] Starting master failover..
Sat Oct 10 12:22:54 2020 - [info]
From:
172.16.120.10(172.16.120.10:3358) (current master)
+--172.16.120.11(172.16.120.11:3358)
+--172.16.120.12(172.16.120.12:3358)

To:
172.16.120.12(172.16.120.12:3358) (new master)
+--172.16.120.11(172.16.120.11:3358)
Sat Oct 10 12:22:54 2020 - [info]
Sat Oct 10 12:22:54 2020 - [info] * Phase 3.3: New Master Recovery Phase..
Sat Oct 10 12:22:54 2020 - [info]
Sat Oct 10 12:22:54 2020 - [info] Waiting all logs to be applied..
Sat Oct 10 12:22:54 2020 - [info] done.
Sat Oct 10 12:22:54 2020 - [info] Getting new master's binlog name and position..
Sat Oct 10 12:22:54 2020 - [info] mysql-bin.000007:3342407
Sat Oct 10 12:22:54 2020 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='172.16.120.12', MASTER_PORT=3358, MASTER_AUTO_POSITION=1, MASTER_USER='repler', MASTER_PASSWORD='xxx';
Sat Oct 10 12:22:54 2020 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: mysql-bin.000007, 3342407, 44a4ea53-fcad-11ea-bd16-0050563b7b42:1-21216,
45d1f02a-fcad-11ea-8a44-0050562f2198:1-27
Sat Oct 10 12:22:54 2020 - [info] Executing master IP activate script:
Sat Oct 10 12:22:54 2020 - [info] /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --command=start --ssh_user=root --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358 --new_master_host=172.16.120.12 --new_master_ip=172.16.120.12 --new_master_port=3358 --new_master_user='mha' --new_master_password=xxx
Enabling the VIP - 172.16.120.128 on the new master - 172.16.120.12
RTNETLINK answers: File exists
Fake!!! new master rpl_semi_sync_master_enabled=1 rpl_semi_sync_slave_enabled=0
Set read_only=0 on the new master.
Creating app user on the new master..
Sat Oct 10 12:22:54 2020 - [info] OK.
Sat Oct 10 12:22:54 2020 - [info] ** Finished master recovery successfully.
Sat Oct 10 12:22:54 2020 - [info] * Phase 3: Master Recovery Phase completed.
Sat Oct 10 12:22:54 2020 - [info]
Sat Oct 10 12:22:54 2020 - [info] * Phase 4: Slaves Recovery Phase..
Sat Oct 10 12:22:54 2020 - [info]
Sat Oct 10 12:22:54 2020 - [info]
Sat Oct 10 12:22:54 2020 - [info] * Phase 4.1: Starting Slaves in parallel..
Sat Oct 10 12:22:54 2020 - [info]
Sat Oct 10 12:22:54 2020 - [info] -- Slave recovery on host 172.16.120.11(172.16.120.11:3358) started, pid: 82756. Check tmp log /masterha/cls_new//172.16.120.11_3358_20201010122253.log if it takes time..
Sat Oct 10 12:22:55 2020 - [info]
Sat Oct 10 12:22:55 2020 - [info] Log messages from 172.16.120.11 ...
Sat Oct 10 12:22:55 2020 - [info]
Sat Oct 10 12:22:54 2020 - [info] Resetting slave 172.16.120.11(172.16.120.11:3358) and starting replication from the new master 172.16.120.12(172.16.120.12:3358)..
Sat Oct 10 12:22:54 2020 - [info] Executed CHANGE MASTER.
Sat Oct 10 12:22:54 2020 - [info] Slave started.
Sat Oct 10 12:22:55 2020 - [info] gtid_wait(44a4ea53-fcad-11ea-bd16-0050563b7b42:1-21216,
45d1f02a-fcad-11ea-8a44-0050562f2198:1-27) completed on 172.16.120.11(172.16.120.11:3358). Executed 2 events.
Sat Oct 10 12:22:55 2020 - [info] End of log messages from 172.16.120.11.
Sat Oct 10 12:22:55 2020 - [info] -- Slave on host 172.16.120.11(172.16.120.11:3358) started.
Sat Oct 10 12:22:55 2020 - [info] All new slave servers recovered successfully.
Sat Oct 10 12:22:55 2020 - [info]
Sat Oct 10 12:22:55 2020 - [info] * Phase 5: New master cleanup phase..
Sat Oct 10 12:22:55 2020 - [info]
Sat Oct 10 12:22:55 2020 - [info] Resetting slave info on the new master..
Sat Oct 10 12:22:55 2020 - [info] 172.16.120.12: Resetting slave info succeeded.
Sat Oct 10 12:22:55 2020 - [info] Master failover to 172.16.120.12(172.16.120.12:3358) completed successfully.
Sat Oct 10 12:22:55 2020 - [info]

----- Failover Report -----

cls_new: MySQL Master failover 172.16.120.10(172.16.120.10:3358) to 172.16.120.12(172.16.120.12:3358) succeeded

Master 172.16.120.10(172.16.120.10:3358) is down!

Check MHA Manager logs at centos-4:/masterha/cls_new/manager.log for details.

Started automated(non-interactive) failover.
Invalidated master IP address on 172.16.120.10(172.16.120.10:3358)
Selected 172.16.120.12(172.16.120.12:3358) as a new master.
172.16.120.12(172.16.120.12:3358): OK: Applying all logs succeeded.
172.16.120.12(172.16.120.12:3358): OK: Activated master IP address.
172.16.120.11(172.16.120.11:3358): OK: Slave started, replicating from 172.16.120.12(172.16.120.12:3358)
172.16.120.12(172.16.120.12:3358): Resetting slave info succeeded.
Master failover to 172.16.120.12(172.16.120.12:3358) completed successfully.
Sat Oct 10 12:22:55 2020 - [info] Sending mail..
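
Phase 3.1 of the log above reports two Retrieved Gtid Sets: `...:20750-21216` for the latest slave (172.16.120.12) and `...:20750-21111` for the oldest slave (172.16.120.11). As a rough illustration of how far apart the two slaves were, the Python sketch below parses both sets and subtracts the last transaction numbers. The helper names `parse_gtid_set` and `last_gno` are hypothetical and belong to neither MHA nor MySQL, and the parser only handles the simple `uuid:start-end` form that appears in this log.

```python
def parse_gtid_set(gtid_set):
    """Parse 'uuid:1-5:7,uuid2:3' into {uuid: [(start, end), ...]}.

    Illustrative helper only; not an MHA or MySQL function.
    """
    result = {}
    for part in gtid_set.replace("\n", "").split(","):
        part = part.strip()
        if not part:
            continue
        # The UUID itself contains '-', so split on ':' first.
        uuid, *intervals = part.split(":")
        ranges = result.setdefault(uuid, [])
        for interval in intervals:
            start, _, end = interval.partition("-")
            # A single transaction such as '20746' has no end value.
            ranges.append((int(start), int(end or start)))
    return result


def last_gno(gtid_set, uuid):
    """Highest transaction number recorded for the given source UUID."""
    return max(end for _, end in parse_gtid_set(gtid_set)[uuid])


# Values copied from Phase 3.1 of the log above.
SOURCE_UUID = "44a4ea53-fcad-11ea-bd16-0050563b7b42"
latest = SOURCE_UUID + ":20750-21216"   # 172.16.120.12
oldest = SOURCE_UUID + ":20750-21111"   # 172.16.120.11

gap = last_gno(latest, SOURCE_UUID) - last_gno(oldest, SOURCE_UUID)
print(gap)  # 105
```

Under these assumptions, 172.16.120.12 had received 105 more transactions than 172.16.120.11, which matches the log selecting 172.16.120.12 as the only "latest slave" and as the new master.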

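[Test case] Master is down and a slave also has a problem, 4 (sql_thread stopped on some slaves)

The master goes down; before that, the sql_thread on slave-1 had been stopped

ping_type=CONNECT

Start the manager, then stop the sql_thread on slave-1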

Sat Oct 10 11:43:34 2020 - [info] MHA::MasterMonitor version 0.58.
Sat Oct 10 11:43:35 2020 - [info] GTID failover mode = 1
Sat Oct 10 11:43:35 2020 - [info] Dead Servers:
Sat Oct 10 11:43:35 2020 - [info] Alive Servers:
Sat Oct 10 11:43:35 2020 - [info] 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 11:43:35 2020 - [info] 172.16.120.11(172.16.120.11:3358)
Sat Oct 10 11:43:35 2020 - [info] 172.16.120.12(172.16.120.12:3358)
Sat Oct 10 11:43:35 2020 - [info] Alive Slaves:
Sat Oct 10 11:43:35 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 11:43:35 2020 - [info] GTID ON
Sat Oct 10 11:43:35 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 11:43:35 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 11:43:35 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 11:43:35 2020 - [info] GTID ON
Sat Oct 10 11:43:35 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 11:43:35 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 11:43:35 2020 - [info] Current Alive Master: 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 11:43:35 2020 - [info] Checking slave configurations..
Sat Oct 10 11:43:35 2020 - [info] Checking replication filtering settings..
Sat Oct 10 11:43:35 2020 - [info] binlog_do_db= , binlog_ignore_db=
Sat Oct 10 11:43:35 2020 - [info] Replication filtering check ok.
Sat Oct 10 11:43:35 2020 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Sat Oct 10 11:43:35 2020 - [info] Checking SSH publickey authentication settings on the current master..
Sat Oct 10 11:43:35 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
Sat Oct 10 11:43:35 2020 - [info]
172.16.120.10(172.16.120.10:3358) (current master)
+--172.16.120.11(172.16.120.11:3358)
+--172.16.120.12(172.16.120.12:3358)

Sat Oct 10 11:43:35 2020 - [info] Checking master_ip_failover_script status:
Sat Oct 10 11:43:35 2020 - [info] /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --command=status --ssh_user=root --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358
Sat Oct 10 11:43:35 2020 - [info] OK.
Sat Oct 10 11:43:35 2020 - [warning] shutdown_script is not defined.
Sat Oct 10 11:43:35 2020 - [info] Set master ping interval 3 seconds.
Sat Oct 10 11:43:35 2020 - [info] Set secondary check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12
Sat Oct 10 11:43:35 2020 - [info] Starting ping health check on 172.16.120.10(172.16.120.10:3358)..
Sat Oct 10 11:43:35 2020 - [info] Ping(CONNECT) succeeded, waiting until MySQL doesn't respond..

Stop the sql_thread on slave-1
root@localhost 11:44:26 [dbms_monitor]> stop slave sql_thread;
Query OK, 0 rows affected (0.01 sec)

root@localhost 11:50:00 [dbms_monitor]> pager cat - | grep -E 'Master_Log_File|Relay_Master_Log_File|Read_Master_Log_Pos|Exec_Master_Log_Pos|Slave_IO_Running|Slave_SQL_Running|Slave_SQL_Running_State|Last|Relay_Log_File|Relay_Log_Pos'
PAGER set to 'cat - | grep -E 'Master_Log_File|Relay_Master_Log_File|Read_Master_Log_Pos|Exec_Master_Log_Pos|Slave_IO_Running|Slave_SQL_Running|Slave_SQL_Running_State|Last|Relay_Log_File|Relay_Log_Pos''
root@localhost 11:50:04 [dbms_monitor]> show slave status\G
Master_Log_File: mysql-bin.000013
Read_Master_Log_Pos: 194
Relay_Log_File: mysql-relay-bin.000002
Relay_Log_Pos: 367
Relay_Master_Log_File: mysql-bin.000013
Slave_IO_Running: Yes
Slave_SQL_Running: No
Last_Errno: 0
Last_Error:
Exec_Master_Log_Pos: 194
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Slave_SQL_Running_State:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
1 row in set (0.00 sec)
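
The output above shows `Slave_SQL_Running: No` together with `Last_SQL_Errno: 0`, which is the "stopped, no error" state. The summary table at the top of this post reports different failover results for a stopped sql_thread and for a sql_thread that hit an error, so it helps to tell the two apart. The following Python sketch does that from the text of `show slave status\G`. The function names `parse_slave_status` and `sql_thread_state` and the three state labels are assumptions made for illustration and are not part of MHA.

```python
def parse_slave_status(text):
    """Turn 'Key: Value' lines from vertical SHOW SLAVE STATUS output into a dict."""
    status = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon, e.g. '1 row in set'
            status[key.strip()] = value.strip()
    return status


def sql_thread_state(status):
    """Classify the SQL thread (labels are illustrative, not MHA output)."""
    if status.get("Slave_SQL_Running") == "Yes":
        return "running"
    if status.get("Last_SQL_Errno", "0") != "0":
        return "stopped with error"
    return "stopped (no error)"


# Fields copied from the slave-1 output above.
sample = """\
Slave_IO_Running: Yes
Slave_SQL_Running: No
Last_SQL_Errno: 0
Last_SQL_Error:
"""

print(sql_thread_state(parse_slave_status(sample)))  # stopped (no error)
```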

After the sql_thread is stopped, the manager keeps running normally

Shut down the master

root@localhost 11:40:00 [dbms_monitor]> insert into monitor_delay values(3,now());
Query OK, 1 row affected (0.00 sec)

root@localhost 11:51:01 [dbms_monitor]> shutdown;
Query OK, 0 rows affected (0.00 sec)

slave-1
root@localhost 11:50:04 [dbms_monitor]> select * from monitor_delay;
+----+---------------------+
| id | ctime |
+----+---------------------+
| 1 | 2020-10-10 11:21:39 |
| 2 | 2020-10-10 11:31:45 |
+----+---------------------+
2 rows in set (0.00 sec)

slave-2
root@localhost 11:36:17 [dbms_monitor]> select * from monitor_delay;
+----+---------------------+
| id | ctime |
+----+---------------------+
| 1 | 2020-10-10 11:21:39 |
| 2 | 2020-10-10 11:31:45 |
| 3 | 2020-10-10 11:51:01 |
+----+---------------------+
3 rows in set (0.00 sec)

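Conclusion: failover is triggered and succeeds (the log below shows MHA restarting the stopped sql_thread on slave-1 during Phase 1)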

Sat Oct 10 11:51:18 2020 - [warning] Got error on MySQL connect ping: DBI connect(';host=172.16.120.10;port=3358;mysql_connect_timeout=1','mha',...) failed: Can't connect to MySQL server on '172.16.120.10' (111) at /usr/local/share/perl5/MHA/HealthCheck.pm line 98.
2003 (Can't connect to MySQL server on '172.16.120.10' (111))
Sat Oct 10 11:51:18 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12 --user=root --master_host=172.16.120.10 --master_ip=172.16.120.10 --master_port=3358 --master_user=mha --master_password=xxx --ping_type=CONNECT
Sat Oct 10 11:51:18 2020 - [info] Executing SSH check script: exit 0
Sat Oct 10 11:51:18 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
Monitoring server 172.16.120.11 is reachable, Master is not reachable from 172.16.120.11. OK.
Monitoring server 172.16.120.12 is reachable, Master is not reachable from 172.16.120.12. OK.
Sat Oct 10 11:51:18 2020 - [info] Master is not reachable from all other monitoring servers. Failover should start.
Sat Oct 10 11:51:21 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (111))
Sat Oct 10 11:51:21 2020 - [warning] Connection failed 2 time(s)..
Sat Oct 10 11:51:24 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (111))
Sat Oct 10 11:51:24 2020 - [warning] Connection failed 3 time(s)..
Sat Oct 10 11:51:27 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (111))
Sat Oct 10 11:51:27 2020 - [warning] Connection failed 4 time(s)..
Sat Oct 10 11:51:27 2020 - [warning] Master is not reachable from health checker!
Sat Oct 10 11:51:27 2020 - [warning] Master 172.16.120.10(172.16.120.10:3358) is not reachable!
Sat Oct 10 11:51:27 2020 - [warning] SSH is reachable.
Sat Oct 10 11:51:27 2020 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha/conf/masterha_default.cnf and /etc/masterha/conf/cls_new.cnf again, and trying to connect to all servers to check server status..
Sat Oct 10 11:51:27 2020 - [info] Reading default configuration from /etc/masterha/conf/masterha_default.cnf..
Sat Oct 10 11:51:27 2020 - [info] Reading application default configuration from /etc/masterha/conf/cls_new.cnf..
Sat Oct 10 11:51:27 2020 - [info] Reading server configuration from /etc/masterha/conf/cls_new.cnf..
Sat Oct 10 11:51:28 2020 - [warning] SQL Thread is stopped(no error) on 172.16.120.11(172.16.120.11:3358)
Sat Oct 10 11:51:28 2020 - [info] GTID failover mode = 1
Sat Oct 10 11:51:28 2020 - [info] Dead Servers:
Sat Oct 10 11:51:28 2020 - [info] 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 11:51:28 2020 - [info] Alive Servers:
Sat Oct 10 11:51:28 2020 - [info] 172.16.120.11(172.16.120.11:3358)
Sat Oct 10 11:51:28 2020 - [info] 172.16.120.12(172.16.120.12:3358)
Sat Oct 10 11:51:28 2020 - [info] Alive Slaves:
Sat Oct 10 11:51:28 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 11:51:28 2020 - [info] GTID ON
Sat Oct 10 11:51:28 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 11:51:28 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 11:51:28 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 11:51:28 2020 - [info] GTID ON
Sat Oct 10 11:51:28 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 11:51:28 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 11:51:28 2020 - [info] Checking slave configurations..
Sat Oct 10 11:51:28 2020 - [info] Checking replication filtering settings..
Sat Oct 10 11:51:28 2020 - [info] Replication filtering check ok.
Sat Oct 10 11:51:28 2020 - [info] Master is down!
Sat Oct 10 11:51:28 2020 - [info] Terminating monitoring script.
Sat Oct 10 11:51:28 2020 - [info] Got exit code 20 (Master dead).
Sat Oct 10 11:51:28 2020 - [info] MHA::MasterFailover version 0.58.
Sat Oct 10 11:51:28 2020 - [info] Starting master failover.
Sat Oct 10 11:51:28 2020 - [info]
Sat Oct 10 11:51:28 2020 - [info] * Phase 1: Configuration Check Phase..
Sat Oct 10 11:51:28 2020 - [info]
Sat Oct 10 11:51:29 2020 - [warning] SQL Thread is stopped(no error) on 172.16.120.11(172.16.120.11:3358)
Sat Oct 10 11:51:29 2020 - [info] GTID failover mode = 1
Sat Oct 10 11:51:29 2020 - [info] Dead Servers:
Sat Oct 10 11:51:29 2020 - [info] 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 11:51:29 2020 - [info] Checking master reachability via MySQL(double check)...
Sat Oct 10 11:51:29 2020 - [info] ok.
Sat Oct 10 11:51:29 2020 - [info] Alive Servers:
Sat Oct 10 11:51:29 2020 - [info] 172.16.120.11(172.16.120.11:3358)
Sat Oct 10 11:51:29 2020 - [info] 172.16.120.12(172.16.120.12:3358)
Sat Oct 10 11:51:29 2020 - [info] Alive Slaves:
Sat Oct 10 11:51:29 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 11:51:29 2020 - [info] GTID ON
Sat Oct 10 11:51:29 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 11:51:29 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 11:51:29 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 11:51:29 2020 - [info] GTID ON
Sat Oct 10 11:51:29 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 11:51:29 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 11:51:29 2020 - [info] Starting SQL thread on 172.16.120.11(172.16.120.11:3358) ..
Sat Oct 10 11:51:29 2020 - [info] done.
Sat Oct 10 11:51:29 2020 - [info] Starting GTID based failover.
Sat Oct 10 11:51:29 2020 - [info]
Sat Oct 10 11:51:29 2020 - [info] ** Phase 1: Configuration Check Phase completed.
Sat Oct 10 11:51:29 2020 - [info]
Sat Oct 10 11:51:29 2020 - [info] * Phase 2: Dead Master Shutdown Phase..
Sat Oct 10 11:51:29 2020 - [info]
Sat Oct 10 11:51:29 2020 - [info] Forcing shutdown so that applications never connect to the current master..
Sat Oct 10 11:51:29 2020 - [info] Executing master IP deactivation script:
Sat Oct 10 11:51:29 2020 - [info] /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358 --command=stopssh --ssh_user=root
Disabling the VIP on old master: 172.16.120.10
RTNETLINK answers: Cannot assign requested address
Fake!!! original master rpl_semi_sync_master_enabled=0 rpl_semi_sync_slave_enabled=1
Sat Oct 10 11:51:29 2020 - [info] done.
Sat Oct 10 11:51:29 2020 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Sat Oct 10 11:51:29 2020 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Sat Oct 10 11:51:29 2020 - [info]
Sat Oct 10 11:51:29 2020 - [info] * Phase 3: Master Recovery Phase..
Sat Oct 10 11:51:29 2020 - [info]
Sat Oct 10 11:51:29 2020 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Sat Oct 10 11:51:29 2020 - [info]
Sat Oct 10 11:51:29 2020 - [info] The latest binary log file/position on all slaves is mysql-bin.000013:486
Sat Oct 10 11:51:29 2020 - [info] Retrieved Gtid Set: 44a4ea53-fcad-11ea-bd16-0050563b7b42:20746
Sat Oct 10 11:51:29 2020 - [info] Latest slaves (Slaves that received relay log files to the latest):
Sat Oct 10 11:51:29 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 11:51:29 2020 - [info] GTID ON
Sat Oct 10 11:51:29 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 11:51:29 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 11:51:29 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 11:51:29 2020 - [info] GTID ON
Sat Oct 10 11:51:29 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 11:51:29 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 11:51:29 2020 - [info] The oldest binary log file/position on all slaves is mysql-bin.000013:486
Sat Oct 10 11:51:29 2020 - [info] Retrieved Gtid Set: 44a4ea53-fcad-11ea-bd16-0050563b7b42:20746
Sat Oct 10 11:51:29 2020 - [info] Oldest slaves:
Sat Oct 10 11:51:29 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 11:51:29 2020 - [info] GTID ON
Sat Oct 10 11:51:29 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 11:51:29 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 11:51:29 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 11:51:29 2020 - [info] GTID ON
Sat Oct 10 11:51:29 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 11:51:29 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 11:51:29 2020 - [info]
Sat Oct 10 11:51:29 2020 - [info] * Phase 3.3: Determining New Master Phase..
Sat Oct 10 11:51:29 2020 - [info]
Sat Oct 10 11:51:29 2020 - [info] Searching new master from slaves..
Sat Oct 10 11:51:29 2020 - [info] Candidate masters from the configuration file:
Sat Oct 10 11:51:29 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 11:51:29 2020 - [info] GTID ON
Sat Oct 10 11:51:29 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 11:51:29 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 11:51:29 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 11:51:29 2020 - [info] GTID ON
Sat Oct 10 11:51:29 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 11:51:29 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 11:51:29 2020 - [info] Non-candidate masters:
Sat Oct 10 11:51:29 2020 - [info] Searching from candidate_master slaves which have received the latest relay log events..
Sat Oct 10 11:51:29 2020 - [info] New master is 172.16.120.11(172.16.120.11:3358)
Sat Oct 10 11:51:29 2020 - [info] Starting master failover..
Sat Oct 10 11:51:29 2020 - [info]
From:
172.16.120.10(172.16.120.10:3358) (current master)
+--172.16.120.11(172.16.120.11:3358)
+--172.16.120.12(172.16.120.12:3358)

To:
172.16.120.11(172.16.120.11:3358) (new master)
+--172.16.120.12(172.16.120.12:3358)
Sat Oct 10 11:51:29 2020 - [info]
Sat Oct 10 11:51:29 2020 - [info] * Phase 3.3: New Master Recovery Phase..
Sat Oct 10 11:51:29 2020 - [info]
Sat Oct 10 11:51:29 2020 - [info] Waiting all logs to be applied..
Sat Oct 10 11:51:29 2020 - [info] done.
Sat Oct 10 11:51:29 2020 - [info] Replicating from the latest slave 172.16.120.12(172.16.120.12:3358) and waiting to apply..
Sat Oct 10 11:51:29 2020 - [info] Waiting all logs to be applied on the latest slave..
Sat Oct 10 11:51:29 2020 - [info] Resetting slave 172.16.120.11(172.16.120.11:3358) and starting replication from the new master 172.16.120.12(172.16.120.12:3358)..
Sat Oct 10 11:51:29 2020 - [info] Executed CHANGE MASTER.
Sat Oct 10 11:51:29 2020 - [info] Slave started.
Sat Oct 10 11:51:29 2020 - [info] Waiting to execute all relay logs on 172.16.120.11(172.16.120.11:3358)..
Sat Oct 10 11:51:29 2020 - [info] master_pos_wait(mysql-bin.000007:3232449) completed on 172.16.120.11(172.16.120.11:3358). Executed 1 events.
Sat Oct 10 11:51:29 2020 - [info] done.
Sat Oct 10 11:51:29 2020 - [info] done.
Sat Oct 10 11:51:29 2020 - [info] Getting new master's binlog name and position..
Sat Oct 10 11:51:29 2020 - [info] mysql-bin.000010:141523
Sat Oct 10 11:51:29 2020 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='172.16.120.11', MASTER_PORT=3358, MASTER_AUTO_POSITION=1, MASTER_USER='repler', MASTER_PASSWORD='xxx';
Sat Oct 10 11:51:29 2020 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: mysql-bin.000010, 141523, 44a4ea53-fcad-11ea-bd16-0050563b7b42:1-20746,
45d1f02a-fcad-11ea-8a44-0050562f2198:1-27
Sat Oct 10 11:51:29 2020 - [info] Executing master IP activate script:
Sat Oct 10 11:51:29 2020 - [info] /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --command=start --ssh_user=root --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358 --new_master_host=172.16.120.11 --new_master_ip=172.16.120.11 --new_master_port=3358 --new_master_user='mha' --new_master_password=xxx
Enabling the VIP - 172.16.120.128 on the new master - 172.16.120.11
Fake!!! new master rpl_semi_sync_master_enabled=1 rpl_semi_sync_slave_enabled=0
Set read_only=0 on the new master.
Creating app user on the new master..
Sat Oct 10 11:51:29 2020 - [info] OK.
Sat Oct 10 11:51:29 2020 - [info] ** Finished master recovery successfully.
Sat Oct 10 11:51:29 2020 - [info] * Phase 3: Master Recovery Phase completed.
Sat Oct 10 11:51:29 2020 - [info]
Sat Oct 10 11:51:29 2020 - [info] * Phase 4: Slaves Recovery Phase..
Sat Oct 10 11:51:29 2020 - [info]
Sat Oct 10 11:51:29 2020 - [info]
Sat Oct 10 11:51:29 2020 - [info] * Phase 4.1: Starting Slaves in parallel..
Sat Oct 10 11:51:29 2020 - [info]
Sat Oct 10 11:51:29 2020 - [info] -- Slave recovery on host 172.16.120.12(172.16.120.12:3358) started, pid: 79937. Check tmp log /masterha/cls_new//172.16.120.12_3358_20201010115128.log if it takes time..
Sat Oct 10 11:51:30 2020 - [info]
Sat Oct 10 11:51:30 2020 - [info] Log messages from 172.16.120.12 ...
Sat Oct 10 11:51:30 2020 - [info]
Sat Oct 10 11:51:29 2020 - [info] Resetting slave 172.16.120.12(172.16.120.12:3358) and starting replication from the new master 172.16.120.11(172.16.120.11:3358)..
Sat Oct 10 11:51:29 2020 - [info] Executed CHANGE MASTER.
Sat Oct 10 11:51:29 2020 - [info] Slave started.
Sat Oct 10 11:51:29 2020 - [info] gtid_wait(44a4ea53-fcad-11ea-bd16-0050563b7b42:1-20746,
45d1f02a-fcad-11ea-8a44-0050562f2198:1-27) completed on 172.16.120.12(172.16.120.12:3358). Executed 0 events.
Sat Oct 10 11:51:30 2020 - [info] End of log messages from 172.16.120.12.
Sat Oct 10 11:51:30 2020 - [info] -- Slave on host 172.16.120.12(172.16.120.12:3358) started.
Sat Oct 10 11:51:30 2020 - [info] All new slave servers recovered successfully.
Sat Oct 10 11:51:30 2020 - [info]
Sat Oct 10 11:51:30 2020 - [info] * Phase 5: New master cleanup phase..
Sat Oct 10 11:51:30 2020 - [info]
Sat Oct 10 11:51:30 2020 - [info] Resetting slave info on the new master..
Sat Oct 10 11:51:30 2020 - [info] 172.16.120.11: Resetting slave info succeeded.
Sat Oct 10 11:51:30 2020 - [info] Master failover to 172.16.120.11(172.16.120.11:3358) completed successfully.
Sat Oct 10 11:51:30 2020 - [info]

----- Failover Report -----

cls_new: MySQL Master failover 172.16.120.10(172.16.120.10:3358) to 172.16.120.11(172.16.120.11:3358) succeeded

Master 172.16.120.10(172.16.120.10:3358) is down!

Check MHA Manager logs at centos-4:/masterha/cls_new/manager.log for details.

Started automated(non-interactive) failover.
Invalidated master IP address on 172.16.120.10(172.16.120.10:3358)
Selected 172.16.120.11(172.16.120.11:3358) as a new master.
172.16.120.11(172.16.120.11:3358): OK: Applying all logs succeeded.
172.16.120.11(172.16.120.11:3358): OK: Activated master IP address.
172.16.120.12(172.16.120.12:3358): OK: Slave started, replicating from 172.16.120.11(172.16.120.11:3358)
172.16.120.11(172.16.120.11:3358): Resetting slave info succeeded.
Master failover to 172.16.120.11(172.16.120.11:3358) completed successfully.
Sat Oct 10 11:51:30 2020 - [info] Sending mail..
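
The timestamps in the log above give a rough timeline for this run: the first failed ping is logged at 11:51:18, the master is declared unreachable at 11:51:27 after the fourth failed attempt (attempts are 3 seconds apart, matching ping_interval=3), and the failover completes at 11:51:30. The small Python sketch below works out those intervals from three log lines. The `log_time` helper is hypothetical, and the first sample line is abbreviated from the original.

```python
from datetime import datetime


def log_time(line):
    """Return the timestamp at the start of an MHA manager log line."""
    # The first 24 characters look like 'Sat Oct 10 11:51:18 2020'.
    return datetime.strptime(line[:24], "%a %b %d %H:%M:%S %Y")


# Lines taken from the log above (the first one is abbreviated).
first_failure = "Sat Oct 10 11:51:18 2020 - [warning] Got error on MySQL connect ping"
master_dead = "Sat Oct 10 11:51:27 2020 - [warning] Master is not reachable from health checker!"
completed = ("Sat Oct 10 11:51:30 2020 - [info] Master failover to "
             "172.16.120.11(172.16.120.11:3358) completed successfully.")

detection = int((log_time(master_dead) - log_time(first_failure)).total_seconds())
total = int((log_time(completed) - log_time(first_failure)).total_seconds())
print(detection, total)  # 9 12
```

In this run, about 9 seconds passed between the first failed ping and the decision that the master was dead, and about 12 seconds between the first failed ping and the end of the failover.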

slave-1 became the new master

root@localhost 11:51:06 [dbms_monitor]> select * from monitor_delay;
+----+---------------------+
| id | ctime |
+----+---------------------+
| 1 | 2020-10-10 11:21:39 |
| 2 | 2020-10-10 11:31:45 |
| 3 | 2020-10-10 11:51:01 |
+----+---------------------+
3 rows in set (0.00 sec)

root@localhost 11:54:53 [dbms_monitor]> show slave status;
Empty set (0.00 sec)
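
In this run both slaves appear under "Latest slaves" in Phase 3.1 and 172.16.120.11 was chosen, whereas in the log of the previous test case only 172.16.120.12 was listed as latest and it was chosen instead. One reading consistent with both logs is "take the first candidate_master, in configuration-file order, that has received the latest relay log events". The Python sketch below expresses that reading. It is a simplification for illustration, not MHA's actual selection code, and `pick_new_master` is a hypothetical name.

```python
def pick_new_master(candidates, latest_slaves):
    """Return the first candidate that is also among the latest slaves.

    Simplified reading of the two logs in this post; not MHA's real logic.
    """
    for host in candidates:
        if host in latest_slaves:
            return host
    return None


# Order taken from 'Candidate masters from the configuration file' in the logs.
candidates = ["172.16.120.11", "172.16.120.12"]

# This test case: both slaves received the latest relay log events.
print(pick_new_master(candidates, {"172.16.120.11", "172.16.120.12"}))  # 172.16.120.11

# Previous test case: only 172.16.120.12 was listed as a latest slave.
print(pick_new_master(candidates, {"172.16.120.12"}))  # 172.16.120.12
```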

ping_type=INSERT

Start the manager, then stop the sql_thread on slave-1

Sat Oct 10 14:06:09 2020 - [info] MHA::MasterMonitor version 0.58.
Sat Oct 10 14:06:10 2020 - [info] GTID failover mode = 1
Sat Oct 10 14:06:10 2020 - [info] Dead Servers:
Sat Oct 10 14:06:10 2020 - [info] Alive Servers:
Sat Oct 10 14:06:10 2020 - [info] 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 14:06:10 2020 - [info] 172.16.120.11(172.16.120.11:3358)
Sat Oct 10 14:06:10 2020 - [info] 172.16.120.12(172.16.120.12:3358)
Sat Oct 10 14:06:10 2020 - [info] Alive Slaves:
Sat Oct 10 14:06:10 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 14:06:10 2020 - [info] GTID ON
Sat Oct 10 14:06:10 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 14:06:10 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 14:06:10 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 14:06:10 2020 - [info] GTID ON
Sat Oct 10 14:06:10 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 14:06:10 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 14:06:10 2020 - [info] Current Alive Master: 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 14:06:10 2020 - [info] Checking slave configurations..
Sat Oct 10 14:06:10 2020 - [info] Checking replication filtering settings..
Sat Oct 10 14:06:10 2020 - [info] binlog_do_db= , binlog_ignore_db=
Sat Oct 10 14:06:10 2020 - [info] Replication filtering check ok.
Sat Oct 10 14:06:10 2020 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Sat Oct 10 14:06:10 2020 - [info] Checking SSH publickey authentication settings on the current master..
Sat Oct 10 14:06:10 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
Sat Oct 10 14:06:10 2020 - [info]
172.16.120.10(172.16.120.10:3358) (current master)
+--172.16.120.11(172.16.120.11:3358)
+--172.16.120.12(172.16.120.12:3358)

Sat Oct 10 14:06:10 2020 - [info] Checking master_ip_failover_script status:
Sat Oct 10 14:06:10 2020 - [info] /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --command=status --ssh_user=root --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358
Sat Oct 10 14:06:10 2020 - [info] OK.
Sat Oct 10 14:06:10 2020 - [warning] shutdown_script is not defined.
Sat Oct 10 14:06:10 2020 - [info] Set master ping interval 3 seconds.
Sat Oct 10 14:06:10 2020 - [info] Set secondary check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12
Sat Oct 10 14:06:10 2020 - [info] Starting ping health check on 172.16.120.10(172.16.120.10:3358)..
Sat Oct 10 14:06:10 2020 - [info] Ping(INSERT) succeeded, waiting until MySQL doesn't respond..

Stop the sql_thread on slave-1
root@localhost 14:10:20 [dbms_monitor]> stop slave sql_thread;
Query OK, 0 rows affected (0.03 sec)

root@localhost 14:22:34 [dbms_monitor]> pager cat - | grep -E 'Master_Log_File|Relay_Master_Log_File|Read_Master_Log_Pos|Exec_Master_Log_Pos|Slave_IO_Running|Slave_SQL_Running|Slave_SQL_Running_State|Last|Relay_Log_File|Relay_Log_Pos'
PAGER set to 'cat - | grep -E 'Master_Log_File|Relay_Master_Log_File|Read_Master_Log_Pos|Exec_Master_Log_Pos|Slave_IO_Running|Slave_SQL_Running|Slave_SQL_Running_State|Last|Relay_Log_File|Relay_Log_Pos''
root@localhost 14:22:57 [dbms_monitor]> show slave status\G
Master_Log_File: mysql-bin.000016
Read_Master_Log_Pos: 238476
Relay_Log_File: mysql-relay-bin.000004
Relay_Log_Pos: 211835
Relay_Master_Log_File: mysql-bin.000016
Slave_IO_Running: Yes
Slave_SQL_Running: No
Last_Errno: 0
Last_Error:
Exec_Master_Log_Pos: 211756
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Slave_SQL_Running_State:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
1 row in set (0.00 sec)

root@localhost 14:22:57 [dbms_monitor]> pager
Default pager wasn't set, using stdout.
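
To put a number on how far slave-1's SQL thread has fallen behind, the two positions in the output above can be subtracted. The following is a minimal Python sketch, not part of the original test; `parse_status` and `unapplied_bytes` are hypothetical helper names, and the sample text is copied from the status output above.

```python
def parse_status(text):
    """Parse 'Key: value' lines of a vertical SHOW SLAVE STATUS output into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def unapplied_bytes(fields):
    """Bytes received by the IO thread but not yet applied by the SQL thread.

    Only meaningful when both positions refer to the same master binlog file,
    so return None otherwise.
    """
    if fields["Master_Log_File"] != fields["Relay_Master_Log_File"]:
        return None
    return int(fields["Read_Master_Log_Pos"]) - int(fields["Exec_Master_Log_Pos"])


# Sample copied from the slave-1 output above (hypothetical variable name).
STATUS = """\
Master_Log_File: mysql-bin.000016
Read_Master_Log_Pos: 238476
Relay_Master_Log_File: mysql-bin.000016
Slave_IO_Running: Yes
Slave_SQL_Running: No
Exec_Master_Log_Pos: 211756
"""

fields = parse_status(STATUS)
print(unapplied_bytes(fields))
```

With the values shown, the difference is 238476 - 211756 = 26720 bytes that the IO thread had received but the stopped SQL thread had not yet applied at that moment.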

master
root@localhost 14:22:09 [dbms_monitor]> insert into monitor_delay values(90, now());
Query OK, 1 row affected (0.00 sec)

root@localhost 14:22:29 [dbms_monitor]> select * from monitor_delay;
+----+---------------------+
| id | ctime |
+----+---------------------+
| 88 | 2020-10-10 12:21:17 |
| 90 | 2020-10-10 14:22:29 |
+----+---------------------+
2 rows in set (0.01 sec)

slave-1
root@localhost 14:22:57 [dbms_monitor]> select * from monitor_delay;
+----+---------------------+
| id | ctime |
+----+---------------------+
| 88 | 2020-10-10 12:21:17 |
+----+---------------------+
1 row in set (0.00 sec)

slave-2
root@localhost 14:22:36 [dbms_monitor]> select * from monitor_delay;
+----+---------------------+
| id | ctime |
+----+---------------------+
| 88 | 2020-10-10 12:21:17 |
| 90 | 2020-10-10 14:22:29 |
+----+---------------------+
2 rows in set (0.00 sec)

After the sql_thread is stopped, the manager keeps running normally

Shut down the master

root@localhost 14:23:38 [dbms_monitor]> shutdown;
Query OK, 0 rows affected (0.00 sec)

Conclusion: failover is triggered and succeeds

Sat Oct 10 14:25:05 2020 - [warning] Got error on MySQL insert ping: 2006 (MySQL server has gone away)
Sat Oct 10 14:25:05 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12 --user=root --master_host=172.16.120.10 --master_ip=172.16.120.10 --master_port=3358 --master_user=mha --master_password=xxx --ping_type=INSERT
Sat Oct 10 14:25:05 2020 - [info] Executing SSH check script: exit 0
Monitoring server 172.16.120.11 is reachable, Master is not reachable from 172.16.120.11. OK.
Sat Oct 10 14:25:06 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
Monitoring server 172.16.120.12 is reachable, Master is not reachable from 172.16.120.12. OK.
Sat Oct 10 14:25:06 2020 - [info] Master is not reachable from all other monitoring servers. Failover should start.
Sat Oct 10 14:25:08 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (111))
Sat Oct 10 14:25:08 2020 - [warning] Connection failed 2 time(s)..
Sat Oct 10 14:25:11 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (111))
Sat Oct 10 14:25:11 2020 - [warning] Connection failed 3 time(s)..
Sat Oct 10 14:25:14 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (111))
Sat Oct 10 14:25:14 2020 - [warning] Connection failed 4 time(s)..
Sat Oct 10 14:25:14 2020 - [warning] Master is not reachable from health checker!
Sat Oct 10 14:25:14 2020 - [warning] Master 172.16.120.10(172.16.120.10:3358) is not reachable!
Sat Oct 10 14:25:14 2020 - [warning] SSH is reachable.
Sat Oct 10 14:25:14 2020 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha/conf/masterha_default.cnf and /etc/masterha/conf/cls_new.cnf again, and trying to connect to all servers to check server status..
Sat Oct 10 14:25:14 2020 - [info] Reading default configuration from /etc/masterha/conf/masterha_default.cnf..
Sat Oct 10 14:25:14 2020 - [info] Reading application default configuration from /etc/masterha/conf/cls_new.cnf..
Sat Oct 10 14:25:14 2020 - [info] Reading server configuration from /etc/masterha/conf/cls_new.cnf..
Sat Oct 10 14:25:15 2020 - [warning] SQL Thread is stopped(no error) on 172.16.120.11(172.16.120.11:3358)
Sat Oct 10 14:25:15 2020 - [info] GTID failover mode = 1
Sat Oct 10 14:25:15 2020 - [info] Dead Servers:
Sat Oct 10 14:25:15 2020 - [info] 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 14:25:15 2020 - [info] Alive Servers:
Sat Oct 10 14:25:15 2020 - [info] 172.16.120.11(172.16.120.11:3358)
Sat Oct 10 14:25:15 2020 - [info] 172.16.120.12(172.16.120.12:3358)
Sat Oct 10 14:25:15 2020 - [info] Alive Slaves:
Sat Oct 10 14:25:15 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 14:25:15 2020 - [info] GTID ON
Sat Oct 10 14:25:15 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 14:25:15 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 14:25:15 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 14:25:15 2020 - [info] GTID ON
Sat Oct 10 14:25:15 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 14:25:15 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 14:25:15 2020 - [info] Checking slave configurations..
Sat Oct 10 14:25:15 2020 - [info] Checking replication filtering settings..
Sat Oct 10 14:25:15 2020 - [info] Replication filtering check ok.
Sat Oct 10 14:25:15 2020 - [info] Master is down!
Sat Oct 10 14:25:15 2020 - [info] Terminating monitoring script.
Sat Oct 10 14:25:15 2020 - [info] Got exit code 20 (Master dead).
Sat Oct 10 14:25:15 2020 - [info] MHA::MasterFailover version 0.58.
Sat Oct 10 14:25:15 2020 - [info] Starting master failover.
Sat Oct 10 14:25:15 2020 - [info]
Sat Oct 10 14:25:15 2020 - [info] * Phase 1: Configuration Check Phase..
Sat Oct 10 14:25:15 2020 - [info]
Sat Oct 10 14:25:16 2020 - [warning] SQL Thread is stopped(no error) on 172.16.120.11(172.16.120.11:3358)
Sat Oct 10 14:25:16 2020 - [info] GTID failover mode = 1
Sat Oct 10 14:25:16 2020 - [info] Dead Servers:
Sat Oct 10 14:25:16 2020 - [info] 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 14:25:16 2020 - [info] Alive Servers:
Sat Oct 10 14:25:16 2020 - [info] 172.16.120.11(172.16.120.11:3358)
Sat Oct 10 14:25:16 2020 - [info] 172.16.120.12(172.16.120.12:3358)
Sat Oct 10 14:25:16 2020 - [info] Alive Slaves:
Sat Oct 10 14:25:16 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 14:25:16 2020 - [info] GTID ON
Sat Oct 10 14:25:16 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 14:25:16 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 14:25:16 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 14:25:16 2020 - [info] GTID ON
Sat Oct 10 14:25:16 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 14:25:16 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 14:25:16 2020 - [info] Starting SQL thread on 172.16.120.11(172.16.120.11:3358) ..
Sat Oct 10 14:25:16 2020 - [info] done.
Sat Oct 10 14:25:16 2020 - [info] Starting GTID based failover.
Sat Oct 10 14:25:16 2020 - [info]
Sat Oct 10 14:25:16 2020 - [info] ** Phase 1: Configuration Check Phase completed.
Sat Oct 10 14:25:16 2020 - [info]
Sat Oct 10 14:25:16 2020 - [info] * Phase 2: Dead Master Shutdown Phase..
Sat Oct 10 14:25:16 2020 - [info]
Sat Oct 10 14:25:16 2020 - [info] Forcing shutdown so that applications never connect to the current master..
Sat Oct 10 14:25:16 2020 - [info] Executing master IP deactivation script:
Sat Oct 10 14:25:16 2020 - [info] /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358 --command=stopssh --ssh_user=root
Disabling the VIP on old master: 172.16.120.10
RTNETLINK answers: Cannot assign requested address
Fake!!! Old master rpl_semi_sync_master_enabled=0 rpl_semi_sync_slave_enabled=1
Sat Oct 10 14:25:16 2020 - [info] done.
Sat Oct 10 14:25:16 2020 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master.
Sat Oct 10 14:25:16 2020 - [info] * Phase 2: Dead Master Shutdown Phase completed.
Sat Oct 10 14:25:16 2020 - [info]
Sat Oct 10 14:25:16 2020 - [info] * Phase 3: Master Recovery Phase..
Sat Oct 10 14:25:16 2020 - [info]
Sat Oct 10 14:25:16 2020 - [info] * Phase 3.1: Getting Latest Slaves Phase..
Sat Oct 10 14:25:16 2020 - [info]
Sat Oct 10 14:25:16 2020 - [info] The latest binary log file/position on all slaves is mysql-bin.000016:268346
Sat Oct 10 14:25:16 2020 - [info] Retrieved Gtid Set: 44a4ea53-fcad-11ea-bd16-0050563b7b42:21217-22354
Sat Oct 10 14:25:16 2020 - [info] Latest slaves (Slaves that received relay log files to the latest):
Sat Oct 10 14:25:16 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 14:25:16 2020 - [info] GTID ON
Sat Oct 10 14:25:16 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 14:25:16 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 14:25:16 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 14:25:16 2020 - [info] GTID ON
Sat Oct 10 14:25:16 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 14:25:16 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 14:25:16 2020 - [info] The oldest binary log file/position on all slaves is mysql-bin.000016:268346
Sat Oct 10 14:25:16 2020 - [info] Retrieved Gtid Set: 44a4ea53-fcad-11ea-bd16-0050563b7b42:21217-22354
Sat Oct 10 14:25:16 2020 - [info] Oldest slaves:
Sat Oct 10 14:25:16 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 14:25:16 2020 - [info] GTID ON
Sat Oct 10 14:25:16 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 14:25:16 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 14:25:16 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 14:25:16 2020 - [info] GTID ON
Sat Oct 10 14:25:16 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 14:25:16 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 14:25:16 2020 - [info]
Sat Oct 10 14:25:16 2020 - [info] * Phase 3.3: Determining New Master Phase..
Sat Oct 10 14:25:16 2020 - [info]
Sat Oct 10 14:25:16 2020 - [info] Searching new master from slaves..
Sat Oct 10 14:25:16 2020 - [info] Candidate masters from the configuration file:
Sat Oct 10 14:25:16 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 14:25:16 2020 - [info] GTID ON
Sat Oct 10 14:25:16 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 14:25:16 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 14:25:16 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 14:25:16 2020 - [info] GTID ON
Sat Oct 10 14:25:16 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 14:25:16 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 14:25:16 2020 - [info] Non-candidate masters:
Sat Oct 10 14:25:16 2020 - [info] Searching from candidate_master slaves which have received the latest relay log events..
Sat Oct 10 14:25:16 2020 - [info] New master is 172.16.120.11(172.16.120.11:3358)
Sat Oct 10 14:25:16 2020 - [info] Starting master failover..
Sat Oct 10 14:25:16 2020 - [info]
From:
172.16.120.10(172.16.120.10:3358) (current master)
+--172.16.120.11(172.16.120.11:3358)
+--172.16.120.12(172.16.120.12:3358)

To:
172.16.120.11(172.16.120.11:3358) (new master)
+--172.16.120.12(172.16.120.12:3358)
Sat Oct 10 14:25:16 2020 - [info]
Sat Oct 10 14:25:16 2020 - [info] * Phase 3.3: New Master Recovery Phase..
Sat Oct 10 14:25:16 2020 - [info]
Sat Oct 10 14:25:16 2020 - [info] Waiting all logs to be applied..
Sat Oct 10 14:25:17 2020 - [info] done.
Sat Oct 10 14:25:17 2020 - [info] Replicating from the latest slave 172.16.120.12(172.16.120.12:3358) and waiting to apply..
Sat Oct 10 14:25:17 2020 - [info] Waiting all logs to be applied on the latest slave..
Sat Oct 10 14:25:17 2020 - [info] Resetting slave 172.16.120.11(172.16.120.11:3358) and starting replication from the new master 172.16.120.12(172.16.120.12:3358)..
Sat Oct 10 14:25:17 2020 - [info] Executed CHANGE MASTER.
Sat Oct 10 14:25:17 2020 - [info] Slave started.
Sat Oct 10 14:25:17 2020 - [info] Waiting to execute all relay logs on 172.16.120.11(172.16.120.11:3358)..
Sat Oct 10 14:25:17 2020 - [info] master_pos_wait(mysql-bin.000007:3608644) completed on 172.16.120.11(172.16.120.11:3358). Executed 1 events.
Sat Oct 10 14:25:17 2020 - [info] done.
Sat Oct 10 14:25:17 2020 - [info] done.
Sat Oct 10 14:25:17 2020 - [info] Getting new master's binlog name and position..
Sat Oct 10 14:25:17 2020 - [info] mysql-bin.000010:517718
Sat Oct 10 14:25:17 2020 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='172.16.120.11', MASTER_PORT=3358, MASTER_AUTO_POSITION=1, MASTER_USER='repler', MASTER_PASSWORD='xxx';
Sat Oct 10 14:25:17 2020 - [info] Master Recovery succeeded. File:Pos:Exec_Gtid_Set: mysql-bin.000010, 517718, 44a4ea53-fcad-11ea-bd16-0050563b7b42:1-22354,
45d1f02a-fcad-11ea-8a44-0050562f2198:1-27
Sat Oct 10 14:25:17 2020 - [info] Executing master IP activate script:
Sat Oct 10 14:25:17 2020 - [info] /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --command=start --ssh_user=root --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358 --new_master_host=172.16.120.11 --new_master_ip=172.16.120.11 --new_master_port=3358 --new_master_user='mha' --new_master_password=xxx
Enabling the VIP - 172.16.120.128 on the new master - 172.16.120.11
Fake!!! New master rpl_semi_sync_master_enabled=1 rpl_semi_sync_slave_enabled=0
Set read_only=0 on the new master.
Creating app user on the new master..
Sat Oct 10 14:25:18 2020 - [info] OK.
Sat Oct 10 14:25:18 2020 - [info] ** Finished master recovery successfully.
Sat Oct 10 14:25:18 2020 - [info] * Phase 3: Master Recovery Phase completed.
Sat Oct 10 14:25:18 2020 - [info]
Sat Oct 10 14:25:18 2020 - [info] * Phase 4: Slaves Recovery Phase..
Sat Oct 10 14:25:18 2020 - [info]
Sat Oct 10 14:25:18 2020 - [info]
Sat Oct 10 14:25:18 2020 - [info] * Phase 4.1: Starting Slaves in parallel..
Sat Oct 10 14:25:18 2020 - [info]
Sat Oct 10 14:25:18 2020 - [info] -- Slave recovery on host 172.16.120.12(172.16.120.12:3358) started, pid: 89417. Check tmp log /masterha/cls_new//172.16.120.12_3358_20201010142515.log if it takes time..
Sat Oct 10 14:25:19 2020 - [info]
Sat Oct 10 14:25:19 2020 - [info] Log messages from 172.16.120.12 ...
Sat Oct 10 14:25:19 2020 - [info]
Sat Oct 10 14:25:18 2020 - [info] Resetting slave 172.16.120.12(172.16.120.12:3358) and starting replication from the new master 172.16.120.11(172.16.120.11:3358)..
Sat Oct 10 14:25:18 2020 - [info] Executed CHANGE MASTER.
Sat Oct 10 14:25:18 2020 - [info] Slave started.
Sat Oct 10 14:25:18 2020 - [info] gtid_wait(44a4ea53-fcad-11ea-bd16-0050563b7b42:1-22354,
45d1f02a-fcad-11ea-8a44-0050562f2198:1-27) completed on 172.16.120.12(172.16.120.12:3358). Executed 0 events.
Sat Oct 10 14:25:19 2020 - [info] End of log messages from 172.16.120.12.
Sat Oct 10 14:25:19 2020 - [info] -- Slave on host 172.16.120.12(172.16.120.12:3358) started.
Sat Oct 10 14:25:19 2020 - [info] All new slave servers recovered successfully.
Sat Oct 10 14:25:19 2020 - [info]
Sat Oct 10 14:25:19 2020 - [info] * Phase 5: New master cleanup phase..
Sat Oct 10 14:25:19 2020 - [info]
Sat Oct 10 14:25:19 2020 - [info] Resetting slave info on the new master..
Sat Oct 10 14:25:19 2020 - [info] 172.16.120.11: Resetting slave info succeeded.
Sat Oct 10 14:25:19 2020 - [info] Master failover to 172.16.120.11(172.16.120.11:3358) completed successfully.
Sat Oct 10 14:25:19 2020 - [info]

----- Failover Report -----

cls_new: MySQL Master failover 172.16.120.10(172.16.120.10:3358) to 172.16.120.11(172.16.120.11:3358) succeeded

Master 172.16.120.10(172.16.120.10:3358) is down!

Check MHA Manager logs at centos-4:/masterha/cls_new/manager.log for details.

Started automated(non-interactive) failover.
Invalidated master IP address on 172.16.120.10(172.16.120.10:3358)
Selected 172.16.120.11(172.16.120.11:3358) as a new master.
172.16.120.11(172.16.120.11:3358): OK: Applying all logs succeeded.
172.16.120.11(172.16.120.11:3358): OK: Activated master IP address.
172.16.120.12(172.16.120.12:3358): OK: Slave started, replicating from 172.16.120.11(172.16.120.11:3358)
172.16.120.11(172.16.120.11:3358): Resetting slave info succeeded.
Master failover to 172.16.120.11(172.16.120.11:3358) completed successfully.
Sat Oct 10 14:25:19 2020 - [info] Sending mail..

slave-1


root@localhost 14:23:56 [dbms_monitor]> select * from monitor_delay;
+----+---------------------+
| id | ctime |
+----+---------------------+
| 88 | 2020-10-10 12:21:17 |
| 90 | 2020-10-10 14:22:29 |
+----+---------------------+
2 rows in set (0.00 sec)
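
The manager log above warns `SQL Thread is stopped(no error)` and then starts the SQL thread itself, whereas the next test case logs `SQL Thread is stopped(error)` and aborts. The sketch below only mirrors that observed distinction using fields from `show slave status`; it is an assumption about the behavior, not MHA's actual code, and `sql_thread_state` is a hypothetical name.

```python
def sql_thread_state(fields):
    """Classify the SQL thread the way the manager log messages describe it.

    `fields` is a dict of SHOW SLAVE STATUS columns (string values).
    Assumption: a stopped thread with Last_SQL_Errno == 0 was stopped on
    purpose, and any non-zero errno means it stopped because of an error.
    """
    if fields.get("Slave_SQL_Running") == "Yes":
        return "running"
    if int(fields.get("Last_SQL_Errno") or 0) == 0:
        return "stopped(no error)"
    return "stopped(error)"


# Values taken from the two slave-1 status outputs in this post.
print(sql_thread_state({"Slave_SQL_Running": "No", "Last_SQL_Errno": "0"}))
print(sql_thread_state({"Slave_SQL_Running": "No", "Last_SQL_Errno": "1051"}))
```

A check like this could be run against each slave before relying on automatic failover, since only the first stopped state was recovered automatically in these tests.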

[Test case] Master is down and a slave also has a problem, part 5 (sql_thread error on some slaves)

The master goes down; before that, slave-1's sql_thread had already stopped with an error

ping_type=CONNECT

Start the manager

Sat Oct 10 14:34:42 2020 - [info] MHA::MasterMonitor version 0.58.
Sat Oct 10 14:34:43 2020 - [info] GTID failover mode = 1
Sat Oct 10 14:34:43 2020 - [info] Dead Servers:
Sat Oct 10 14:34:43 2020 - [info] Alive Servers:
Sat Oct 10 14:34:43 2020 - [info] 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 14:34:43 2020 - [info] 172.16.120.11(172.16.120.11:3358)
Sat Oct 10 14:34:43 2020 - [info] 172.16.120.12(172.16.120.12:3358)
Sat Oct 10 14:34:43 2020 - [info] Alive Slaves:
Sat Oct 10 14:34:43 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 14:34:43 2020 - [info] GTID ON
Sat Oct 10 14:34:43 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 14:34:43 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 14:34:43 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.31-34-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 14:34:43 2020 - [info] GTID ON
Sat Oct 10 14:34:43 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 14:34:43 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 14:34:43 2020 - [info] Current Alive Master: 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 14:34:43 2020 - [info] Checking slave configurations..
Sat Oct 10 14:34:43 2020 - [info] Checking replication filtering settings..
Sat Oct 10 14:34:43 2020 - [info] binlog_do_db= , binlog_ignore_db=
Sat Oct 10 14:34:43 2020 - [info] Replication filtering check ok.
Sat Oct 10 14:34:43 2020 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Sat Oct 10 14:34:43 2020 - [info] Checking SSH publickey authentication settings on the current master..
Sat Oct 10 14:34:43 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
Sat Oct 10 14:34:43 2020 - [info]
172.16.120.10(172.16.120.10:3358) (current master)
+--172.16.120.11(172.16.120.11:3358)
+--172.16.120.12(172.16.120.12:3358)

Sat Oct 10 14:34:43 2020 - [info] Checking master_ip_failover_script status:
Sat Oct 10 14:34:43 2020 - [info] /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --command=status --ssh_user=root --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358
Sat Oct 10 14:34:43 2020 - [info] OK.
Sat Oct 10 14:34:43 2020 - [warning] shutdown_script is not defined.
Sat Oct 10 14:34:43 2020 - [info] Set master ping interval 3 seconds.
Sat Oct 10 14:34:43 2020 - [info] Set secondary check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12
Sat Oct 10 14:34:43 2020 - [info] Starting ping health check on 172.16.120.10(172.16.120.10:3358)..
Sat Oct 10 14:34:43 2020 - [info] Ping(CONNECT) succeeded, waiting until MySQL doesn't respond..

Induce a sql_thread error on slave-1

  1. Create a table on the master
    root@localhost 14:35:55 [dbms_monitor]> create table make_error(id int not null auto_increment primary key);
    Query OK, 0 rows affected (0.02 sec)
  2. Drop the make_error table on slave-1 (with binary logging disabled for the session)
    root@localhost 14:38:10 [dbms_monitor]> set global super_read_only=0;
    Query OK, 0 rows affected (0.00 sec)

    root@localhost 14:38:14 [dbms_monitor]> set sql_log_bin=0;
    Query OK, 0 rows affected (0.00 sec)

    root@localhost 14:38:17 [dbms_monitor]> drop table make_error;
    Query OK, 0 rows affected (0.01 sec)

    root@localhost 14:38:20 [dbms_monitor]> set sql_log_bin=1;
    Query OK, 0 rows affected (0.00 sec)
  3. Drop the make_error table on the master (slave-1's sql_thread will then fail)
    root@localhost 14:36:28 [dbms_monitor]> drop table make_error;
    Query OK, 0 rows affected (0.01 sec)
  4. Check the replication status on slave-1
    root@localhost 14:38:26 [dbms_monitor]> show slave status\G
    *************************** 1. row ***************************
    Slave_IO_State: Waiting for master to send event
    Master_Host: 172.16.120.10
    Master_User: repler
    Master_Port: 3358
    Connect_Retry: 1
    Master_Log_File: mysql-bin.000017
    Read_Master_Log_Pos: 620
    Relay_Log_File: mysql-relay-bin.000002
    Relay_Log_Pos: 589
    Relay_Master_Log_File: mysql-bin.000017
    Slave_IO_Running: Yes
    Slave_SQL_Running: No
    Replicate_Do_DB:
    Replicate_Ignore_DB:
    Replicate_Do_Table:
    Replicate_Ignore_Table:
    Replicate_Wild_Do_Table:
    Replicate_Wild_Ignore_Table:
    Last_Errno: 1051
    Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '44a4ea53-fcad-11ea-bd16-0050563b7b42:22356' at master log mysql-bin.000017, end_log_pos 620. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
    Skip_Counter: 0
    Exec_Master_Log_Pos: 416
    Relay_Log_Space: 1000
    Until_Condition: None
    Until_Log_File:
    Until_Log_Pos: 0
    Master_SSL_Allowed: No
    Master_SSL_CA_File:
    Master_SSL_CA_Path:
    Master_SSL_Cert:
    Master_SSL_Cipher:
    Master_SSL_Key:
    Seconds_Behind_Master: NULL
    Master_SSL_Verify_Server_Cert: No
    Last_IO_Errno: 0
    Last_IO_Error:
    Last_SQL_Errno: 1051
    Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '44a4ea53-fcad-11ea-bd16-0050563b7b42:22356' at master log mysql-bin.000017, end_log_pos 620. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
    Replicate_Ignore_Server_Ids:
    Master_Server_Id: 120103358
    Master_UUID: 44a4ea53-fcad-11ea-bd16-0050563b7b42
    Master_Info_File: mysql.slave_master_info
    SQL_Delay: 0
    SQL_Remaining_Delay: NULL
    Slave_SQL_Running_State:
    Master_Retry_Count: 86400
    Master_Bind:
    Last_IO_Error_Timestamp:
    Last_SQL_Error_Timestamp: 201010 14:39:25
    Master_SSL_Crl:
    Master_SSL_Crlpath:
    Retrieved_Gtid_Set: 44a4ea53-fcad-11ea-bd16-0050563b7b42:22355-22356
    Executed_Gtid_Set: 44a4ea53-fcad-11ea-bd16-0050563b7b42:1-22355,
    45d1f02a-fcad-11ea-8a44-0050562f2198:1-27
    Auto_Position: 1
    Replicate_Rewrite_DB:
    Channel_Name:
    Master_TLS_Version:
    1 row in set (0.00 sec)
    At this point the manager is still running normally
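
The status output above already identifies the stuck transaction: subtracting `Executed_Gtid_Set` from `Retrieved_Gtid_Set` leaves the GTID named in `Last_SQL_Error`. The following Python sketch works through that arithmetic; it is illustrative only, and `parse_gtid_set` and `not_yet_applied` are hypothetical helper names rather than any MySQL or MHA API.

```python
def parse_gtid_set(text):
    """Map each source UUID to the set of transaction numbers it covers."""
    result = {}
    for part in text.replace("\n", "").split(","):
        part = part.strip()
        if not part:
            continue
        # Format is uuid:interval[:interval...], an interval being N or N-M.
        uuid, *intervals = part.split(":")
        numbers = result.setdefault(uuid, set())
        for interval in intervals:
            start, _, end = interval.partition("-")
            numbers.update(range(int(start), int(end or start) + 1))
    return result


def not_yet_applied(retrieved, executed):
    """Transactions present in `retrieved` but absent from `executed`."""
    done = parse_gtid_set(executed)
    missing = {}
    for uuid, numbers in parse_gtid_set(retrieved).items():
        remaining = sorted(numbers - done.get(uuid, set()))
        if remaining:
            missing[uuid] = remaining
    return missing


# Values copied from the slave-1 status output above.
RETRIEVED = "44a4ea53-fcad-11ea-bd16-0050563b7b42:22355-22356"
EXECUTED = (
    "44a4ea53-fcad-11ea-bd16-0050563b7b42:1-22355,\n"
    "45d1f02a-fcad-11ea-8a44-0050562f2198:1-27"
)

print(not_yet_applied(RETRIEVED, EXECUTED))
```

For these values the only remaining transaction is 22356 from the master's UUID, which matches the transaction reported as failed in `Last_SQL_Error`. Expanding intervals into Python sets keeps the sketch short; for very large GTID ranges an interval-based subtraction would be the better design.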

Shut down the master

root@localhost 14:39:25 [dbms_monitor]> shutdown;
Query OK, 0 rows affected (0.00 sec)

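Conclusion: failover is triggered but fails (the manager finds that slave-1 "is alive, but does not work as a slave" and exits with code 1)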

Sat Oct 10 14:40:59 2020 - [warning] Got error on MySQL connect ping: DBI connect(';host=172.16.120.10;port=3358;mysql_connect_timeout=1','mha',...) failed: Can't connect to MySQL server on '172.16.120.10' (111) at /usr/local/share/perl5/MHA/HealthCheck.pm line 98.
2003 (Can't connect to MySQL server on '172.16.120.10' (111))
Sat Oct 10 14:40:59 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12 --user=root --master_host=172.16.120.10 --master_ip=172.16.120.10 --master_port=3358 --master_user=mha --master_password=xxx --ping_type=CONNECT
Sat Oct 10 14:40:59 2020 - [info] Executing SSH check script: exit 0
Sat Oct 10 14:40:59 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
Monitoring server 172.16.120.11 is reachable, Master is not reachable from 172.16.120.11. OK.
Monitoring server 172.16.120.12 is reachable, Master is not reachable from 172.16.120.12. OK.
Sat Oct 10 14:40:59 2020 - [info] Master is not reachable from all other monitoring servers. Failover should start.
Sat Oct 10 14:41:02 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (111))
Sat Oct 10 14:41:02 2020 - [warning] Connection failed 2 time(s)..
Sat Oct 10 14:41:05 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (111))
Sat Oct 10 14:41:05 2020 - [warning] Connection failed 3 time(s)..
Sat Oct 10 14:41:08 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (111))
Sat Oct 10 14:41:08 2020 - [warning] Connection failed 4 time(s)..
Sat Oct 10 14:41:08 2020 - [warning] Master is not reachable from health checker!
Sat Oct 10 14:41:08 2020 - [warning] Master 172.16.120.10(172.16.120.10:3358) is not reachable!
Sat Oct 10 14:41:08 2020 - [warning] SSH is reachable.
Sat Oct 10 14:41:08 2020 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha/conf/masterha_default.cnf and /etc/masterha/conf/cls_new.cnf again, and trying to connect to all servers to check server status..
Sat Oct 10 14:41:08 2020 - [info] Reading default configuration from /etc/masterha/conf/masterha_default.cnf..
Sat Oct 10 14:41:08 2020 - [info] Reading application default configuration from /etc/masterha/conf/cls_new.cnf..
Sat Oct 10 14:41:08 2020 - [info] Reading server configuration from /etc/masterha/conf/cls_new.cnf..
Sat Oct 10 14:41:09 2020 - [error][/usr/local/share/perl5/MHA/Server.pm, ln935] SQL Thread is stopped(error) on 172.16.120.11(172.16.120.11:3358)! Errno:1051, Error:Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '44a4ea53-fcad-11ea-bd16-0050563b7b42:22356' at master log mysql-bin.000017, end_log_pos 620. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
Sat Oct 10 14:41:09 2020 - [error][/usr/local/share/perl5/MHA/ServerManager.pm, ln703] Server 172.16.120.11(172.16.120.11:3358) is alive, but does not work as a slave!
Sat Oct 10 14:41:09 2020 - [warning] Got Error: at /usr/local/share/perl5/MHA/MasterMonitor.pm line 560.
Sat Oct 10 14:41:09 2020 - [info] Got exit code 1 (Not master dead).

ping_type=INSERT

Start the manager

Sat Oct 10 15:54:20 2020 - [info] MHA::MasterMonitor version 0.58.
Sat Oct 10 15:54:21 2020 - [info] GTID failover mode = 1
Sat Oct 10 15:54:21 2020 - [info] Dead Servers:
Sat Oct 10 15:54:21 2020 - [info] Alive Servers:
Sat Oct 10 15:54:21 2020 - [info] 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 15:54:21 2020 - [info] 172.16.120.11(172.16.120.11:3358)
Sat Oct 10 15:54:21 2020 - [info] 172.16.120.12(172.16.120.12:3358)
Sat Oct 10 15:54:21 2020 - [info] Alive Slaves:
Sat Oct 10 15:54:21 2020 - [info] 172.16.120.11(172.16.120.11:3358) Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 15:54:21 2020 - [info] GTID ON
Sat Oct 10 15:54:21 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 15:54:21 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 15:54:21 2020 - [info] 172.16.120.12(172.16.120.12:3358) Version=5.7.29-32-log (oldest major version between slaves) log-bin:enabled
Sat Oct 10 15:54:21 2020 - [info] GTID ON
Sat Oct 10 15:54:21 2020 - [info] Replicating from 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 15:54:21 2020 - [info] Primary candidate for the new Master (candidate_master is set)
Sat Oct 10 15:54:21 2020 - [info] Current Alive Master: 172.16.120.10(172.16.120.10:3358)
Sat Oct 10 15:54:21 2020 - [info] Checking slave configurations..
Sat Oct 10 15:54:21 2020 - [info] Checking replication filtering settings..
Sat Oct 10 15:54:21 2020 - [info] binlog_do_db= , binlog_ignore_db=
Sat Oct 10 15:54:21 2020 - [info] Replication filtering check ok.
Sat Oct 10 15:54:21 2020 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Sat Oct 10 15:54:21 2020 - [info] Checking SSH publickey authentication settings on the current master..
Sat Oct 10 15:54:21 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
Sat Oct 10 15:54:21 2020 - [info]
172.16.120.10(172.16.120.10:3358) (current master)
+--172.16.120.11(172.16.120.11:3358)
+--172.16.120.12(172.16.120.12:3358)

Sat Oct 10 15:54:21 2020 - [info] Checking master_ip_failover_script status:
Sat Oct 10 15:54:21 2020 - [info] /etc/masterha/scripts/master_ip_failover_vip --vip=172.16.120.128 --command=status --ssh_user=root --orig_master_host=172.16.120.10 --orig_master_ip=172.16.120.10 --orig_master_port=3358
Sat Oct 10 15:54:21 2020 - [info] OK.
Sat Oct 10 15:54:21 2020 - [warning] shutdown_script is not defined.
Sat Oct 10 15:54:21 2020 - [info] Set master ping interval 3 seconds.
Sat Oct 10 15:54:21 2020 - [info] Set secondary check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12
Sat Oct 10 15:54:21 2020 - [info] Starting ping health check on 172.16.120.10(172.16.120.10:3358)..
Sat Oct 10 15:54:21 2020 - [info] Ping(INSERT) succeeded, waiting until MySQL doesn't respond..

Preparation steps are omitted; the sql_thread on slave-1 has stopped with an error

root@localhost 15:55:13 [dbms_monitor]> show slave status\G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 172.16.120.10
Master_User: repler
Master_Port: 3358
Connect_Retry: 1
Master_Log_File: mysql-bin.000019
Read_Master_Log_Pos: 16331
Relay_Log_File: mysql-relay-bin.000007
Relay_Log_Pos: 14926
Relay_Master_Log_File: mysql-bin.000019
Slave_IO_Running: Yes
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 1051
Last_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '44a4ea53-fcad-11ea-bd16-0050563b7b42:22419' at master log mysql-bin.000019, end_log_pos 14917. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
Skip_Counter: 0
Exec_Master_Log_Pos: 14713
Relay_Log_Space: 19342
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 1051
Last_SQL_Error: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '44a4ea53-fcad-11ea-bd16-0050563b7b42:22419' at master log mysql-bin.000019, end_log_pos 14917. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
Replicate_Ignore_Server_Ids:
Master_Server_Id: 120103358
Master_UUID: 44a4ea53-fcad-11ea-bd16-0050563b7b42
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State:
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp: 201010 15:55:16
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: 44a4ea53-fcad-11ea-bd16-0050563b7b42:22356-22425
Executed_Gtid_Set: 44a4ea53-fcad-11ea-bd16-0050563b7b42:1-22418,
45d1f02a-fcad-11ea-8a44-0050562f2198:1-27
Auto_Position: 1
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
1 row in set (0.00 sec)
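
The state MHA later complains about can be read straight from this output: Slave_IO_Running is Yes, Slave_SQL_Running is No, and Last_SQL_Errno is non-zero. The following is a minimal Python sketch, not part of MHA, that parses the vertical output and flags this condition. The helper names (`parse_vertical_status`, `sql_thread_failed`, `next_gtid_to_apply`) are hypothetical, and `next_gtid_to_apply` assumes the master's UUID appears in Executed_Gtid_Set as a single contiguous interval, as it does here.

```python
import re

# A field line looks like "Name: value" or "Name:" (empty value).
FIELD = re.compile(r"^(\w+):(?: (.*))?$")


def parse_vertical_status(text):
    """Turn vertical-format SHOW SLAVE STATUS output into a dict."""
    status = {}
    last_key = None
    for raw in text.splitlines():
        line = raw.strip()
        match = FIELD.match(line)
        if match:
            last_key = match.group(1)
            status[last_key] = match.group(2) or ""
        elif (last_key and line and not line.startswith("*")
              and "row in set" not in line):
            # Continuation line, e.g. the second UUID of Executed_Gtid_Set.
            status[last_key] += line
    return status


def sql_thread_failed(status):
    """True when the SQL thread is stopped and an SQL error is recorded."""
    return (status.get("Slave_SQL_Running") == "No"
            and status.get("Last_SQL_Errno", "0") != "0")


def next_gtid_to_apply(status):
    """Sequence number of the first master transaction not yet executed.

    Assumption: the master's UUID has one contiguous interval.
    """
    prefix = status["Master_UUID"] + ":"
    for part in status["Executed_Gtid_Set"].split(","):
        if part.startswith(prefix):
            last_interval = part.split(":")[-1]
            return int(last_interval.split("-")[-1]) + 1
    return None


# Excerpt of the output shown above.
SAMPLE = """
Slave_IO_Running: Yes
Slave_SQL_Running: No
Last_SQL_Errno: 1051
Master_UUID: 44a4ea53-fcad-11ea-bd16-0050563b7b42
Executed_Gtid_Set: 44a4ea53-fcad-11ea-bd16-0050563b7b42:1-22418,
45d1f02a-fcad-11ea-8a44-0050562f2198:1-27
Auto_Position: 1
"""

status = parse_vertical_status(SAMPLE)
print(sql_thread_failed(status))   # True
print(next_gtid_to_apply(status))  # 22419
```

The value 22419 matches the transaction named in Last_SQL_Error, which is the one the worker failed to apply.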

Shut down the master

root@localhost 15:55:16 [dbms_monitor]> shutdown;
Query OK, 0 rows affected (0.00 sec)

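Conclusion: failover is triggered, but it fails. Once the secondary check confirms that the master is unreachable, the manager re-reads its configuration and re-checks every server, finds that the SQL thread on 172.16.120.11 has stopped with error 1051, and exits with code 1 (Not master dead).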

Sat Oct 10 15:56:03 2020 - [warning] Got error on MySQL insert ping: 2006 (MySQL server has gone away)
Sat Oct 10 15:56:03 2020 - [info] Executing secondary network check script: masterha_secondary_check -s 172.16.120.11 -s 172.16.120.12 --user=root --master_host=172.16.120.10 --master_ip=172.16.120.10 --master_port=3358 --master_user=mha --master_password=xxx --ping_type=INSERT
Sat Oct 10 15:56:03 2020 - [info] Executing SSH check script: exit 0
Sat Oct 10 15:56:04 2020 - [info] HealthCheck: SSH to 172.16.120.10 is reachable.
Monitoring server 172.16.120.11 is reachable, Master is not reachable from 172.16.120.11. OK.
Monitoring server 172.16.120.12 is reachable, Master is not reachable from 172.16.120.12. OK.
Sat Oct 10 15:56:04 2020 - [info] Master is not reachable from all other monitoring servers. Failover should start.
Sat Oct 10 15:56:06 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (111))
Sat Oct 10 15:56:06 2020 - [warning] Connection failed 2 time(s)..
Sat Oct 10 15:56:09 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (111))
Sat Oct 10 15:56:09 2020 - [warning] Connection failed 3 time(s)..
Sat Oct 10 15:56:12 2020 - [warning] Got error on MySQL connect: 2003 (Can't connect to MySQL server on '172.16.120.10' (111))
Sat Oct 10 15:56:12 2020 - [warning] Connection failed 4 time(s)..
Sat Oct 10 15:56:12 2020 - [warning] Master is not reachable from health checker!
Sat Oct 10 15:56:12 2020 - [warning] Master 172.16.120.10(172.16.120.10:3358) is not reachable!
Sat Oct 10 15:56:12 2020 - [warning] SSH is reachable.
Sat Oct 10 15:56:12 2020 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha/conf/masterha_default.cnf and /etc/masterha/conf/cls_new.cnf again, and trying to connect to all servers to check server status..
Sat Oct 10 15:56:12 2020 - [info] Reading default configuration from /etc/masterha/conf/masterha_default.cnf..
Sat Oct 10 15:56:12 2020 - [info] Reading application default configuration from /etc/masterha/conf/cls_new.cnf..
Sat Oct 10 15:56:12 2020 - [info] Reading server configuration from /etc/masterha/conf/cls_new.cnf..
Sat Oct 10 15:56:13 2020 - [error][/usr/local/share/perl5/MHA/Server.pm, ln935] SQL Thread is stopped(error) on 172.16.120.11(172.16.120.11:3358)! Errno:1051, Error:Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction '44a4ea53-fcad-11ea-bd16-0050563b7b42:22419' at master log mysql-bin.000019, end_log_pos 14917. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
Sat Oct 10 15:56:13 2020 - [error][/usr/local/share/perl5/MHA/ServerManager.pm, ln703] Server 172.16.120.11(172.16.120.11:3358) is alive, but does not work as a slave!
Sat Oct 10 15:56:13 2020 - [warning] Got Error: at /usr/local/share/perl5/MHA/MasterMonitor.pm line 560.
Sat Oct 10 15:56:13 2020 - [info] Got exit code 1 (Not master dead).
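
When comparing several test runs, it can help to reduce a manager log to the three facts that matter in this case: whether the secondary check concluded that failover should start, which [error] lines followed, and the final exit code. The following is a rough Python sketch of such a summary. It is not an MHA tool; `summarize_manager_log` and the patterns are assumptions based only on the log lines shown above.

```python
import re

# Patterns modeled on the manager log lines quoted in this post.
EXIT_CODE = re.compile(r"Got exit code (\d+) \((.*?)\)")
ERROR_LINE = re.compile(r"\[error\]\[[^\]]*\] (.*)")


def summarize_manager_log(lines):
    """Collect the secondary-check verdict, error messages, and exit code."""
    summary = {
        "failover_should_start": False,
        "errors": [],
        "exit_code": None,
        "exit_reason": None,
    }
    for line in lines:
        if "Failover should start" in line:
            summary["failover_should_start"] = True
        error = ERROR_LINE.search(line)
        if error:
            summary["errors"].append(error.group(1))
        exit_info = EXIT_CODE.search(line)
        if exit_info:
            summary["exit_code"] = int(exit_info.group(1))
            summary["exit_reason"] = exit_info.group(2)
    return summary


# Three lines taken from the log above.
LOG = [
    "Sat Oct 10 15:56:04 2020 - [info] Master is not reachable from all "
    "other monitoring servers. Failover should start.",
    "Sat Oct 10 15:56:13 2020 - [error][/usr/local/share/perl5/MHA/"
    "ServerManager.pm, ln703] Server 172.16.120.11(172.16.120.11:3358) "
    "is alive, but does not work as a slave!",
    "Sat Oct 10 15:56:13 2020 - [info] Got exit code 1 (Not master dead).",
]

result = summarize_manager_log(LOG)
print(result["failover_should_start"])  # True
print(result["exit_code"], result["exit_reason"])  # 1 Not master dead
```

A run where `failover_should_start` is true but the exit reason is "Not master dead" corresponds to the "triggered, but failed" rows in the TL;DR table.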