[Translation] Seven Failure Scenarios in PXC and How to Recover from Them

2018-05-29

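Unlike standard MySQL replication, a PXC cluster acts as one logical entity that looks after the status and consistency of each node as well as the status of the cluster as a whole. This maintains much better data integrity than you can expect from traditional asynchronous replication, while still allowing safe writes on multiple nodes at the same time.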

Let's assume we have a PXC cluster with three nodes: A, B and C.

Scenario 1

Node A is gracefully stopped, for example to take the database down for maintenance or a configuration change.

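In this case the other nodes receive a "good bye" message from that node, so the cluster size shrinks (to 2 in this example) and certain properties, such as the quorum calculation and auto increment, change automatically (translator's note: I take this to mean that auto_increment_increment and auto_increment_offset are adjusted dynamically as the cluster grows or shrinks). Once we start node A again, it joins the cluster based on the wsrep_cluster_address setting in its my.cnf. This process is very different from normal replication: node A will not serve any requests until it is fully synchronized with the cluster again, so merely connecting to its peers is not enough; the state transfer must complete first. If the writeset cache (gcache.size) on node B or C still holds all the transactions node A needs to catch up, IST is used; otherwise a full SST is used. For that reason it is important to determine the best donor, as described in the article linked from the original post. If IST is impossible because transactions are missing from the donor's gcache, the donor makes the fallback decision and SST starts automatically.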
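The IST-or-SST decision described above can be sketched as simple arithmetic. This is only an illustration, not the check Galera itself runs: the function name ist_possible is made up, and it assumes that the joiner's last seqno comes from its grastate.dat, that the donor's lowest cached seqno comes from the wsrep_local_cached_downto status variable, and that both nodes share the same cluster UUID.

```shell
#!/bin/sh
# ist_possible JOINER_SEQNO DONOR_CACHED_DOWNTO   (hypothetical helper)
# Succeeds when the donor's gcache still begins at or before the first
# writeset the joiner is missing, which is joiner_seqno + 1.
ist_possible() {
    joiner_seqno=$1
    donor_cached_downto=$2
    # A seqno of -1 means the joiner's position is unknown, so IST is ruled out.
    [ "$joiner_seqno" -ge 0 ] || return 1
    [ $((joiner_seqno + 1)) -ge "$donor_cached_downto" ]
}

# Made-up numbers: the donor's gcache starts at seqno 1200.
if ist_possible 1500 1200; then echo "IST"; else echo "SST"; fi   # prints: IST
if ist_possible 1000 1200; then echo "IST"; else echo "SST"; fi   # prints: SST
```

A larger gcache.size pushes the donor's lowest cached seqno further back, which is why it widens the window in which a stopped node can rejoin via IST.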

Scenario 2

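Nodes A and B are gracefully stopped. As in the previous case, the cluster size shrinks, this time to 1, so even the single remaining node C forms a primary component and keeps serving client requests. To get the nodes back into the cluster, you just need to start them. However, node C will switch to the "Donor/Desynced" state, because it has to provide a state transfer to at least the first joining node. It can still be read from and written to during that process, but it may be much slower, depending on how large a state transfer it has to send. Also, some load balancers may consider the donor node non-operational and remove it from the pool. So it is best to avoid situations where only one node is up.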

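Note, though, that if you restart A and then B in that order, you may want to make sure B does not use A as its state transfer donor, because A may not have all the needed writesets in its gcache. So just specify node C as the donor this way (the "nodeC" name is the one you set with the wsrep_node_name variable):

service mysql start --wsrep_sst_donor=nodeC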

Scenario 3

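All three nodes are gracefully stopped. The whole cluster is now unavailable, and the question is how to initialize it again. The key point is that during a clean shutdown a PXC node writes its last executed position into the grastate.dat file. Comparing the seqno in each node's grastate.dat shows which node is the most advanced one (most likely the last one stopped). The cluster must be bootstrapped from this node; otherwise, nodes with a more advanced position would have to perform a full SST to join a cluster initialized from a less advanced one, and some transactions would be lost. To bootstrap the first node, invoke the startup script like this:

/etc/init.d/mysql bootstrap-pxc

or

service mysql bootstrap-pxc

or

service mysql start --wsrep_new_cluster

or

service mysql start --wsrep-cluster-address="gcomm://"

or, in packages that use the systemd service manager (CentOS 7 at the time of writing):

systemctl start mysql@bootstrap.service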

In older PXC versions, bootstrapping the cluster meant editing my.cnf, replacing the existing wsrep_cluster_address line with an empty value (wsrep_cluster_address=gcomm://), and then starting mysql normally. More details can be found in the article linked from the original post.

Please note that even if you bootstrap from the most advanced node, the other nodes, which have a lower sequence number, will still have to join via full SST, because the Galera cache is not retained on restart. For that reason, it is recommended to stop writes to the cluster before its full shutdown, so that all nodes stop at the same position. Edit: this has changed since Galera 3.19, thanks to the gcache.recover option.
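Comparing seqno values across nodes, as described for this scenario, can be sketched as follows. This is a minimal illustration under assumptions: the helper names grastate_seqno and most_advanced are made up, the sample files and seqno values are invented, and in practice you would first copy each node's /var/lib/mysql/grastate.dat to one place and confirm that the uuid lines match.

```shell
#!/bin/sh
# Print the seqno recorded in one grastate.dat file (hypothetical helper).
grastate_seqno() {
    awk '$1 == "seqno:" { print $2 }' "$1"
}

# Given NAME=FILE pairs, print "NAME SEQNO" for the highest seqno found.
most_advanced() {
    best_name=""
    best_seqno=""
    for pair in "$@"; do
        name=${pair%%=*}
        file=${pair#*=}
        seqno=$(grastate_seqno "$file")
        [ -n "$seqno" ] || continue   # skip files that have no seqno line
        if [ -z "$best_seqno" ] || [ "$seqno" -gt "$best_seqno" ]; then
            best_name=$name
            best_seqno=$seqno
        fi
    done
    echo "$best_name $best_seqno"
}

# Demo with made-up grastate.dat copies collected from three nodes.
tmp=$(mktemp -d)
for entry in nodeA:1045 nodeB:1052 nodeC:1060; do
    printf 'version: 2.1\nuuid: 220dcdcb-1629-11e4-add3-aec059ad3734\nseqno: %s\n' \
        "${entry#*:}" > "$tmp/${entry%%:*}"
done
most_advanced nodeA="$tmp/nodeA" nodeB="$tmp/nodeB" nodeC="$tmp/nodeC"   # prints: nodeC 1060
rm -rf "$tmp"
```

If every file reports seqno -1, the nodes were not shut down cleanly and this comparison tells you nothing; that case needs position recovery instead.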

Scenario 4

Node A disappears from the cluster (power outage, hardware failure, kernel panic, mysqld crash, kill -9 on the mysqld pid, OOM killer). Nodes B and C notice that the connection to node A is down and try to reconnect. After the reconnection attempts time out, both agree that node A is really down and remove it "officially" from the cluster. Quorum is preserved (2 out of 3 nodes are up), so no service disruption happens. After restarting, A will rejoin automatically, the same way as in scenario 1.
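The "2 out of 3" reasoning is a strict-majority check, which can be sketched as below. This is a simplification under stated assumptions: every node has the default pc.weight of 1, the second argument is the size of the last primary component, and the function name has_quorum is invented for illustration. Galera's real calculation is weighted and also accounts for nodes that left gracefully.

```shell
#!/bin/sh
# has_quorum REACHABLE PREVIOUS_SIZE   (hypothetical helper)
# Succeeds when the reachable nodes are a strict majority of the last
# primary component, assuming every node carries the same weight.
has_quorum() {
    reachable=$1
    previous_size=$2
    [ $((reachable * 2)) -gt "$previous_size" ]
}

# 2 of 3 nodes still see each other: the cluster keeps running.
if has_quorum 2 3; then echo "primary"; else echo "non-primary"; fi   # prints: primary
# 1 of 3 nodes left after two crashes: no quorum.
if has_quorum 1 3; then echo "primary"; else echo "non-primary"; fi   # prints: non-primary
# An even 3/3 split of six nodes: exactly half is not a majority.
if has_quorum 3 6; then echo "primary"; else echo "non-primary"; fi   # prints: non-primary
```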

Scenario 5

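Nodes A and B both disappear from the cluster. Node C is not able to form a quorum alone, so the cluster switches into non-primary mode and refuses all application requests. In this state the mysqld process on node C is still running and you can connect to it, but any statement that touches data fails:

mysql> select * from test.t1;
ERROR 1047 (08S01): Unknown command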

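In fact, for the first few seconds after A and B disappear, node C can still answer some read queries; once it realizes that it cannot reach A and B, reads fail as well. New writes, on the other hand, are refused immediately, thanks to certification-based replication. This is what we are going to see in the remaining node’s log: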

140814 0:42:13 [Note] WSREP: commit failed for reason: 3
140814 0:42:13 [Note] WSREP: conflict state: 0
140814 0:42:13 [Note] WSREP: cluster conflict due to certification failure for threads:
140814 0:42:13 [Note] WSREP: Victim thread:
THD: 7, mode: local, state: executing, conflict: cert failure, seqno: -1
SQL: insert into t values (1)
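To confirm that the surviving node really has dropped out of the primary component, you can look at the wsrep_cluster_status status variable. The sketch below only parses tab-separated name/value output such as the mysql client produces with its -N and -B options; the helper name wsrep_value and the sample values are made up, and connection options are left out.

```shell
#!/bin/sh
# wsrep_value NAME   (hypothetical helper)
# Reads "name<TAB>value" lines on stdin and prints the value for NAME.
wsrep_value() {
    awk -F '\t' -v name="$1" '$1 == name { print $2 }'
}

# Against a live node this could be fed from the mysql client, for example:
#   mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'wsrep_%'" | wsrep_value wsrep_cluster_status
# Here we use made-up sample output so the sketch runs without a server.
printf 'wsrep_cluster_size\t1\nwsrep_cluster_status\tnon-Primary\nwsrep_ready\tOFF\n' \
    | wsrep_value wsrep_cluster_status   # prints: non-Primary
```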

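The single node C then waits for its peers to show up again. In some cases, for example when there was only a network outage and the other nodes were up the whole time, the cluster re-forms automatically once they reappear.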

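Likewise, if nodes B and C were only cut off from the first node by the network but can still reach each other, they keep functioning, because they still form a quorum. If A and B crashed (due to data inconsistency, a bug, etc.) or went off because of a power outage, you need to take manual action to enable the primary component on node C before you can bring A and B back. This way we tell node C: "Hey, you can now form a new cluster alone, forget A and B!". The command to do this is:

SET GLOBAL wsrep_provider_options='pc.bootstrap=true';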

However, you should double check in order to be very sure the other nodes are really down before doing that! Otherwise, you will most likely end up with two clusters having different data.

Scenario 6

All nodes went down without a proper shutdown procedure.

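This can happen when, for example, the datacenter loses power, or a MySQL or Galera bug crashes all nodes. It can also result from compromised data consistency, where the cluster detects that each node has different data and every node shuts down. In each of those cases, the grastate.dat file is not updated and does not contain a valid sequence number (seqno). It may look like this: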

cat /var/lib/mysql/grastate.dat
# GALERA saved state
version: 2.1
uuid: 220dcdcb-1629-11e4-add3-aec059ad3734
seqno: -1
cert_index:

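In this case we cannot be sure that all nodes are consistent with each other, so it is crucial to find the most advanced node and bootstrap the cluster from it. Before starting the mysql daemon on any node, you have to extract its last sequence number by checking its transactional state. You can do it this way: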

[root@percona3 ~]# mysqld_safe --wsrep-recover
140821 15:57:15 mysqld_safe Logging to '/var/lib/mysql/percona3_error.log'.
140821 15:57:15 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
140821 15:57:15 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.6bUIqM' --pid-file='/var/lib/mysql/percona3-recover.pid'
140821 15:57:17 mysqld_safe WSREP: Recovered position 4b83bbe6-28bb-11e4-a885-4fc539d5eb6a:2
140821 15:57:19 mysqld_safe mysqld from pid file /var/lib/mysql/percona3.pid ended

So the last committed transaction sequence number on this node was 2. Now you just need to bootstrap from the most advanced node first and then start the others.
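When you repeat the recovery on every node, pulling the seqno out of the "Recovered position" line by hand gets tedious. A minimal sketch of that extraction follows; the helper name recovered_seqno is made up, and it assumes the line has the same shape as in the output above, with UUID:seqno as its last field.

```shell
#!/bin/sh
# recovered_seqno   (hypothetical helper)
# Reads recovery output on stdin and prints the seqno from the
# "WSREP: Recovered position <uuid>:<seqno>" line.
recovered_seqno() {
    awk '/WSREP: Recovered position/ {
        n = split($NF, parts, ":")   # last field looks like <uuid>:<seqno>
        print parts[n]
    }'
}

# Demo using two lines copied from the output shown above.
printf '%s\n' \
    "140821 15:57:15 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql" \
    "140821 15:57:17 mysqld_safe WSREP: Recovered position 4b83bbe6-28bb-11e4-a885-4fc539d5eb6a:2" \
    | recovered_seqno   # prints: 2
```

The node that reports the highest number is the one to bootstrap from, provided the UUID part is the same on all nodes.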

However, the above procedure is not needed in recent Galera versions (3.6+?), available since PXC 5.6.19. There is a new option, pc.recovery (enabled by default), which saves the cluster state into a file named gvwstate.dat on each member node. As the option name suggests (pc stands for primary component), it saves the state only while the cluster is in the PRIMARY state. The content of that file may look like this:

cat /var/lib/mysql/gvwstate.dat
my_uuid: 76de8ad9-2aac-11e4-8089-d27fd06893b9
#vwbeg
view_id: 3 6c821ecc-2aac-11e4-85a5-56fe513c651f 3
bootstrap: 0
member: 6c821ecc-2aac-11e4-85a5-56fe513c651f 0
member: 6d80ec1b-2aac-11e4-8d1e-b2b2f6caf018 0
member: 76de8ad9-2aac-11e4-8089-d27fd06893b9 0
#vwend

Above we can see a three-node cluster with all members up. Thanks to this new feature, after a power outage in our datacenter the nodes will read the last saved state on startup and will try to restore the primary component once all the members can see each other again. This lets the PXC cluster recover automatically from being powered down, without any manual intervention! In the logs we will see:

140823 15:28:55 [Note] WSREP: restore pc from disk successfully
(...)
140823 15:29:59 [Note] WSREP: declaring 6c821ecc at tcp://192.168.90.3:4567 stable
140823 15:29:59 [Note] WSREP: declaring 6d80ec1b at tcp://192.168.90.4:4567 stable
140823 15:29:59 [Warning] WSREP: no nodes coming from prim view, prim not possible
140823 15:29:59 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 2, memb_num = 3
140823 15:29:59 [Note] WSREP: Flow-control interval: [28, 28]
140823 15:29:59 [Note] WSREP: Received NON-PRIMARY.
140823 15:29:59 [Note] WSREP: New cluster view: global state: 4b83bbe6-28bb-11e4-a885-4fc539d5eb6a:11, view# -1: non-Primary, number of nodes: 3, my index: 2, protocol version -1
140823 15:29:59 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
140823 15:29:59 [Note] WSREP: promote to primary component
140823 15:29:59 [Note] WSREP: save pc into disk
140823 15:29:59 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = yes, my_idx = 2, memb_num = 3
140823 15:29:59 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
140823 15:29:59 [Note] WSREP: clear restored view
(...)
140823 15:29:59 [Note] WSREP: Bootstrapped primary 00000000-0000-0000-0000-000000000000 found: 3.
140823 15:29:59 [Note] WSREP: Quorum results:
version = 3,
component = PRIMARY,
conf_id = -1,
members = 3/3 (joined/total),
act_id = 11,
last_appl. = -1,
protocols = 0/6/2 (gcs/repl/appl),
group UUID = 4b83bbe6-28bb-11e4-a885-4fc539d5eb6a
140823 15:29:59 [Note] WSREP: Flow-control interval: [28, 28]
140823 15:29:59 [Note] WSREP: Restored state OPEN -> JOINED (11)
140823 15:29:59 [Note] WSREP: New cluster view: global state: 4b83bbe6-28bb-11e4-a885-4fc539d5eb6a:11, view# 0: Primary, number of nodes: 3, my index: 2, protocol version 2
140823 15:29:59 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
140823 15:29:59 [Note] WSREP: REPL Protocols: 6 (3, 2)
140823 15:29:59 [Note] WSREP: Service thread queue flushed.
140823 15:29:59 [Note] WSREP: Assign initial position for certification: 11, protocol version: 3
140823 15:29:59 [Note] WSREP: Service thread queue flushed.
140823 15:29:59 [Note] WSREP: Member 1.0 (percona3) synced with group.
140823 15:29:59 [Note] WSREP: Member 2.0 (percona1) synced with group.
140823 15:29:59 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 11)
140823 15:29:59 [Note] WSREP: Synchronized with group, ready for connections

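Translator's note: this means that since PXC 5.6.19 every node keeps a gvwstate.dat file recording the cluster view it last belonged to before going down. In my experiment, after I killed one node, the file on that node looked like this: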

[root@namenode-2 data]# cat gvwstate.dat 
my_uuid: 4db0c1c2-5fe0-11e8-9709-0be9bc3897c1
#vwbeg
view_id: 3 3561356d-5f11-11e8-bba0-82ddca1badfc 16
bootstrap: 0
member: 3561356d-5f11-11e8-bba0-82ddca1badfc 0
member: 4db0c1c2-5fe0-11e8-9709-0be9bc3897c1 0
member: 57b4144b-5f11-11e8-b442-47001a216bae 0
member: 6af280fd-5f0f-11e8-9deb-9fe0e283dd49 0
#vwend

On the other, healthy nodes:

[root@datanode-1 data]# cat gvwstate.dat 
my_uuid: 6af280fd-5f0f-11e8-9deb-9fe0e283dd49
#vwbeg
view_id: 3 3561356d-5f11-11e8-bba0-82ddca1badfc 17
bootstrap: 0
member: 3561356d-5f11-11e8-bba0-82ddca1badfc 0
member: 57b4144b-5f11-11e8-b442-47001a216bae 0
member: 6af280fd-5f0f-11e8-9deb-9fe0e283dd49 0
#vwend
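The two files above differ in their member lists: the killed node still remembers a four-member view, while the healthy nodes have moved on to a three-member view. A small sketch for spotting that difference is shown here. The helper names are made up, and the demo files are cut down to the member lines from the experiment above.

```shell
#!/bin/sh
# Print the member UUIDs listed in one gvwstate.dat file, one per line.
gvwstate_members() {
    awk '$1 == "member:" { print $2 }' "$1"
}

# Print members present in the first file but absent from the second.
members_only_in_first() {
    gvwstate_members "$1" | grep -vxF -e "$(gvwstate_members "$2")"
}

# Demo: member lines copied from the killed node and from a healthy node.
tmp=$(mktemp -d)
cat > "$tmp/killed" <<'EOF'
member: 3561356d-5f11-11e8-bba0-82ddca1badfc 0
member: 4db0c1c2-5fe0-11e8-9709-0be9bc3897c1 0
member: 57b4144b-5f11-11e8-b442-47001a216bae 0
member: 6af280fd-5f0f-11e8-9deb-9fe0e283dd49 0
EOF
cat > "$tmp/healthy" <<'EOF'
member: 3561356d-5f11-11e8-bba0-82ddca1badfc 0
member: 57b4144b-5f11-11e8-b442-47001a216bae 0
member: 6af280fd-5f0f-11e8-9deb-9fe0e283dd49 0
EOF
members_only_in_first "$tmp/killed" "$tmp/healthy"   # prints: 4db0c1c2-5fe0-11e8-9709-0be9bc3897c1
rm -rf "$tmp"
```

The UUID printed is the killed node's own my_uuid, which matches what the experiment shows: the survivors dropped it from their view.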

Scenario 7

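The cluster lost its primary state due to a split-brain situation. For the sake of example, let's assume the cluster is formed from an even number of nodes, six: A, B and C in one datacenter, D, E and F in another, and the network between the two datacenters is broken. Of course, the best practice is to avoid such a topology; if you really cannot have an odd number of real nodes, you can at least add an arbitrator (garbd) node or adjust the pc.weight parameter. But when split brain happens anyway, none of the separated groups can maintain quorum: all nodes must stop serving requests, and both parts of the cluster just keep trying to reconnect. If you want to restore service before the network link is restored, you can make one of the groups primary again using the same command as in scenario 5:

SET GLOBAL wsrep_provider_options='pc.bootstrap=true';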

After that, you can work on the manually restored part of the cluster, and the other half should be able to rejoin automatically using incremental state transfer (IST) once the network link is restored. But beware: if you set the bootstrap option on both of the separated parts, you will end up with two live cluster instances whose data will likely diverge. Restoring the network link in that case will not make them rejoin until the nodes are restarted and try to reconnect to the members specified in the configuration file. Then, because the Galera replication model truly cares about data consistency, once the inconsistency is detected, nodes that cannot execute a row change statement due to different data will perform an emergency shutdown, and the only way to bring them back into the cluster will be a full SST.

I hope I have covered most of the possible failure scenarios of Galera-based clusters and made the recovery procedures a bit clearer.

https://www.percona.com/blog/2014/09/01/galera-replication-how-to-recover-a-pxc-cluster/