MGR vs PXC Benchmark

This test compares MGR with PXC: it measures MGR's performance under different flow-control thresholds, and compares MGR and PXC under the same workload. The benchmark tool is sysbench 1.1.0-431660d.

The base environment is shown in the table below. Two clusters, listening on ports 3306 (MGR) and 3307 (PXC), were deployed across three test servers.

| Host     | IP            | MGR  | PXC  |
|----------|---------------|------|------|
| data-191 | 192.168.8.191 | 3306 | 3307 |
| data-219 | 192.168.8.219 | 3306 | 3307 |
| data-220 | 192.168.8.220 | 3306 | 3307 |


MySQL Group Replication: understanding Flow Control [translation]

When using MySQL Group Replication, it is possible that some members lag behind the group, due to load, hardware limitations, and so on. With this lag it becomes hard to maintain good certification behavior and to keep the number of certification failures as low as possible. The larger the applying queue, the larger the risk of conflicting with transactions that have not yet been applied (which is a problem on Multi-Primary groups).

Galera users are already familiar with this concept (i.e., flow control). The MySQL Group Replication implementation differs from Galera in two main ways:

  • the Group is never totally stalled (it never completely stops accepting writes)

  • the node having issues doesn't send flow-control messages to the rest of the group asking it to slow down

In fact, every member of the group sends the other members some statistics about its queues (applier queue and certification queue). Each node then decides whether to slow down once it realizes that another node has reached the threshold of one of these queues:

group_replication_flow_control_applier_threshold (default is 25000)
group_replication_flow_control_certifier_threshold (default is 25000)

So if group_replication_flow_control_mode is set to QUOTA on a node, and it sees that one of the other members of the cluster is lagging behind (has reached a threshold), it throttles write operations to a minimum quota. This quota is calculated from the number of transactions applied during the last second, and is then reduced further by subtracting the "over the quota" messages from the last period.
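The quota arithmetic just described can be sketched with illustrative numbers (both inputs below are invented for the example, not values read from MGR):

```shell
# Illustrative sketch of the flow-control quota arithmetic described above.
# Both inputs are hypothetical sample values, not MGR internals.
applied_last_second=1200  # transactions applied during the last second
over_quota=150            # "over the quota" messages from the last period

# The quota starts from what was actually applied, then is reduced further.
quota=$((applied_last_second - over_quota))
echo "write quota for the next period: ${quota}"   # prints 1050
```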

This means that, contrary to Galera, where the flow-control threshold is decided by the lagging node, in MySQL Group Replication the node where transactions are written checks its own flow-control threshold values and compares them against the statistics received from the other nodes to decide whether to throttle or not.

You can find more information about Group Replication flow control in Vitor's article, Zooming-in on Group Replication Performance.

How to Determine the MGR Primary Node in Single-Primary Mode (Doc ID 2214438.1)

MySQL 5.7

The primary can be identified via the global status variable group_replication_primary_member:

mysql> SELECT VARIABLE_VALUE
FROM performance_schema.global_status
WHERE VARIABLE_NAME = 'group_replication_primary_member';
+--------------------------------------+
| VARIABLE_VALUE                       |
+--------------------------------------+
| 9d7f8c28-c02c-11e6-9829-08002715584a |
+--------------------------------------+
1 row in set (0.00 sec)

In Multi-Primary mode the result is empty by default.

Combine this with the performance_schema.replication_group_members table to obtain the host name and port:

SELECT
MEMBER_HOST, MEMBER_PORT
FROM
performance_schema.replication_group_members
WHERE
MEMBER_ID = (SELECT
VARIABLE_VALUE
FROM
performance_schema.global_status
WHERE
VARIABLE_NAME = 'group_replication_primary_member');
+-------------+-------------+
| MEMBER_HOST | MEMBER_PORT |
+-------------+-------------+
| ol7         |        3306 |
+-------------+-------------+
1 row in set (0.00 sec)

Or, to get information about all members:

SELECT
MEMBER_ID,
MEMBER_HOST,
MEMBER_PORT,
MEMBER_STATE,
IF(global_status.VARIABLE_NAME IS NOT NULL,
'PRIMARY',
'SECONDARY') AS MEMBER_ROLE
FROM
performance_schema.replication_group_members
LEFT JOIN
performance_schema.global_status ON global_status.VARIABLE_NAME = 'group_replication_primary_member'
AND global_status.VARIABLE_VALUE = replication_group_members.MEMBER_ID;
+--------------------------------------+-------------+-------------+--------------+-------------+
| MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE |
+--------------------------------------+-------------+-------------+--------------+-------------+
| 9d7f8c28-c02c-11e6-9829-08002715584a | ol7         |        3306 | ONLINE       | PRIMARY     |
| f2bbb11d-c0c4-11e6-98ec-08002715584a | ol7         |        3308 | ONLINE       | SECONDARY   |
| f5bb7d78-c02c-11e6-9c56-08002715584a | ol7         |        3307 | ONLINE       | SECONDARY   |
+--------------------------------------+-------------+-------------+--------------+-------------+
3 rows in set (0.00 sec)

MySQL 8.0.2 and Later

Starting with MySQL 8.0.2, the Performance Schema has been extended, so the primary can be queried directly:

mysql> SELECT MEMBER_HOST, MEMBER_PORT
FROM performance_schema.replication_group_members
WHERE MEMBER_ROLE = 'PRIMARY';
+-------------+-------------+
| MEMBER_HOST | MEMBER_PORT |
+-------------+-------------+
| ol7         |        3306 |
+-------------+-------------+
1 row in set (0.00 sec)

Or, to get information about all members:

mysql> SELECT MEMBER_ID, MEMBER_HOST, MEMBER_PORT, MEMBER_STATE, MEMBER_ROLE, MEMBER_VERSION
FROM performance_schema.replication_group_members;
+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| 33565eab-731f-11e7-8e94-08002715584a | ol7         |        3307 | ONLINE       | SECONDARY   | 8.0.2          |
| 86034b0a-731f-11e7-9f33-08002715584a | ol7         |        3308 | ONLINE       | SECONDARY   | 8.0.2          |
| a45d3804-731e-11e7-9003-08002715584a | ol7         |        3306 | ONLINE       | PRIMARY     | 8.0.2          |
+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
3 rows in set (0.00 sec)

References
https://dev.mysql.com/doc/refman/en/group-replication-find-primary.html
https://dev.mysql.com/doc/refman/en/group-replication-options.html#sysvar_group_replication_single_primary_mode
https://dev.mysql.com/doc/refman/en/replication-group-members-table.html
http://mysqlhighavailability.com/group-replication-extending-group-replication-performance_schema-tables/
http://lefred.be/content/mysql-group-replication-who-is-the-primary-master-updated/

Throttling SQL File Execution to Avoid Flow Control

With PXC, to avoid triggering flow control when importing a .sql file, preprocess the file first:

awk '1;NR%1000==0{print "select sleep(1);"}' xxx.sql > xxx_dba.sql

The command above adds a select sleep(1); after every 1,000 lines, throttling execution to roughly 1,000 statements per second.
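To see the transformation concretely, the snippet below (file handling is invented for the demo) generates a dummy .sql file and runs the same awk one-liner over it:

```shell
# Demo of the awk throttling trick on a generated dummy file.
# The temp files are illustrative only.
src=$(mktemp); dst=$(mktemp)

# Fake a dump with 3000 single-row INSERT statements.
for i in $(seq 1 3000); do
  echo "INSERT INTO t (id) VALUES ($i);"
done > "$src"

# Same command as above: after every 1000th input line, emit a sleep.
awk '1;NR%1000==0{print "select sleep(1);"}' "$src" > "$dst"

total=$(wc -l < "$dst")                   # 3000 inserts + 3 sleeps = 3003
sleeps=$(grep -c 'select sleep' "$dst")   # 3 injected sleep statements
echo "lines: $total, injected sleeps: $sleeps"
rm -f "$src" "$dst"
```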

For a file produced by mysqldump, the dump must be taken with

mysqldump --skip-extended-insert

so that each row gets its own INSERT statement; the resulting file can then be processed with awk as above.

Using pt-fifo-split

FLAT_FILE="/tmp/big_file.txt"
FIFO_PATH="${FLAT_FILE}.fifo"
LOAD_FILE="${FLAT_FILE}.load"
CHUNK_SIZE=1000

# Split the file
pt-fifo-split --force --lines ${CHUNK_SIZE} ${FLAT_FILE} --fifo ${FIFO_PATH} &

# Sleep 10 seconds to assure ${FIFO_PATH} exists before entering the loop
sleep 10

while [ -e ${FIFO_PATH} ]
do
  # Write chunk to disk
  cat ${FIFO_PATH} > ${LOAD_FILE}
  # Load chunk into table
  mysql --database=test \
        --show-warnings \
        -vv < ${LOAD_FILE}
  sleep 1
done

Write-set Cache (GCache)

Galera Cluster stores write-sets in a special cache called the Write-set Cache (or GCache). The GCache is a memory allocator for write-sets, and its main purpose is to minimize the write-set footprint in RAM. Galera Cluster improves on this by offloading write-set storage to disk.

GCache uses three types of storage:

  • Permanent In-Memory Store: write-sets are allocated using the operating system's default memory allocator, which is useful on systems with spare RAM. The store has a hard size limit, apparently controlled by gcache.mem_size[^1]. It is disabled by default.

  • Permanent Ring-Buffer File: write-sets are pre-allocated to disk during cache initialization, and the region is reused in a circular fashion. This is intended as the main write-set store, apparently sized by gcache.size[^2].

  • On-Demand Page Store: write-sets are allocated at runtime to memory-mapped page files as needed. By default each page file is 128MB (gcache.page_size[^3]), but it can be larger if a bigger write-set needs to be stored. The total page-store size is limited only by available disk space. By default Galera Cluster deletes page files when they are not in use, but you can set a limit on the total size of the page files to keep. When all other stores are disabled, at least one page file remains present on disk.

Galera Cluster uses an allocation algorithm that attempts to store write-sets in the above order. That is, it first tries the permanent in-memory store; if there is not enough space for the write-set, it tries the permanent ring-buffer file. The page store always succeeds, unless the write-set is larger than the available disk space.

By default, the write-set cache allocates files in the working directory of the process. You can specify a dedicated location for the write-set cache with the gcache.dir parameter.
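Pulling the parameters above into one place, a my.cnf fragment might look like this (the sizes and path are examples only, and the deprecated gcache.mem_size is deliberately left out):

```ini
# Example values only -- tune for your workload and available disk space.
[mysqld]
wsrep_provider_options = "gcache.dir=/var/lib/mysql; gcache.size=128M; gcache.page_size=128M"
```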

[^1]: Deprecated in 5.6.22-25.8. This variable has been deprecated and shouldn't be used, as it could cause a node to crash.


[^2]: When a node receives a state transfer, it cannot process incoming write-sets until it finishes updating its state. Under some state-transfer methods, the sending node is also blocked. To prevent the database from falling further behind, GCache saves the incoming write-sets to disk in a memory-mapped file. This parameter defines the amount of disk space to allocate for the ring-buffer store; the node allocates this space when starting the database server.

[^3]: This variable can be used to specify the size of the page files in the page storage.


Want IST Not SST for Node Rejoins? We Have a Solution!

Krunal Bauskar | February 13, 2018 | Posted In: High-availability, MySQL, Percona XtraDB Cluster

What if we told you that there is a sure way to get a node to rejoin with IST instead of SST, and that you could guarantee the rejoining node uses IST? Sounds interesting? Read on.

Normally, when a node leaves the cluster for a while (for maintenance, or simply because it was shut down), the gcache on the other cluster nodes is used to supply the write-set(s) it missed while away, once it rejoins. This works if you have configured a large enough gcache, or if the downtime is short enough. For production environments, neither a very large gcache nor a shortened maintenance window is good enough.

Resizing the gcache on a potential DONOR node before the downtime requires shutting that node down (the gcache cannot be resized dynamically), and restoring it back to the original size needs yet another shutdown. So: "three shutdowns" for a single downtime. *No way... not acceptable with busy production clusters and the possibility of more errors.*

Introducing “gcache.freeze_purge_at_seqno”

With these pain points in mind, Percona XtraDB Cluster 5.7.20 introduces gcache.freeze_purge_at_seqno. It controls the purging of the gcache, thereby retaining more data to facilitate IST when a node rejoins.

Every transaction in the Galera cluster world is assigned a unique global sequence number (seqno), and bookkeeping happens in terms of this seqno (e.g. wsrep_last_applied, wsrep_last_committed, wsrep_replicated, wsrep_local_cached_downto, and so on). wsrep_local_cached_downto indicates the seqno down to which the gcache has been purged: assuming wsrep_local_cached_downto = N, the gcache holds data from [N, wsrep_replicated], and the data in [1, N) has been purged.
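The ranges above can be sketched with made-up seqnos (in practice you would read these values from the node's status variables):

```shell
# Hypothetical sample seqnos -- in practice read them from SHOW GLOBAL STATUS.
wsrep_local_cached_downto=5000   # N: oldest seqno still in gcache
wsrep_replicated=42000           # newest replicated seqno
wsrep_last_applied=41800         # on the node about to be shut down

# gcache holds [N, wsrep_replicated]; [1, N) has been purged.
retained=$((wsrep_replicated - wsrep_local_cached_downto + 1))
echo "write-sets retained in gcache: $retained"

# Suggested freeze point x: keep roughly a 10% margin below
# wsrep_last_applied to satisfy IST's safety-gap heuristic.
x=$((wsrep_last_applied * 9 / 10))
echo "gcache.freeze_purge_at_seqno candidate: $x"
```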

gcache.freeze_purge_at_seqno takes three values:

1. **-1** (default): no freeze; the purge operates as normal.
2. **x (should be a valid seqno in gcache):** freeze purge of write-sets >= x. The best way to select x is to use the wsrep_last_applied value from the node that you plan to shut down as an indicator (e.g. x = wsrep_last_applied * 0.9; retain this extra 10% to trick the [safety gap heuristic algorithm of IST](https://www.percona.com/blog/2017/11/15/understanding-ist-donor-selected/)).
3. **now:** freeze purge of write-sets >= the smallest seqno currently in gcache; an instant freeze of gcache purging. (If tracking x as above is difficult, simply use "now" and you are good.)

Set this on an existing node of the cluster (one that will continue to be part of the cluster and can act as a potential DONOR). That node then keeps retaining write-sets, allowing the restarting node to rejoin using IST. (You can feed the said node as a preferred DONOR through wsrep_sst_donor while restarting the rejoining node.)

Remember to set it back to -1 once the node has rejoined. This avoids the extra footprint on the DONOR beyond the said timeline; on the next purge cycle, all the old retained write-sets are freed as well, reclaiming the space back to the original state.

Note:

To find the existing value of gcache.freeze_purge_at_seqno, query wsrep_provider_options:

select @@wsrep_provider_options;

To set gcache.freeze_purge_at_seqno:

set global wsrep_provider_options="gcache.freeze_purge_at_seqno = now";
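Putting it together, one possible planned-restart sequence looks like the following dry-run sketch (it only prints the statements to run by hand; the donor name node1 is invented):

```shell
# Dry-run sketch of a planned restart using gcache.freeze_purge_at_seqno.
# Nothing is executed against MySQL; the donor name is illustrative.
donor="node1"

sql=$(cat <<EOF
-- 1. On a surviving node that will act as the DONOR:
SET GLOBAL wsrep_provider_options = 'gcache.freeze_purge_at_seqno = now';
-- 2. Restart the maintenance node with the donor pinned, e.g. in its my.cnf:
--    wsrep_sst_donor = ${donor}
-- 3. Once the node has rejoined via IST, unfreeze purging on the DONOR:
SET GLOBAL wsrep_provider_options = 'gcache.freeze_purge_at_seqno = -1';
EOF
)
echo "$sql"
```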

Why should you use it?

  • The gcache grows dynamically (using the existing page-store mechanism) and shrinks once the user sets the value back to -1, so you consume extra disk space only when needed.
  • No restart is needed; the user can focus solely on the node that requires maintenance.
  • No complex math or understanding of seqno is involved (simply use "now").
  • Less prone to error, as SST is one of the major error-prone areas with the cluster.

So why wait? Give it a try! It is part of Percona XtraDB Cluster 5.7.20 onwards, and helps you get IST, not SST, for node rejoins.

Note: If you need more information about gcache, check here and here


Copyright © 2013 - 2018 Fan() All Rights Reserved.