DeltaScope: An Offline-First SQL Audit and Migration Risk Checker

Why Do We Need SQL Auditing?

Database changes (DDL/DML) are a leading source of production incidents. A DELETE missing its WHERE clause, an UPDATE without a LIMIT, or a destructive ALTER TABLE can each cause irreversible data loss or a service outage. Catching a SQL mistake before execution costs almost nothing; repairing it afterward is enormously expensive.

DeltaScope was built to solve exactly this problem.

What Is DeltaScope?

DeltaScope is an offline-first SQL audit and migration risk checker supporting three database dialects: MySQL, TiDB, and PostgreSQL. It gives DBAs, application engineers, CI pipelines, and AI agents a unified capability for reviewing SQL changes.

Core design principle: every rule works without a database connection, and metadata-aware rules degrade gracefully when no snapshot is available.
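To make the degradation idea concrete, here is a minimal sketch of what a metadata-aware rule could look like. Everything below (the Snapshot and Finding types, the rule function, the thresholds) is hypothetical, invented for illustration; DeltaScope's actual API may look entirely different.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical types, invented for illustration; not DeltaScope's real API.
@dataclass
class Snapshot:
    """Offline metadata snapshot: table name -> estimated row count."""
    row_counts: dict

@dataclass
class Finding:
    severity: str  # "blocker" | "warning" | "notice"
    rule_id: str
    message: str

def check_large_table_alter(table: str, snapshot: Optional[Snapshot]) -> Finding:
    """A metadata-aware rule: flag ALTERs against large tables.

    With a snapshot it can judge severity; without one it degrades
    gracefully to a generic notice instead of failing.
    """
    if snapshot is None:
        return Finding("notice", "ddl.alter.large_table",
                       f"no snapshot: cannot verify size of {table}, review manually")
    rows = snapshot.row_counts.get(table, 0)
    if rows > 1_000_000:
        return Finding("warning", "ddl.alter.large_table",
                       f"{table} has ~{rows} rows; ALTER may block for a long time")
    return Finding("notice", "ddl.alter.large_table", f"{table} is small enough")
```

The point is the None branch: the rule still returns a useful finding instead of erroring out when no snapshot exists.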

Audit in Practice: What Can DeltaScope Catch?

MySQL: A Full CREATE TABLE Audit

The CREATE TABLE below steps on nearly every landmine: a reserved-word table name, the MyISAM engine, the latin1 character set, a foreign key, the discouraged TEXT/JSON/BIT/TIMESTAMP types, a composite primary key, duplicate indexes, and more. A single statement triggers 20+ findings:

CREATE TABLE `select` (
`select` INT,
id INT,
very_long_column_name_over_sixty_four_characters_for_rule_coverage_0001 VARCHAR(16),
bad_varchar VARCHAR(32),
bad_float DOUBLE,
bad_text TEXT,
bad_json JSON,
bad_bit BIT(1),
bad_ts TIMESTAMP,
bad_char CHAR(4),
bad_charset VARCHAR(16) CHARACTER SET latin1,
bad_collation VARCHAR(16) COLLATE latin1_swedish_ci,
no_default INT NOT NULL,
nullable_col INT DEFAULT 0,
CONSTRAINT badpk PRIMARY KEY (id, bad_varchar),
CONSTRAINT baduk UNIQUE KEY (bad_varchar),
CONSTRAINT badfk FOREIGN KEY (id) REFERENCES parent(id),
CONSTRAINT badcheck CHECK (id > 0),
UNIQUE KEY baduniq (bad_varchar),
KEY badidx (bad_varchar, no_default),
KEY baddup (bad_varchar),
KEY badleft (bad_varchar),
FULLTEXT KEY badfull (bad_text)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 COLLATE=latin1_swedish_ci
ROW_FORMAT=COMPACT AUTO_INCREMENT=9 COMMENT='comment too long';

DeltaScope output (key findings excerpted):

Verdict: reject
Statements: 1
Blockers: 9
Warnings: 12
Notices: 0

Statement 1: CREATE TABLE
- [blocker] ddl.table.name.keyword.forbid: table name "select" is a reserved keyword
- [blocker] ddl.table.engine.allowlist: engine MyISAM is not in allowlist [InnoDB]
- [blocker] ddl.table.charset.allowlist: charset latin1 is not in allowlist [utf8mb4]
- [blocker] ddl.table.foreign_key.forbid: foreign keys are forbidden
- [blocker] ddl.table.primary_key.not_null.require: primary key column "bad_varchar" is nullable
- [blocker] ddl.column.name.keyword.forbid: column name "select" is a reserved keyword
- [blocker] ddl.column.name.max_length: column name exceeds 64 characters
- [blocker] ddl.column.charset.allowlist: column charset latin1 is not in allowlist [utf8mb4]
- [blocker] ddl.column.varchar.max_length: VARCHAR(32) exceeds max length limit
- [warning] ddl.column.blob_text.forbid: TEXT type is forbidden
- [warning] ddl.column.json.forbid: JSON type is forbidden
- [warning] ddl.column.bit.forbid: BIT type is forbidden
- [warning] ddl.column.timestamp.forbid: TIMESTAMP type is forbidden
- [warning] ddl.column.float_double.forbid: DOUBLE type is forbidden
- [warning] ddl.column.comment.require: column must have a comment
- [warning] ddl.column.default.require: column "no_default" should define a default value
- [warning] ddl.table.audit_columns.require: missing audit columns (created_at, updated_at)
- [warning] ddl.table.comment.max_length: table comment exceeds max length limit
- [warning] ddl.index.duplicate.forbid: index "baddup" duplicates "badleft" on (bad_varchar)
- [warning] ddl.index.redundant_left_prefix.forbid: index is left-prefix redundant
- [warning] ddl.constraint.primary_key.name.prefix.require: constraint "badpk" missing prefix "pk_"

One CREATE TABLE, 9 blockers + 12 warnings, covering seven rule categories: naming conventions, engine/charset, column type restrictions, primary key semantics, index redundancy, constraint naming, and audit columns. This is what DeltaScope offers: not just checking whether a comment exists, but governing DDL end to end.
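The Verdict line in the report suggests a simple aggregation over finding severities. A hedged guess at that policy (the severity names match the output above, but the function itself is my own illustration, not DeltaScope's code):

```python
def aggregate_verdict(severities):
    """Assumed policy, inferred from the report format above:
    any blocker rejects the change outright; warnings alone still pass
    but deserve review; only notices (or nothing) is a clean pass."""
    if "blocker" in severities:
        return "reject"
    if "warning" in severities:
        return "warn"
    return "pass"
```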

The Data-Loss Problem Caused by Orchestrator's "Unreasonable" Latest-Slave Selection Logic

Up front: "unreasonable" is only my personal conclusion. Whether it truly is unreasonable is not for me to decide.
From what I know of Orchestrator, its goal is availability first, not data integrity. Plenty of companies run Orchestrator, and I suspect many are unaware of this issue; or, when pressed, the answer is simply "we prioritize availability."

The problem is that even with semi-synchronous replication enabled, data can still be lost.

The contradiction is that replication lag in production is something everyone monitors and manages; clusters do not sit in a high-lag state for long. At least at every company I have worked at, 99.9% of clusters keep replication lag under 1s. A few clusters climb slightly at peak hours but drop back quickly, or those clusters are AP-style workloads to begin with.

So, given that we can keep replication lag under 1s, how much would recovery time actually grow if we did not pick the latest slave by ReadBinlogCoordinates? And for those X seconds of faster recovery, how much time will you spend repairing data afterward?
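The trade-off can be illustrated with a toy model of the selection itself. This is a simplification of the idea, not orchestrator's actual ReadBinlogCoordinates code, and the replica dicts are invented for illustration:

```python
# A toy model of "latest slave" selection, NOT orchestrator's actual code.

def binlog_coords(replica):
    """Order replicas by (binlog file index, position) read from the master."""
    file_index = int(replica["master_log_file"].rsplit(".", 1)[1])  # "mysql-bin.000042" -> 42
    return (file_index, replica["read_master_log_pos"])

def pick_latest_slave(replicas):
    """Pick the replica that has READ the most of the dead master's binlog.

    The risk discussed above: having read the most is not the same as
    having APPLIED the most, so promoting this replica can still discard
    transactions that another sibling already executed.
    """
    return max(replicas, key=binlog_coords)
```

For example, a replica on file mysql-bin.000043 wins over one on mysql-bin.000042 regardless of position, because the file index is compared first.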

Identical server_id values cause graceful-master-takeover-auto to fail

Background

A colleague needed to migrate a database. The plan: add replicas, perform a master switchover, then retire the old master.
The cluster was originally one master with two replicas; three replicas were added, making it one master with five replicas.

The switchover command failed (the commands below reproduce the issue in a local test environment):

resources/bin/orchestrator-client -c graceful-master-takeover-auto -alias orc_prod_infra_broker -d 172.20.0.14:3306 -b 'superuser:superpass'
Desginated instance 172.20.0.14:3306 cannot take over all of its siblings. Error: 2022-09-02 15:33:27 ERROR Relocating 2 replicas of 172.20.0.10:3306 below 172.20.0.14:3306 turns to be too complex; please do it manually
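As the post title says, the failure traces back to replicas sharing the same server_id. A simple pre-flight check can be sketched in Python; the input mapping here is hypothetical (in practice each value would come from running SELECT @@server_id on the instance):

```python
def find_duplicate_server_ids(server_ids):
    """server_ids: mapping of "host:port" -> its @@server_id value.
    Returns {server_id: [hosts...]} for every id used more than once."""
    by_id = {}
    for host, sid in server_ids.items():
        by_id.setdefault(sid, []).append(host)
    return {sid: hosts for sid, hosts in by_id.items() if len(hosts) > 1}
```

Running this across a cluster before a takeover would surface the collision immediately, instead of letting the relocation fail mid-switchover.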

Orchestrator Failover Source Code Analysis, Part III

Continuing from Orchestrator Failover Source Code Analysis, Part II.

GetCandidateReplica

// RegroupReplicasGTID will choose a candidate replica of a given instance, and take its siblings using GTID
func RegroupReplicasGTID(
masterKey *InstanceKey, // actual argument: the dead old master
returnReplicaEvenOnFailureToRegroup bool, // actual argument: true
startReplicationOnCandidate bool, // actual argument: false
onCandidateReplicaChosen func(*Instance), // actual argument: nil
postponedFunctionsContainer *PostponedFunctionsContainer,
postponeAllMatchOperations func(*Instance, bool) bool, // actual argument: the promotedReplicaIsIdeal function
)

RegroupReplicasGTID will choose a candidate replica of a given instance, and take its siblings using GTID
One simple English sentence that resists a tidy translation. My reading: RegroupReplicasGTID picks a candidate from the replicas of the target instance (i.e. the DeadMaster), promotes it to the new master, and moves all of its siblings under it.

To understand RegroupReplicasGTID, we first have to look at GetCandidateReplica, which it calls.

Orchestrator Failover Source Code Analysis, Part II

Continuing from Orchestrator Failover Source Code Analysis, Part I.

The DeadMaster recovery flow

First, getCheckAndRecoverFunction is called to obtain the "checkAndRecoverFunction":

func getCheckAndRecoverFunction(analysisCode inst.AnalysisCode, analyzedInstanceKey *inst.InstanceKey) (
checkAndRecoverFunction func(analysisEntry inst.ReplicationAnalysis, candidateInstanceKey *inst.InstanceKey, forceInstanceRecovery bool, skipProcesses bool) (recoveryAttempted bool, topologyRecovery *TopologyRecovery, err error),
isActionableRecovery bool,
) {
switch analysisCode {
// master
case inst.DeadMaster, inst.DeadMasterAndSomeReplicas: // if analysisCode is DeadMaster or DeadMasterAndSomeReplicas
if isInEmergencyOperationGracefulPeriod(analyzedInstanceKey) { // first, check whether we are within the EmergencyOperationGracefulPeriod
return checkAndRecoverGenericProblem, false // if so, this effectively does nothing and waits for the next recoverTick
} else {
return checkAndRecoverDeadMaster, true
}

This first checks isInEmergencyOperationGracefulPeriod:

func isInEmergencyOperationGracefulPeriod(instanceKey *inst.InstanceKey) bool {
_, found := emergencyOperationGracefulPeriodMap.Get(instanceKey.StringCode()) // emergencyOperationGracefulPeriodMap is a cache whose entries expire (a TTL cache)
return found
}

In effect it looks the instance up in the emergencyOperationGracefulPeriodMap cache. So the question becomes: who puts the instance into this cache, and when?
It happens while the master is in the UnreachableMaster state, when executeCheckAndRecoverFunction calls runEmergentOperations.
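The pattern behind emergencyOperationGracefulPeriodMap is easy to model: a set whose members expire, used as a "we just acted on this instance, back off" marker. A generic Python sketch of the pattern follows (orchestrator itself uses a Go TTL cache; this is an illustration, not its code):

```python
import time

class TTLSet:
    """A set whose members expire after ttl_seconds; a simplified analogue
    of orchestrator's emergencyOperationGracefulPeriodMap."""
    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._expiry = {}  # key -> absolute deadline

    def add(self, key):
        # analogous to runEmergentOperations registering the instance
        self._expiry[key] = self.clock() + self.ttl

    def __contains__(self, key):
        # analogous to isInEmergencyOperationGracefulPeriod
        deadline = self._expiry.get(key)
        if deadline is None:
            return False
        if self.clock() >= deadline:
            del self._expiry[key]  # lazily evict expired entries
            return False
        return True
```

While the instance is still in the set, recovery is skipped; once the entry expires, the next recoverTick is free to act.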

Orchestrator Failover Source Code Analysis, Part I

Simulating a failure

In a test environment, simulate a failure of the 3307 cluster:

Role     IP             Port  Hostname
Master   172.16.120.10  3307  centos-1
Replica  172.16.120.11  3307  centos-2
Replica  172.16.120.12  3307  centos-3

Shut down the 3307 master, 172.16.120.10:3307:

[2022-04-25 13:10:56][root@centos-1 13:10:56 ~]
[2022-04-25 13:11:22]#systemctl stop mysql3307


mysql log:
2022-04-25T13:11:35.959667+08:00 0 [Note] /usr/local/mysql5732/bin/mysqld: Shutdown complete

Orchestrator Discover Source Code Analysis

While working through the two parameters HostnameResolveMethod and MySQLHostnameResolveMethod I ran into some confusion, so I dug into the orchestrator source code; this post records what I found.

orchestrator-client

Reading orchestrator-client shows that "service discovery" can be performed by running:

orchestrator-client -c discover -i 172.16.120.10:3306

Assume the hostname of 172.16.120.10 is centos-1.

orchestrator-client is a script that wraps the API calls in a convenient command-line interface.

It can automatically determine the leader of the orchestrator cluster and, in that case, forward all requests to the leader.

It closely mirrors the orchestrator command line interface.

orchestrator-client -help has a bug; I have submitted a PR for it.

The orchestrator-client help output also does not document the -i parameter.

Looking at the orchestrator-client source:

while getopts "c:i:d:s:a:D:U:o:r:u:R:t:l:H:P:q:b:e:n:S:h" OPTION
do
case $OPTION in
h) command="help" ;;
c) command="$OPTARG" ;;
i) instance="$OPTARG" ;;

As you can see, the value of -i goes into the instance variable. The main function then processes instance first:

function main {
check_requirements
detect_leader_api

instance_hostport=$(to_hostport $instance)
destination_hostport=$(to_hostport $destination)

run_command
}

High CPU Usage by the MySQL Page Cleaner

Background

As is well known, Seconds_Behind_Master cannot accurately reflect replication lag. The industry's answer is to create a lag-monitoring table, update its timestamp column periodically (typically every second), and take the difference between the current time and that column as the lag. Percona's pt-heartbeat is the classic example, and TiDB DM monitors sync lag the same way.
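The measurement itself is just a timestamp difference: read the heartbeat row on the replica and subtract. A minimal sketch (the function and its arguments are mine, not pt-heartbeat's code):

```python
from datetime import datetime

def replication_lag_seconds(heartbeat_ts: datetime, now: datetime) -> float:
    """heartbeat_ts is the timestamp read FROM THE REPLICA's copy of the
    heartbeat table. The master rewrites it every second, so on a replica
    it trails the wall clock by roughly the replication delay."""
    return (now - heartbeat_ts).total_seconds()
```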

Here we run both master-replica and MGR clusters. To monitor lag better, the DBAs wrote a Python script that fetches every cluster's ProxySQL nodes from the CMDB, connects to ProxySQL, and updates the dbms_monitor.monitor_delay table once per second. The table structure and the statement executed:

root@localhost 13:23:42 [dbms_monitor]> show create table monitor_delay\G
*************************** 1. row ***************************
Table: monitor_delay
Create Table: CREATE TABLE `monitor_delay` (
`id` tinyint(3) unsigned NOT NULL,
`ctime` datetime NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci
1 row in set (0.00 sec)



The timestamp update statement:
REPLACE INTO dbms_monitor.monitor_delay(id) VALUES(1)

To guarantee an update "every second," the script handles timeouts: if the statement against some cluster times out (connection timeout, statement timeout, etc.), it raises an exception, sleeps for a while (that sleep plus the time already spent this iteration equals 1s), and moves on to the next iteration.
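The timing arithmetic in that loop can be sketched as follows; execute is a stand-in for the real ProxySQL call, and this is a simplified reconstruction of the described behavior, not the actual DBA script:

```python
import time

HEARTBEAT_SQL = "REPLACE INTO dbms_monitor.monitor_delay(id) VALUES(1)"
INTERVAL = 1.0  # target: exactly one heartbeat per second per cluster

def remaining_sleep(elapsed, interval=INTERVAL):
    """Sleep just long enough that elapsed + sleep == interval (never negative)."""
    return max(0.0, interval - elapsed)

def heartbeat_loop(execute, iterations):
    """execute(sql, timeout) is assumed to raise on connect/statement timeout."""
    for _ in range(iterations):
        start = time.monotonic()
        try:
            execute(HEARTBEAT_SQL, timeout=INTERVAL)
        except Exception:
            pass  # timed out or failed; just wait out the rest of this second
        time.sleep(remaining_sleep(time.monotonic() - start))
```

If an iteration already took longer than a second, remaining_sleep clamps to zero so the loop starts the next heartbeat immediately rather than sleeping a negative amount.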

A "Subquery returns more than 1 row" Bug with MySQL 8.0 MGR

This is a bug encountered on Percona Server 8.0.22 running MGR with ProxySQL.

While backing up the MGR cluster with mydumper, ProxySQL began logging errors:

# cat proxysql.log | grep -i subquery
2021-04-28 15:14:22 MySQL_HostGroups_Manager.cpp:3875:update_group_replication_set_offline(): [WARNING] Group Replication: setting host 172.16.23.224:3309 offline because: Subquery returns more than 1 row
2021-04-28 15:14:22 MySQL_Monitor.cpp:1472:monitor_group_replication_thread(): [ERROR] Got error. mmsd 0x7fd62cee3540 , MYSQL 0x7fd62b804600 , FD 39 : Subquery returns more than 1 row
2021-04-28 15:14:27 MySQL_Monitor.cpp:1472:monitor_group_replication_thread(): [ERROR] Got error. mmsd 0x7fd62cee1440 , MYSQL 0x7fd62b800000 , FD 41 : Subquery returns more than 1 row
2021-04-28 15:17:37 MySQL_HostGroups_Manager.cpp:3875:update_group_replication_set_offline(): [WARNING] Group Replication: setting host 172.16.23.151:3309 offline because: Subquery returns more than 1 row
2021-04-28 15:17:37 MySQL_Monitor.cpp:1472:monitor_group_replication_thread(): [ERROR] Got error. mmsd 0x7fd62cee78c0 , MYSQL 0x7fd62bc06400 , FD 51 : Subquery returns more than 1 row
2021-04-28 15:17:42 MySQL_Monitor.cpp:1472:monitor_group_replication_thread(): [ERROR] Got error. mmsd 0x7fd62cee0300 , MYSQL 0x7fd62bc00000 , FD 40 : Subquery returns more than 1 row

Basic Usage of skeema

Introduction

Skeema is a tool for managing MySQL tables and schema changes in a declarative fashion using pure SQL. It provides a CLI tool allowing you to:

  • Export CREATE TABLE statements to the filesystem, for tracking in a repo (git, hg, svn, etc)
  • Diff changes in the schema repo against live DBs to automatically generate DDL
  • Manage multiple environments (e.g. dev, staging, prod) and keep them in sync with ease
  • Configure use of online schema change tools, such as pt-online-schema-change, for performing ALTERs
  • Convert non-online migrations from frameworks like Rails or Django into online schema changes in production

Skeema supports a pull-request-based workflow for schema change submission, review, and execution. This permits your team to manage schema changes in exactly the same way as you manage code changes. Our new companion Cloud Linter for GitHub repos provides automatic linting of schema change commits and pull requests.

My requirement is to sync production table structures to a drill (staging) environment; the usage notes below target that specific scenario.

Permissions

The user privileges skeema needs are documented at https://www.skeema.io/docs/requirements/

One key point: skeema diff creates a temporary database named _skeema_tmp in the target environment, then drops it when finished.



Copyright © 2013 - 2026 Fan() All Rights Reserved.