
测试背景
以下所有测试,全部基于以下复制结构完成
1 2 3 |
host_1(host_1:3306) (current master) +--host_2(host_2:3306) +--host_3(host_3:3306) |
一、MHA安装
mha node 所有服务器节点都要安装
mha manager 只需要安装在manager节点
- MHA NODE
1 2 |
yum install perl-DBD-MySQL rpm -ivh mha4mysql-node-0.56-0.el6.noarch.rpm |
- MHA Manager
1 2 3 4 5 6 |
yum install perl-DBD-MySQL yum install perl-Config-Tiny yum install perl-Log-Dispatch yum install perl-Parallel-ForkManager yum install perl-Time-HiRes rpm -ivh mha4mysql-manager-0.56-0.el6.noarch.rpm |
- MHA rpm安装路径
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 |
* Node Node> rpm -qpl mha4mysql-node-0.56-0.el6.noarch.rpm /usr/bin/apply_diff_relay_logs /usr/bin/filter_mysqlbinlog /usr/bin/purge_relay_logs /usr/bin/save_binary_logs /usr/share/man/man1/apply_diff_relay_logs.1.gz /usr/share/man/man1/filter_mysqlbinlog.1.gz /usr/share/man/man1/purge_relay_logs.1.gz /usr/share/man/man1/save_binary_logs.1.gz /usr/share/perl5/vendor_perl/MHA/BinlogHeaderParser.pm /usr/share/perl5/vendor_perl/MHA/BinlogManager.pm /usr/share/perl5/vendor_perl/MHA/BinlogPosFindManager.pm /usr/share/perl5/vendor_perl/MHA/BinlogPosFinder.pm /usr/share/perl5/vendor_perl/MHA/BinlogPosFinderElp.pm /usr/share/perl5/vendor_perl/MHA/BinlogPosFinderXid.pm /usr/share/perl5/vendor_perl/MHA/NodeConst.pm /usr/share/perl5/vendor_perl/MHA/NodeUtil.pm /usr/share/perl5/vendor_perl/MHA/SlaveUtil.pm * Manager Manager> rpm -qpl mha4mysql-manager-0.56-0.el6.noarch.rpm /usr/bin/masterha_check_repl /usr/bin/masterha_check_ssh /usr/bin/masterha_check_status /usr/bin/masterha_conf_host /usr/bin/masterha_manager /usr/bin/masterha_master_monitor /usr/bin/masterha_master_switch /usr/bin/masterha_secondary_check /usr/bin/masterha_stop /usr/share/man/man1/masterha_check_repl.1.gz /usr/share/man/man1/masterha_check_ssh.1.gz /usr/share/man/man1/masterha_check_status.1.gz /usr/share/man/man1/masterha_conf_host.1.gz /usr/share/man/man1/masterha_manager.1.gz /usr/share/man/man1/masterha_master_monitor.1.gz /usr/share/man/man1/masterha_master_switch.1.gz /usr/share/man/man1/masterha_secondary_check.1.gz /usr/share/man/man1/masterha_stop.1.gz /usr/share/perl5/vendor_perl/MHA/Config.pm /usr/share/perl5/vendor_perl/MHA/DBHelper.pm /usr/share/perl5/vendor_perl/MHA/FileStatus.pm /usr/share/perl5/vendor_perl/MHA/HealthCheck.pm /usr/share/perl5/vendor_perl/MHA/ManagerAdmin.pm /usr/share/perl5/vendor_perl/MHA/ManagerAdminWrapper.pm /usr/share/perl5/vendor_perl/MHA/ManagerConst.pm /usr/share/perl5/vendor_perl/MHA/ManagerUtil.pm /usr/share/perl5/vendor_perl/MHA/MasterFailover.pm /usr/share/perl5/vendor_perl/MHA/MasterMonitor.pm /usr/share/perl5/vendor_perl/MHA/MasterRotate.pm /usr/share/perl5/vendor_perl/MHA/SSHCheck.pm /usr/share/perl5/vendor_perl/MHA/Server.pm /usr/share/perl5/vendor_perl/MHA/ServerManager.pm |
二、MHA配置文件
- global scope
1 2 3 4 5 6 7 8 9 10 11 |
* /etc/masterha_default.cnf [server default] -- 这下面的都是全局配置,适用于所有app.cnf user=dba password=dba ssh_user=root master_binlog_dir= /data/mysql.bin secondary_check_script= masterha_secondary_check -s remote_host1 -s remote_host2 ping_interval=3 master_ip_failover_script=/script/masterha/master_ip_failover shutdown_script= /script/masterha/power_manager report_script= /script/masterha/send_master_failover_mail |
- application scope
1 2 3 4 5 6 |
* /etc/app1.cnf [server default] remote_workdir=/var/log/masterha/app1 manager_workdir=/var/log/masterha/app1 manager_log=/var/log/masterha/app1/app1.log [server default] 里面的这些配置,会影响到下面的N 个 server |
- local scope
1 2 3 4 5 6 7 8 9 10 11 |
* /etc/app1.cnf [server1] hostname=192.168.1.10 candidate_master=1 [server2] hostname=192.168.1.11 candidate_master=1 [server3] hostname=192.168.1.12 no_master=1 每个server下面的配置,专属于某个server |
三、前提要求(必须满足)
3.1 SSH公钥认证
基本上MHA manager,MNA node,以及二次检测的节点,都需要互相信任
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 |
# masterha_check_ssh --conf=/etc/app1.cnf Sat May 14 14:42:19 2011 - [warn] Global configuration file /etc/masterha_default.cnf not found. Skipping. Sat May 14 14:42:19 2011 - [info] Reading application default configurations from /etc/app1.cnf.. Sat May 14 14:42:19 2011 - [info] Reading server configurations from /etc/app1.cnf.. Sat May 14 14:42:19 2011 - [info] Starting SSH connection tests.. Sat May 14 14:42:19 2011 - [debug] Connecting via SSH from root@host1(192.168.0.1) to root@host2(192.168.0.2).. Sat May 14 14:42:20 2011 - [debug] ok. Sat May 14 14:42:20 2011 - [debug] Connecting via SSH from root@host1(192.168.0.1) to root@host3(192.168.0.3).. Sat May 14 14:42:20 2011 - [debug] ok. Sat May 14 14:42:21 2011 - [debug] Connecting via SSH from root@host2(192.168.0.2) to root@host1(192.168.0.1).. Sat May 14 14:42:21 2011 - [debug] ok. Sat May 14 14:42:21 2011 - [debug] Connecting via SSH from root@host2(192.168.0.2) to root@host3(192.168.0.3).. Sat May 14 14:42:21 2011 - [debug] ok. Sat May 14 14:42:22 2011 - [debug] Connecting via SSH from root@host3(192.168.0.3) to root@host1(192.168.0.1).. Sat May 14 14:42:22 2011 - [debug] ok. Sat May 14 14:42:22 2011 - [debug] Connecting via SSH from root@host3(192.168.0.3) to root@host2(192.168.0.2).. Sat May 14 14:42:22 2011 - [debug] ok. Sat May 14 14:42:22 2011 - [info] All SSH connection tests passed successfully. |
如果slave比较多,实例比较多,最好提高下 /etc/ssh/sshd_config MaxStartups 的值(默认是10)
3.2 操作系统
仅在Linux上测试过
3.3 单写master和多slave或者只读master
打从一开始,MHA就是为了解决数据一致性而出生,所以,最好是多个slave
如果你只有一个slave,根本就碰不到数据一致性问题,也就不需要mha了
如果是一个slave,用版同步复制也能解决
从0.52开始,MHA就支持多master的复制架构了,下面列举了多master环境下注意点
- 多master,但是只允许单点写入
- 默认情况下,只支持2层复制架构
3.4 三层或者多层复制环境
默认情况下,MHA是不支持3层或多层复制架构的(Master1 -> Master2 -> Slave3)
MHA可以恢复Master2,但是不能恢复Slave3,因为Master2,Slave3有不同的master
为了让MHA支持以上架构,可以参考如下配置:
- 在配置文件中,只配置两层(master1 and master2)
- 使用 “multi_tier_slave=1” 参数,然后设置所有hosts
3.5 MySQL版本必须是5.0 或者高于 5.0
- MySQL版本必须大于等于5.0
- 尽量使用高版本的MySQL
3.6 使用mysqlbinlog 5.1+ 支持MySQL5.1+
- MHA使用mysqlbinlog来应用日志到目标slave上的
- 如果MySQL master设置的是row格式,那么MySQL必须是大于等于5.1版本,因为5.0不支持row
- mysqlbinlog版本可以这样被检测:
1 2 |
[app@slave_host1]$ mysqlbinlog --version mysqlbinlog Ver 3.3 for unknown-linux-gnu at x86_64 |
- 如果你使用的是MySQL5.1,那么mysqlbinlog必须大于等于3.3
- 如果mysqlbinlog的版本是3.2,而mysql的版本是5.1,那么mha manager会报错,且停止monitoring
3.7 log-bin必须在候选master上开启
- 如果当前slave没有设置log-bin,那么很显然它不能成为提升为new master
- 如果没有任何机器设置了log-bin,那么mha会报错且停止failover
3.8 binlog,relay-log 主从环境必须全部一致
- 复制过滤规则(binlog-do-db, replicate-ignore-db 等等)必须全部一致
3.9 复制用户必须在候选master上要存在
- 切换完成后,所有slave都必须执行change master 命令。在new master上复制用户必须有(REPLICATEION SLAVE权限)
3.10 使用purge_relay_logs来定期删除relay logs
默认情况下,如果SQL线程执行完relay-log,relay logs就会被自动删除。
但是这些relay-logs 也许还会用来恢复其他的slave,所以你需要关闭自动删除relay-logs的purge线程,然后自己阶段性的来删除
如果是你自己来删的话,必须考虑repl 延迟问题
最好让slave删除relay log不要在同一时间点,假如需要恢复,那么这个时间点所有relay logs都被删除了就不好了
3.11 不要在SBR的环境中使用load data infile
不管是SBR,还是RBR,最好不要使用load data
四、实战演练
- [功能测试] 测试场景一、手工操作,手动完成,failover
- [功能测试] 测试场景二、手工操作, 自动完成,failover
- [功能测试] 测试场景三、手工操作,自动完成,online master switch
- [功能测试] 测试场景四、自动监控,自动操作,自动完成,failover
- [用例测试] MySQL master 服务 down掉,是否成功自动failover
- [用例测试] MySQL master too many connection,无权限,响应慢,是否自动failover
- [用例测试] MySQL master 服务器 down掉,且候选master落后的最多, 是否自动failover,知否可以成功的做日志补偿
- [用例测试] MySQL slave 服务 down掉,是否自动failover
- [用例测试] MySQL 有一台slave,或者多台slave服务器延迟很大,是否自动failover
- [用例测试] MySQL slave IO/SQL线程 stop,是否自动failover
- [用例测试] MySQL slave IO/SQL线程 报错,是否自动failover
- [用例测试] MySQL master 有大事务超过100s再执行,是否可以online master switch
- [用例测试] MySQL master 网络断掉,是否自动failover
- [用例测试] MySQL master 网路瞬断(1~30秒),是否自动failover
- [用例测试] MySQL master 和 候选master 网络都挂掉的情况,是否自动failover
- [用例测试] MHA manager 和 MySQL master之间的网络断掉,但是master和slave之间的网络是好的,是否自动failover
- [用例测试] MHA manager 和 MySQL master,slave 网络都断掉的情况,是否自动failover
- [用例测试] GTID模式下,还需要relay-log吗?是否能够成功的补齐日志
- [用例测试] 多线程复制模式下,做failover 和 online master switch,会不会有问题呢?
- [用例测试] 在一开始没有开启MHA的group中,如何做到日志补偿,然后change master呢?
4.0 简单的failover过程
- step1:安装MHA node在所有节点上
- step2:安装MHA manager
- step3:创建配置文件
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
manager_host$ cat /etc/app1.cnf [server default] # mysql user and password user=root password=mysqlpass ssh_user=root # working directory on the manager manager_workdir=/var/log/masterha/app1 # working directory on MySQL servers remote_workdir=/var/log/masterha/app1 [server1] hostname=host1 [server2] hostname=host2 [server3] hostname=host3 |
- step4: 检查SSH 互信
1 |
# masterha_check_ssh --conf=/etc/app1.cnf |
- step5: 检查复制配置
1 |
manager_host$ masterha_check_repl --conf=/etc/app1.cnf |
- step6:开启Manager
1 |
manager_host$ masterha_manager --conf=/etc/app1.cnf |
- step7: 检查manager的状态
1 |
manager_host$ masterha_check_status --conf=/etc/app1.cnf |
- step8: 测试关闭manager
1 |
manager_host$ masterha_stop --conf=/etc/app1.cnf |
- step9:测试master failover
1 |
host1$ killall -9 mysqld mysqld_safe |
- 未完成的步骤(进阶)
1 2 3 4 5 6 7 |
* 二次检测: 检查master是否真的挂了,避免脑裂 secondary_network_script * master_ip_failover: 默认的是空,什么都不做 ip漂移 新master赋予写 日志保留 报警 等等 |
4.2 测试场景一、 手工操作,手动完成,failover
- step1: 检查各种MHA manager的各种环境
1 2 3 4 5 6 7 8 9 10 11 12 |
* 检查SSH masterha_check_ssh --conf=/etc/app1.cnf * 检查复制(如果是非GTID模式,slave如果没有设置relay_log_purge=0,那么会有警告) masterha_check_repl --conf=/etc/app1.cnf Thu Jul 28 14:03:33 2016 - [warning] relay_log_purge=0 is not set on slave host_2(host_2:3306). Thu Jul 28 14:03:33 2016 - [warning] relay_log_purge=0 is not set on slave host_3(host_3:3306). * 检查mha状态 masterha_check_status --conf=/etc/app1.cnf * 开启mha 监控 nohup masterha_manager --conf=/etc/app1.cnf & * 关闭mha 监控 masterha_stop --conf=/etc/app1.cnf |
- step2: 手工制造master mysql挂了
1 |
killall -9 mysqld mysqld_safe |
- step3: 人工交互式failover
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 |
masterha_master_switch --master_state=dead --conf=/etc/app1.cnf --dead_master_host=host_1 --interactive=1 --ignore_last_failover * 切换日志如下: --dead_master_ip=<dead_master_ip> is not set. Using host_1. --dead_master_port=<dead_master_port> is not set. Using 3306. Thu Jul 28 16:37:04 2016 - [info] Reading default configuration from /etc/masterha_default.cnf.. Thu Jul 28 16:37:04 2016 - [info] Reading application default configuration from /etc/app1.cnf.. Thu Jul 28 16:37:04 2016 - [info] Reading server configuration from /etc/app1.cnf.. Thu Jul 28 16:37:04 2016 - [info] MHA::MasterFailover version 0.56. Thu Jul 28 16:37:04 2016 - [info] Starting master failover. Thu Jul 28 16:37:04 2016 - [info] Thu Jul 28 16:37:04 2016 - [info] * Phase 1: Configuration Check Phase.. Thu Jul 28 16:37:04 2016 - [info] Thu Jul 28 16:37:04 2016 - [info] GTID failover mode = 0 Thu Jul 28 16:37:04 2016 - [info] Dead Servers: Thu Jul 28 16:37:04 2016 - [info] host_1(host_1:3306) Thu Jul 28 16:37:04 2016 - [info] Checking master reachability via MySQL(double check)... Thu Jul 28 16:37:04 2016 - [info] ok. Thu Jul 28 16:37:04 2016 - [info] Alive Servers: Thu Jul 28 16:37:04 2016 - [info] host_2(host_2:3306) Thu Jul 28 16:37:04 2016 - [info] host_3(host_3:3306) Thu Jul 28 16:37:04 2016 - [info] Alive Slaves: Thu Jul 28 16:37:04 2016 - [info] host_2(host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled Thu Jul 28 16:37:04 2016 - [info] Replicating from host_1(host_1:3306) Thu Jul 28 16:37:04 2016 - [info] Primary candidate for the new Master (candidate_master is set) Thu Jul 28 16:37:04 2016 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled Thu Jul 28 16:37:04 2016 - [info] Replicating from host_1(host_1:3306) Thu Jul 28 16:37:04 2016 - [info] Not candidate for the new Master (no_master is set) Master host_1(host_1:3306) is dead. Proceed? (yes/NO): yes Thu Jul 28 16:37:06 2016 - [info] Starting Non-GTID based failover. Thu Jul 28 16:37:06 2016 - [info] Thu Jul 28 16:37:06 2016 - [info] ** Phase 1: Configuration Check Phase completed. Thu Jul 28 16:37:06 2016 - [info] Thu Jul 28 16:37:06 2016 - [info] * Phase 2: Dead Master Shutdown Phase.. Thu Jul 28 16:37:06 2016 - [info] Thu Jul 28 16:37:06 2016 - [info] HealthCheck: SSH to host_1 is reachable. Thu Jul 28 16:37:07 2016 - [info] Forcing shutdown so that applications never connect to the current master.. Thu Jul 28 16:37:07 2016 - [info] Executing master IP deactivation script: Thu Jul 28 16:37:07 2016 - [info] /home/mysql/MHA/masterha/master_ip_failover --orig_master_host=host_1 --orig_master_ip=host_1 --orig_master_port=3306 --command=stopssh --ssh_user=root Thu Jul 28 16:37:07 2016 - [info] done. Thu Jul 28 16:37:07 2016 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master. Thu Jul 28 16:37:07 2016 - [info] * Phase 2: Dead Master Shutdown Phase completed. Thu Jul 28 16:37:07 2016 - [info] Thu Jul 28 16:37:07 2016 - [info] * Phase 3: Master Recovery Phase.. Thu Jul 28 16:37:07 2016 - [info] Thu Jul 28 16:37:07 2016 - [info] * Phase 3.1: Getting Latest Slaves Phase.. Thu Jul 28 16:37:07 2016 - [info] Thu Jul 28 16:37:07 2016 - [info] The latest binary log file/position on all slaves is host_1_name.000004:154 Thu Jul 28 16:37:07 2016 - [info] Latest slaves (Slaves that received relay log files to the latest): Thu Jul 28 16:37:07 2016 - [info] host_2(host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled Thu Jul 28 16:37:07 2016 - [info] Replicating from host_1(host_1:3306) Thu Jul 28 16:37:07 2016 - [info] Primary candidate for the new Master (candidate_master is set) Thu Jul 28 16:37:07 2016 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled Thu Jul 28 16:37:07 2016 - [info] Replicating from host_1(host_1:3306) Thu Jul 28 16:37:07 2016 - [info] Not candidate for the new Master (no_master is set) Thu Jul 28 16:37:07 2016 - [info] The oldest binary log file/position on all slaves is host_1_name.000004:154 Thu Jul 28 16:37:07 2016 - [info] Oldest slaves: Thu Jul 28 16:37:07 2016 - [info] host_2(host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled Thu Jul 28 16:37:07 2016 - [info] Replicating from host_1(host_1:3306) Thu Jul 28 16:37:07 2016 - [info] Primary candidate for the new Master (candidate_master is set) Thu Jul 28 16:37:07 2016 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled Thu Jul 28 16:37:07 2016 - [info] Replicating from host_1(host_1:3306) Thu Jul 28 16:37:07 2016 - [info] Not candidate for the new Master (no_master is set) Thu Jul 28 16:37:07 2016 - [info] Thu Jul 28 16:37:07 2016 - [info] * Phase 3.2: Saving Dead Master's Binlog Phase.. Thu Jul 28 16:37:07 2016 - [info] Thu Jul 28 16:37:07 2016 - [info] Fetching dead master's binary logs.. Thu Jul 28 16:37:07 2016 - [info] Executing command on the dead master host_1(host_1:3306): save_binary_logs --command=save --start_file=host_1_name.000004 --start_pos=154 --binlog_dir=/data/mysql.bin --output_file=/var/log/masterha/app1/saved_master_binlog_from_host_1_3306_20160728163704.binlog --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 Creating /var/log/masterha/app1 if not exists.. ok. Concat binary/relay logs from host_1_name.000004 pos 154 to host_1_name.000004 EOF into /var/log/masterha/app1/saved_master_binlog_from_host_1_3306_20160728163704.binlog .. Binlog Checksum enabled Dumping binlog format description event, from position 0 to 154.. ok. No need to dump effective binlog data from /data/mysql.bin/host_1_name.000004 (pos starts 154, filesize 154). Skipping. Binlog Checksum enabled /var/log/masterha/app1/saved_master_binlog_from_host_1_3306_20160728163704.binlog has no effective data events. Event not exists. Thu Jul 28 16:37:07 2016 - [info] Additional events were not found from the orig master. No need to save. Thu Jul 28 16:37:07 2016 - [info] Thu Jul 28 16:37:07 2016 - [info] * Phase 3.3: Determining New Master Phase.. Thu Jul 28 16:37:07 2016 - [info] Thu Jul 28 16:37:07 2016 - [info] Finding the latest slave that has all relay logs for recovering other slaves.. Thu Jul 28 16:37:07 2016 - [info] All slaves received relay logs to the same position. No need to resync each other. Thu Jul 28 16:37:07 2016 - [info] Searching new master from slaves.. Thu Jul 28 16:37:07 2016 - [info] Candidate masters from the configuration file: Thu Jul 28 16:37:07 2016 - [info] host_2(host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled Thu Jul 28 16:37:07 2016 - [info] Replicating from host_1(host_1:3306) Thu Jul 28 16:37:07 2016 - [info] Primary candidate for the new Master (candidate_master is set) Thu Jul 28 16:37:07 2016 - [info] Non-candidate masters: Thu Jul 28 16:37:07 2016 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled Thu Jul 28 16:37:07 2016 - [info] Replicating from host_1(host_1:3306) Thu Jul 28 16:37:07 2016 - [info] Not candidate for the new Master (no_master is set) Thu Jul 28 16:37:07 2016 - [info] Searching from candidate_master slaves which have received the latest relay log events.. Thu Jul 28 16:37:07 2016 - [info] New master is host_2(host_2:3306) Thu Jul 28 16:37:07 2016 - [info] Starting master failover.. Thu Jul 28 16:37:07 2016 - [info] From: host_1(host_1:3306) (current master) +--host_2(host_2:3306) +--host_3(host_3:3306) To: host_2(host_2:3306) (new master) +--host_3(host_3:3306) Starting master switch from host_1(host_1:3306) to host_2(host_2:3306)? (yes/NO): yes Thu Jul 28 16:37:09 2016 - [info] New master decided manually is host_2(host_2:3306) Thu Jul 28 16:37:09 2016 - [info] Thu Jul 28 16:37:09 2016 - [info] * Phase 3.3: New Master Diff Log Generation Phase.. Thu Jul 28 16:37:09 2016 - [info] Thu Jul 28 16:37:09 2016 - [info] This server has all relay logs. No need to generate diff files from the latest slave. Thu Jul 28 16:37:09 2016 - [info] Thu Jul 28 16:37:09 2016 - [info] * Phase 3.4: Master Log Apply Phase.. Thu Jul 28 16:37:09 2016 - [info] Thu Jul 28 16:37:09 2016 - [info] *NOTICE: If any error happens from this phase, manual recovery is needed. Thu Jul 28 16:37:09 2016 - [info] Starting recovery on host_2(host_2:3306).. Thu Jul 28 16:37:09 2016 - [info] This server has all relay logs. Waiting all logs to be applied.. Thu Jul 28 16:37:09 2016 - [info] done. Thu Jul 28 16:37:09 2016 - [info] All relay logs were successfully applied. Thu Jul 28 16:37:09 2016 - [info] Getting new master's binlog name and position.. Thu Jul 28 16:37:09 2016 - [info] host_2_name.000002:294 Thu Jul 28 16:37:09 2016 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='host_2', MASTER_PORT=3306, MASTER_LOG_FILE='host_2_name.000002', MASTER_LOG_POS=294, MASTER_USER='repl', MASTER_PASSWORD='xxx'; Thu Jul 28 16:37:09 2016 - [info] Executing master IP activate script: Thu Jul 28 16:37:09 2016 - [info] /home/mysql/MHA/masterha/master_ip_failover --command=start --ssh_user=root --orig_master_host=host_1 --orig_master_ip=host_1 --orig_master_port=3306 --new_master_host=host_2 --new_master_ip=host_2 --new_master_port=3306 --new_master_user='dba' --new_master_password='dba' Set read_only=0 on the new master. No need to Creating app user on the new master.. Thu Jul 28 16:37:09 2016 - [info] OK. Thu Jul 28 16:37:09 2016 - [info] ** Finished master recovery successfully. Thu Jul 28 16:37:09 2016 - [info] * Phase 3: Master Recovery Phase completed. Thu Jul 28 16:37:09 2016 - [info] Thu Jul 28 16:37:09 2016 - [info] * Phase 4: Slaves Recovery Phase.. Thu Jul 28 16:37:09 2016 - [info] Thu Jul 28 16:37:09 2016 - [info] * Phase 4.1: Starting Parallel Slave Diff Log Generation Phase.. Thu Jul 28 16:37:09 2016 - [info] Thu Jul 28 16:37:09 2016 - [info] -- Slave diff file generation on host host_3(host_3:3306) started, pid: 6211. Check tmp log /var/log/masterha/app1/host_3_3306_20160728163704.log if it takes time.. Thu Jul 28 16:37:09 2016 - [info] Thu Jul 28 16:37:09 2016 - [info] Log messages from host_3 ... Thu Jul 28 16:37:09 2016 - [info] Thu Jul 28 16:37:09 2016 - [info] This server has all relay logs. No need to generate diff files from the latest slave. Thu Jul 28 16:37:09 2016 - [info] End of log messages from host_3. Thu Jul 28 16:37:09 2016 - [info] -- host_3(host_3:3306) has the latest relay log events. Thu Jul 28 16:37:09 2016 - [info] Generating relay diff files from the latest slave succeeded. Thu Jul 28 16:37:09 2016 - [info] Thu Jul 28 16:37:09 2016 - [info] * Phase 4.2: Starting Parallel Slave Log Apply Phase.. Thu Jul 28 16:37:09 2016 - [info] Thu Jul 28 16:37:09 2016 - [info] -- Slave recovery on host host_3(host_3:3306) started, pid: 6216. Check tmp log /var/log/masterha/app1/host_3_3306_20160728163704.log if it takes time.. Thu Jul 28 16:37:10 2016 - [info] Thu Jul 28 16:37:10 2016 - [info] Log messages from host_3 ... Thu Jul 28 16:37:10 2016 - [info] Thu Jul 28 16:37:09 2016 - [info] Starting recovery on host_3(host_3:3306).. Thu Jul 28 16:37:09 2016 - [info] This server has all relay logs. Waiting all logs to be applied.. Thu Jul 28 16:37:09 2016 - [info] done. Thu Jul 28 16:37:09 2016 - [info] All relay logs were successfully applied. Thu Jul 28 16:37:09 2016 - [info] Resetting slave host_3(host_3:3306) and starting replication from the new master host_2(host_2:3306).. Thu Jul 28 16:37:09 2016 - [info] Executed CHANGE MASTER. Thu Jul 28 16:37:10 2016 - [info] Slave started. Thu Jul 28 16:37:10 2016 - [info] End of log messages from host_3. Thu Jul 28 16:37:10 2016 - [info] -- Slave recovery on host host_3(host_3:3306) succeeded. Thu Jul 28 16:37:10 2016 - [info] All new slave servers recovered successfully. Thu Jul 28 16:37:10 2016 - [info] Thu Jul 28 16:37:10 2016 - [info] * Phase 5: New master cleanup phase.. Thu Jul 28 16:37:10 2016 - [info] Thu Jul 28 16:37:10 2016 - [info] Resetting slave info on the new master.. Thu Jul 28 16:37:10 2016 - [info] host_2: Resetting slave info succeeded. Thu Jul 28 16:37:10 2016 - [info] Master failover to host_2(host_2:3306) completed successfully. Thu Jul 28 16:37:10 2016 - [info] ----- Failover Report ----- app1: MySQL Master failover host_1(host_1:3306) to host_2(host_2:3306) succeeded Master host_1(host_1:3306) is down! Check MHA Manager logs at host_manager_name for details. Started manual(interactive) failover. Invalidated master IP address on host_1(host_1:3306) The latest slave host_2(host_2:3306) has all relay logs for recovery. Selected host_2(host_2:3306) as a new master. host_2(host_2:3306): OK: Applying all logs succeeded. host_2(host_2:3306): OK: Activated master IP address. host_3(host_3:3306): This host has the latest relay log events. Generating relay diff files from the latest slave succeeded. host_3(host_3:3306): OK: Applying all logs succeeded. Slave started, replicating from host_2(host_2:3306) host_2(host_2:3306): Resetting slave info succeeded. Master failover to host_2(host_2:3306) completed successfully. |
4.3 [功能测试] 测试场景二、手工操作, 自动完成,failover
- step1: 检查各种MHA manager的各种环境
1 2 3 4 5 6 7 8 9 10 11 12 |
* 检查SSH masterha_check_ssh --conf=/etc/app1.cnf * 检查复制(如果是非GTID模式,slave如果没有设置relay_log_purge=0,那么会有警告) masterha_check_repl --conf=/etc/app1.cnf Thu Jul 28 14:03:33 2016 - [warning] relay_log_purge=0 is not set on slave host_2(host_2:3306). Thu Jul 28 14:03:33 2016 - [warning] relay_log_purge=0 is not set on slave host_3(host_3:3306). * 检查mha状态 masterha_check_status --conf=/etc/app1.cnf * 开启mha 监控 nohup masterha_manager --conf=/etc/app1.cnf & * 关闭mha 监控 masterha_stop --conf=/etc/app1.cnf |
- step2: 手工制造master mysql挂了
1 |
killall -9 mysqld mysqld_safe |
- step3: 人工-非交互式 failover
1 2 3 4 5 6 7 |
masterha_master_switch --master_state=dead --conf=/etc/app1.cnf --dead_master_host=host_1 --interactive=0 --ignore_last_failover * 切换日志如下: --dead_master_ip=<dead_master_ip> is not set. Using host_1. --dead_master_port=<dead_master_port> is not set. Using 3306. Thu Jul 28 16:33:21 2016 - [info] Reading default configuration from /etc/masterha_default.cnf.. Thu Jul 28 16:33:21 2016 - [info] Reading application default configuration from /etc/app1.cnf.. Thu Jul 28 16:33:21 2016 - [info] Reading server configuration from /etc/app1.cnf.. |
4.4 [功能测试] 测试场景三、手工操作,自动完成,online master switch
- step1: 检查各种MHA manager的各种环境
1 2 3 4 5 6 7 8 9 10 |
* 检查SSH masterha_check_ssh --conf=/etc/app1.cnf * 检查复制(如果是非GTID模式,slave如果没有设置relay_log_purge=0,那么会有警告) masterha_check_repl --conf=/etc/app1.cnf Thu Jul 28 14:03:33 2016 - [warning] relay_log_purge=0 is not set on slave host_2(host_2:3306). Thu Jul 28 14:03:33 2016 - [warning] relay_log_purge=0 is not set on slave host_3(host_3:3306). * 检查mha状态 masterha_check_status --conf=/etc/app1.cnf * 开启mha 监控 nohup masterha_manager --conf=/etc/app1.cnf & |
- step2: 手工制造master mysql挂了
1 |
killall -9 mysqld mysqld_safe |
- step3: 手工操作,自动完成,online master switch
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 |
masterha_master_switch --master_state=alive --conf=/etc/app1.cnf --orig_master_is_new_slave --interactive=0 * 在online master切换过程中,如果MHA manager 开启,那么切换不会成功 Thu Jul 28 16:13:57 2016 - [info] Checking MHA is not monitoring or doing failover.. Thu Jul 28 16:13:57 2016 - [error][/usr/share/perl5/vendor_perl/MHA/MasterRotate.pm, ln142] Getting advisory lock failed on the current master. MHA Monitor runs on the current master. Stop MHA Manager/Monitor and try again. Thu Jul 28 16:13:57 2016 - [error][/usr/share/perl5/vendor_perl/MHA/ManagerUtil.pm, ln177] Got ERROR: at /usr/bin/masterha_master_switch line 53 * 在online master切换过程中,如果设置--orig_master_is_new_slave, 但是没有设置repl_password,就会报错 Thu Jul 28 16:15:13 2016 - [error][/usr/share/perl5/vendor_perl/MHA/Server.pm, ln784] Slave could not be started on host_1(host_1:3306)! Check slave status. Thu Jul 28 16:15:13 2016 - [error][/usr/share/perl5/vendor_perl/MHA/Server.pm, ln862] Starting slave IO/SQL thread on host_1(host_1:3306) failed! Thu Jul 28 16:15:13 2016 - [error][/usr/share/perl5/vendor_perl/MHA/MasterRotate.pm, ln573] Failed! Thu Jul 28 16:15:13 2016 - [error][/usr/share/perl5/vendor_perl/MHA/MasterRotate.pm, ln602] Switching master to host_2(host_2:3306) done, but switching slaves partially failed. * 切换成功的日志如下: Thu Jul 28 16:23:25 2016 - [info] MHA::MasterRotate version 0.56. Thu Jul 28 16:23:25 2016 - [info] Starting online master switch.. Thu Jul 28 16:23:25 2016 - [info] Thu Jul 28 16:23:25 2016 - [info] * Phase 1: Configuration Check Phase.. Thu Jul 28 16:23:25 2016 - [info] Thu Jul 28 16:23:25 2016 - [info] Reading default configuration from /etc/masterha_default.cnf.. Thu Jul 28 16:23:25 2016 - [info] Reading application default configuration from /etc/app1.cnf.. Thu Jul 28 16:23:25 2016 - [info] Reading server configuration from /etc/app1.cnf.. Thu Jul 28 16:23:25 2016 - [info] GTID failover mode = 0 Thu Jul 28 16:23:25 2016 - [info] Current Alive Master: host_2(host_2:3306) Thu Jul 28 16:23:25 2016 - [info] Alive Slaves: Thu Jul 28 16:23:25 2016 - [info] host_1(host_1:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled Thu Jul 28 16:23:25 2016 - [info] Replicating from host_2(host_2:3306) Thu Jul 28 16:23:25 2016 - [info] Primary candidate for the new Master (candidate_master is set) Thu Jul 28 16:23:25 2016 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled Thu Jul 28 16:23:25 2016 - [info] Replicating from host_2(host_2:3306) Thu Jul 28 16:23:25 2016 - [info] Not candidate for the new Master (no_master is set) Thu Jul 28 16:23:25 2016 - [info] Executing FLUSH NO_WRITE_TO_BINLOG TABLES. This may take long time.. Thu Jul 28 16:23:25 2016 - [info] ok. Thu Jul 28 16:23:25 2016 - [info] Checking MHA is not monitoring or doing failover.. Thu Jul 28 16:23:25 2016 - [info] Checking replication health on host_1.. Thu Jul 28 16:23:25 2016 - [info] ok. Thu Jul 28 16:23:25 2016 - [info] Checking replication health on host_3.. Thu Jul 28 16:23:25 2016 - [info] ok. Thu Jul 28 16:23:25 2016 - [info] Searching new master from slaves.. Thu Jul 28 16:23:25 2016 - [info] Candidate masters from the configuration file: Thu Jul 28 16:23:25 2016 - [info] host_1(host_1:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled Thu Jul 28 16:23:25 2016 - [info] Replicating from host_2(host_2:3306) Thu Jul 28 16:23:25 2016 - [info] Primary candidate for the new Master (candidate_master is set) Thu Jul 28 16:23:25 2016 - [info] host_2(host_2:3306) Version=5.7.13-log log-bin:enabled Thu Jul 28 16:23:25 2016 - [info] Non-candidate masters: Thu Jul 28 16:23:25 2016 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled Thu Jul 28 16:23:25 2016 - [info] Replicating from host_2(host_2:3306) Thu Jul 28 16:23:25 2016 - [info] Not candidate for the new Master (no_master is set) Thu Jul 28 16:23:25 2016 - [info] Searching from candidate_master slaves which have received the latest relay log events.. Thu Jul 28 16:23:25 2016 - [info] From: host_2(host_2:3306) (current master) +--host_1(host_1:3306) +--host_3(host_3:3306) To: host_1(host_1:3306) (new master) +--host_3(host_3:3306) +--host_2(host_2:3306) Thu Jul 28 16:23:25 2016 - [info] Checking whether host_1(host_1:3306) is ok for the new master.. Thu Jul 28 16:23:25 2016 - [info] ok. Thu Jul 28 16:23:25 2016 - [info] host_2(host_2:3306): SHOW SLAVE STATUS returned empty result. To check replication filtering rules, temporarily executing CHANGE MASTER to a dummy host. Thu Jul 28 16:23:25 2016 - [info] host_2(host_2:3306): Resetting slave pointing to the dummy host. Thu Jul 28 16:23:25 2016 - [info] ** Phase 1: Configuration Check Phase completed. Thu Jul 28 16:23:25 2016 - [info] Thu Jul 28 16:23:25 2016 - [info] * Phase 2: Rejecting updates Phase.. Thu Jul 28 16:23:25 2016 - [info] Thu Jul 28 16:23:25 2016 - [warning] master_ip_online_change_script is not defined. Skipping disabling writes on the current master. Thu Jul 28 16:23:25 2016 - [info] Locking all tables on the orig master to reject updates from everybody (including root): Thu Jul 28 16:23:25 2016 - [info] Executing FLUSH TABLES WITH READ LOCK.. Thu Jul 28 16:23:25 2016 - [info] ok. Thu Jul 28 16:23:25 2016 - [info] Orig master binlog:pos is host_2_name.000002:294. Thu Jul 28 16:23:25 2016 - [info] Waiting to execute all relay logs on host_1(host_1:3306).. Thu Jul 28 16:23:25 2016 - [info] master_pos_wait(host_2_name.000002:294) completed on host_1(host_1:3306). Executed 0 events. Thu Jul 28 16:23:25 2016 - [info] done. Thu Jul 28 16:23:25 2016 - [info] Getting new master's binlog name and position.. Thu Jul 28 16:23:25 2016 - [info] host_1_name.000004:154 Thu Jul 28 16:23:25 2016 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='host_1', MASTER_PORT=3306, MASTER_LOG_FILE='host_1_name.000004', MASTER_LOG_POS=154, MASTER_USER='repl', MASTER_PASSWORD='xxx'; Thu Jul 28 16:23:25 2016 - [info] Setting read_only=0 on host_1(host_1:3306).. Thu Jul 28 16:23:26 2016 - [info] ok. Thu Jul 28 16:23:26 2016 - [info] Thu Jul 28 16:23:26 2016 - [info] * Switching slaves in parallel.. Thu Jul 28 16:23:26 2016 - [info] Thu Jul 28 16:23:26 2016 - [info] -- Slave switch on host host_3(host_3:3306) started, pid: 32120 Thu Jul 28 16:23:26 2016 - [info] Thu Jul 28 16:23:26 2016 - [info] Log messages from host_3 ... Thu Jul 28 16:23:26 2016 - [info] Thu Jul 28 16:23:26 2016 - [info] Waiting to execute all relay logs on host_3(host_3:3306).. Thu Jul 28 16:23:26 2016 - [info] master_pos_wait(host_2_name.000002:294) completed on host_3(host_3:3306). Executed 0 events. Thu Jul 28 16:23:26 2016 - [info] done. Thu Jul 28 16:23:26 2016 - [info] Resetting slave host_3(host_3:3306) and starting replication from the new master host_1(host_1:3306).. Thu Jul 28 16:23:26 2016 - [info] Executed CHANGE MASTER. Thu Jul 28 16:23:26 2016 - [info] Slave started. Thu Jul 28 16:23:26 2016 - [info] End of log messages from host_3 ... Thu Jul 28 16:23:26 2016 - [info] Thu Jul 28 16:23:26 2016 - [info] -- Slave switch on host host_3(host_3:3306) succeeded. Thu Jul 28 16:23:26 2016 - [info] Unlocking all tables on the orig master: Thu Jul 28 16:23:26 2016 - [info] Executing UNLOCK TABLES.. Thu Jul 28 16:23:26 2016 - [info] ok. Thu Jul 28 16:23:26 2016 - [info] Starting orig master as a new slave.. Thu Jul 28 16:23:26 2016 - [info] Resetting slave host_2(host_2:3306) and starting replication from the new master host_1(host_1:3306).. Thu Jul 28 16:23:26 2016 - [info] Executed CHANGE MASTER. Thu Jul 28 16:23:26 2016 - [info] Slave started. Thu Jul 28 16:23:26 2016 - [info] All new slave servers switched successfully. Thu Jul 28 16:23:26 2016 - [info] Thu Jul 28 16:23:26 2016 - [info] * Phase 5: New master cleanup phase.. Thu Jul 28 16:23:26 2016 - [info] Thu Jul 28 16:23:26 2016 - [info] host_1: Resetting slave info succeeded. Thu Jul 28 16:23:26 2016 - [info] Switching master to host_1(host_1:3306) completed successfully. |
4.5 测试场景四、自动监控,自动操作,自动完成,failover
- step1: 检查各种MHA manager的各种环境
1 2 3 4 5 6 7 8 9 10 |
* 检查SSH masterha_check_ssh --conf=/etc/app1.cnf * 检查复制(如果是非GTID模式,slave如果没有设置relay_log_purge=0,那么会有警告) masterha_check_repl --conf=/etc/app1.cnf Thu Jul 28 14:03:33 2016 - [warning] relay_log_purge=0 is not set on slave host_2(host_2:3306). Thu Jul 28 14:03:33 2016 - [warning] relay_log_purge=0 is not set on slave host_3(host_3:3306). * 检查mha状态 masterha_check_status --conf=/etc/app1.cnf * 开启mha 监控 nohup masterha_manager --conf=/etc/app1.cnf & |
- step2: 手工制造master mysql挂了
1 |
killall -9 mysqld mysqld_safe |
- step3: 自动监控,自动操作,自动完成,自动failover
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 |
Thu Jul 28 17:10:25 2016 - [info] MHA::MasterMonitor version 0.56. Thu Jul 28 17:10:26 2016 - [info] GTID failover mode = 0 Thu Jul 28 17:10:26 2016 - [info] Dead Servers: Thu Jul 28 17:10:26 2016 - [info] Alive Servers: Thu Jul 28 17:10:26 2016 - [info] host_1(host_1:3306) Thu Jul 28 17:10:26 2016 - [info] host_2(host_2:3306) Thu Jul 28 17:10:26 2016 - [info] host_3(host_3:3306) Thu Jul 28 17:10:26 2016 - [info] Alive Slaves: Thu Jul 28 17:10:26 2016 - [info] host_2(host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled Thu Jul 28 17:10:26 2016 - [info] Replicating from host_1(host_1:3306) Thu Jul 28 17:10:26 2016 - [info] Primary candidate for the new Master (candidate_master is set) Thu Jul 28 17:10:26 2016 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled Thu Jul 28 17:10:26 2016 - [info] Replicating from host_1(host_1:3306) Thu Jul 28 17:10:26 2016 - [info] Not candidate for the new Master (no_master is set) Thu Jul 28 17:10:26 2016 - [info] Current Alive Master: host_1(host_1:3306) Thu Jul 28 17:10:26 2016 - [info] Checking slave configurations.. Thu Jul 28 17:10:26 2016 - [info] read_only=1 is not set on slave host_2(host_2:3306). Thu Jul 28 17:10:26 2016 - [warning] relay_log_purge=0 is not set on slave host_3(host_3:3306). Thu Jul 28 17:10:26 2016 - [info] Checking replication filtering settings.. Thu Jul 28 17:10:26 2016 - [info] binlog_do_db= , binlog_ignore_db= Thu Jul 28 17:10:26 2016 - [info] Replication filtering check ok. Thu Jul 28 17:10:26 2016 - [info] GTID (with auto-pos) is not supported Thu Jul 28 17:10:26 2016 - [info] Starting SSH connection tests.. Thu Jul 28 17:10:27 2016 - [info] All SSH connection tests passed successfully. Thu Jul 28 17:10:27 2016 - [info] Checking MHA Node version.. Thu Jul 28 17:10:28 2016 - [info] Version check ok. Thu Jul 28 17:10:28 2016 - [info] Checking SSH publickey authentication settings on the current master.. Thu Jul 28 17:10:28 2016 - [info] HealthCheck: SSH to host_1 is reachable. Thu Jul 28 17:10:29 2016 - [info] Master MHA Node version is 0.56. Thu Jul 28 17:10:29 2016 - [info] Checking recovery script configurations on host_1(host_1:3306).. Thu Jul 28 17:10:29 2016 - [info] Executing command: save_binary_logs --command=test --start_pos=4 --binlog_dir=/data/mysql.bin --output_file=/var/log/masterha/app1/save_binary_logs_test --manager_version=0.56 --start_file=host_1_name.000004 Thu Jul 28 17:10:29 2016 - [info] Connecting to root@host_1(host_1:22).. Creating /var/log/masterha/app1 if not exists.. ok. Checking output directory is accessible or not.. ok. Binlog found at /data/mysql.bin, up to host_1_name.000004 Thu Jul 28 17:10:29 2016 - [info] Binlog setting check done. Thu Jul 28 17:10:29 2016 - [info] Checking SSH publickey authentication and checking recovery script configurations on all alive slave servers.. Thu Jul 28 17:10:29 2016 - [info] Executing command : apply_diff_relay_logs --command=test --slave_user='dba' --slave_host=host_2 --slave_ip=host_2 --slave_port=3306 --workdir=/var/log/masterha/app1 --target_version=5.7.13-log --manager_version=0.56 --relay_dir=/data/mysql_data --current_relay_log=host_2_name-relay-bin.000001 --slave_pass=xxx Thu Jul 28 17:10:29 2016 - [info] Connecting to root@host_2(host_2:22).. Checking slave recovery environment settings.. Relay log found at /data/mysql_data, up to host_2_name-relay-bin.000002 Temporary relay log file is /data/mysql_data/host_2_name-relay-bin.000002 Testing mysql connection and privileges..mysql: [Warning] Using a password on the command line interface can be insecure. done. Testing mysqlbinlog output.. done. Cleaning up test file(s).. done. Thu Jul 28 17:10:29 2016 - [info] Executing command : apply_diff_relay_logs --command=test --slave_user='dba' --slave_host=host_3 --slave_ip=host_3 --slave_port=3306 --workdir=/var/log/masterha/app1 --target_version=5.7.13-log --manager_version=0.56 --relay_dir=/data/mysql_data --current_relay_log=host_3_name-relay-bin.000001 --slave_pass=xxx Thu Jul 28 17:10:29 2016 - [info] Connecting to root@host_3(host_3:22).. Checking slave recovery environment settings.. Relay log found at /data/mysql_data, up to host_3_name-relay-bin.000002 Temporary relay log file is /data/mysql_data/host_3_name-relay-bin.000002 Testing mysql connection and privileges..mysql: [Warning] Using a password on the command line interface can be insecure. done. Testing mysqlbinlog output.. done. Cleaning up test file(s).. done. Thu Jul 28 17:10:29 2016 - [info] Slaves settings check done. Thu Jul 28 17:10:29 2016 - [info] host_1(host_1:3306) (current master) +--host_2(host_2:3306) +--host_3(host_3:3306) Thu Jul 28 17:10:29 2016 - [info] Checking master_ip_failover_script status: Thu Jul 28 17:10:29 2016 - [info] /home/mysql/MHA/masterha/master_ip_failover --command=status --ssh_user=root --orig_master_host=host_1 --orig_master_ip=host_1 --orig_master_port=3306 Thu Jul 28 17:10:30 2016 - [info] OK. Thu Jul 28 17:10:30 2016 - [warning] shutdown_script is not defined. Thu Jul 28 17:10:30 2016 - [info] Set master ping interval 3 seconds. Thu Jul 28 17:10:30 2016 - [warning] secondary_check_script is not defined. It is highly recommended setting it to check master reachability from two or more routes. Thu Jul 28 17:10:30 2016 - [info] Starting ping health check on host_1(host_1:3306).. Thu Jul 28 17:10:30 2016 - [info] Ping(SELECT) succeeded, waiting until MySQL doesn't respond.. Thu Jul 28 17:11:21 2016 - [warning] Got error on MySQL select ping: 2006 (MySQL server has gone away) Thu Jul 28 17:11:21 2016 - [info] Executing SSH check script: save_binary_logs --command=test --start_pos=4 --binlog_dir=/data/mysql.bin --output_file=/var/log/masterha/app1/save_binary_logs_test --manager_version=0.56 --binlog_prefix=host_1_name Thu Jul 28 17:11:21 2016 - [info] HealthCheck: SSH to host_1 is reachable. Thu Jul 28 17:11:24 2016 - [warning] Got error on MySQL connect: 2013 (Lost connection to MySQL server at 'reading initial communication packet', system error: 111) Thu Jul 28 17:11:24 2016 - [warning] Connection failed 2 time(s).. Thu Jul 28 17:11:27 2016 - [warning] Got error on MySQL connect: 2013 (Lost connection to MySQL server at 'reading initial communication packet', system error: 111) Thu Jul 28 17:11:27 2016 - [warning] Connection failed 3 time(s).. Thu Jul 28 17:11:30 2016 - [warning] Got error on MySQL connect: 2013 (Lost connection to MySQL server at 'reading initial communication packet', system error: 111) Thu Jul 28 17:11:30 2016 - [warning] Connection failed 4 time(s).. Thu Jul 28 17:11:30 2016 - [warning] Master is not reachable from health checker! Thu Jul 28 17:11:30 2016 - [warning] Master host_1(host_1:3306) is not reachable! Thu Jul 28 17:11:30 2016 - [warning] SSH is reachable. Thu Jul 28 17:11:30 2016 - [info] Connecting to a master server failed. Reading configuration file /etc/masterha_default.cnf and /etc/app1.cnf again, and trying to connect to all servers to check server status.. Thu Jul 28 17:11:30 2016 - [info] Reading default configuration from /etc/masterha_default.cnf.. Thu Jul 28 17:11:30 2016 - [info] Reading application default configuration from /etc/app1.cnf.. Thu Jul 28 17:11:30 2016 - [info] Reading server configuration from /etc/app1.cnf.. Thu Jul 28 17:11:30 2016 - [info] GTID failover mode = 0 Thu Jul 28 17:11:30 2016 - [info] Dead Servers: Thu Jul 28 17:11:30 2016 - [info] host_1(host_1:3306) Thu Jul 28 17:11:30 2016 - [info] Alive Servers: Thu Jul 28 17:11:30 2016 - [info] host_2(host_2:3306) Thu Jul 28 17:11:30 2016 - [info] host_3(host_3:3306) Thu Jul 28 17:11:30 2016 - [info] Alive Slaves: Thu Jul 28 17:11:30 2016 - [info] host_2(host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled Thu Jul 28 17:11:30 2016 - [info] Replicating from host_1(host_1:3306) Thu Jul 28 17:11:30 2016 - [info] Primary candidate for the new Master (candidate_master is set) Thu Jul 28 17:11:30 2016 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled Thu Jul 28 17:11:30 2016 - [info] Replicating from host_1(host_1:3306) Thu Jul 28 17:11:30 2016 - [info] Not candidate for the new Master (no_master is set) Thu Jul 28 17:11:30 2016 - [info] Checking slave configurations.. Thu Jul 28 17:11:30 2016 - [info] read_only=1 is not set on slave host_2(host_2:3306). Thu Jul 28 17:11:30 2016 - [warning] relay_log_purge=0 is not set on slave host_3(host_3:3306). Thu Jul 28 17:11:30 2016 - [info] Checking replication filtering settings.. Thu Jul 28 17:11:30 2016 - [info] Replication filtering check ok. Thu Jul 28 17:11:30 2016 - [info] Master is down! Thu Jul 28 17:11:30 2016 - [info] Terminating monitoring script. Thu Jul 28 17:11:30 2016 - [info] Got exit code 20 (Master dead). Thu Jul 28 17:11:30 2016 - [info] MHA::MasterFailover version 0.56. Thu Jul 28 17:11:30 2016 - [info] Starting master failover. Thu Jul 28 17:11:30 2016 - [info] Thu Jul 28 17:11:30 2016 - [info] * Phase 1: Configuration Check Phase.. Thu Jul 28 17:11:30 2016 - [info] Thu Jul 28 17:11:30 2016 - [info] GTID failover mode = 0 Thu Jul 28 17:11:30 2016 - [info] Dead Servers: Thu Jul 28 17:11:30 2016 - [info] host_1(host_1:3306) Thu Jul 28 17:11:30 2016 - [info] Checking master reachability via MySQL(double check)... Thu Jul 28 17:11:30 2016 - [info] ok. Thu Jul 28 17:11:30 2016 - [info] Alive Servers: Thu Jul 28 17:11:30 2016 - [info] host_2(host_2:3306) Thu Jul 28 17:11:30 2016 - [info] host_3(host_3:3306) Thu Jul 28 17:11:30 2016 - [info] Alive Slaves: Thu Jul 28 17:11:30 2016 - [info] host_2(host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled Thu Jul 28 17:11:30 2016 - [info] Replicating from host_1(host_1:3306) Thu Jul 28 17:11:30 2016 - [info] Primary candidate for the new Master (candidate_master is set) Thu Jul 28 17:11:30 2016 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled Thu Jul 28 17:11:30 2016 - [info] Replicating from host_1(host_1:3306) Thu Jul 28 17:11:30 2016 - [info] Not candidate for the new Master (no_master is set) Thu Jul 28 17:11:30 2016 - [info] Starting Non-GTID based failover. Thu Jul 28 17:11:30 2016 - [info] Thu Jul 28 17:11:30 2016 - [info] ** Phase 1: Configuration Check Phase completed. Thu Jul 28 17:11:30 2016 - [info] Thu Jul 28 17:11:30 2016 - [info] * Phase 2: Dead Master Shutdown Phase.. Thu Jul 28 17:11:30 2016 - [info] Thu Jul 28 17:11:30 2016 - [info] Forcing shutdown so that applications never connect to the current master.. Thu Jul 28 17:11:30 2016 - [info] Executing master IP deactivation script: Thu Jul 28 17:11:30 2016 - [info] /home/mysql/MHA/masterha/master_ip_failover --orig_master_host=host_1 --orig_master_ip=host_1 --orig_master_port=3306 --command=stopssh --ssh_user=root Thu Jul 28 17:11:30 2016 - [info] done. Thu Jul 28 17:11:30 2016 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master. Thu Jul 28 17:11:30 2016 - [info] * Phase 2: Dead Master Shutdown Phase completed. Thu Jul 28 17:11:30 2016 - [info] Thu Jul 28 17:11:30 2016 - [info] * Phase 3: Master Recovery Phase.. Thu Jul 28 17:11:30 2016 - [info] Thu Jul 28 17:11:30 2016 - [info] * Phase 3.1: Getting Latest Slaves Phase.. Thu Jul 28 17:11:30 2016 - [info] Thu Jul 28 17:11:30 2016 - [info] The latest binary log file/position on all slaves is host_1_name.000004:154 Thu Jul 28 17:11:30 2016 - [info] Latest slaves (Slaves that received relay log files to the latest): Thu Jul 28 17:11:30 2016 - [info] host_2(host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled Thu Jul 28 17:11:30 2016 - [info] Replicating from host_1(host_1:3306) Thu Jul 28 17:11:30 2016 - [info] Primary candidate for the new Master (candidate_master is set) Thu Jul 28 17:11:30 2016 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled Thu Jul 28 17:11:30 2016 - [info] Replicating from host_1(host_1:3306) Thu Jul 28 17:11:30 2016 - [info] Not candidate for the new Master (no_master is set) Thu Jul 28 17:11:30 2016 - [info] The oldest binary log file/position on all slaves is host_1_name.000004:154 Thu Jul 28 17:11:30 2016 - [info] Oldest slaves: Thu Jul 28 17:11:30 2016 - [info] host_2(host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled Thu Jul 28 17:11:30 2016 - [info] Replicating from host_1(host_1:3306) Thu Jul 28 17:11:30 2016 - [info] Primary candidate for the new Master (candidate_master is set) Thu Jul 28 17:11:30 2016 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled Thu Jul 28 17:11:30 2016 - [info] Replicating from host_1(host_1:3306) Thu Jul 28 17:11:30 2016 - [info] Not candidate for the new Master (no_master is set) Thu Jul 28 17:11:30 2016 - [info] Thu Jul 28 17:11:30 2016 - [info] * Phase 3.2: Saving Dead Master's Binlog Phase.. Thu Jul 28 17:11:30 2016 - [info] Thu Jul 28 17:11:30 2016 - [info] Fetching dead master's binary logs.. Thu Jul 28 17:11:30 2016 - [info] Executing command on the dead master host_1(host_1:3306): save_binary_logs --command=save --start_file=host_1_name.000004 --start_pos=154 --binlog_dir=/data/mysql.bin --output_file=/var/log/masterha/app1/saved_master_binlog_from_host_1_3306_20160728171130.binlog --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 Creating /var/log/masterha/app1 if not exists.. ok. Concat binary/relay logs from host_1_name.000004 pos 154 to host_1_name.000004 EOF into /var/log/masterha/app1/saved_master_binlog_from_host_1_3306_20160728171130.binlog .. Binlog Checksum enabled Dumping binlog format description event, from position 0 to 154.. ok. No need to dump effective binlog data from /data/mysql.bin/host_1_name.000004 (pos starts 154, filesize 154). Skipping. Binlog Checksum enabled /var/log/masterha/app1/saved_master_binlog_from_host_1_3306_20160728171130.binlog has no effective data events. Event not exists. Thu Jul 28 17:11:31 2016 - [info] Additional events were not found from the orig master. No need to save. Thu Jul 28 17:11:31 2016 - [info] Thu Jul 28 17:11:31 2016 - [info] * Phase 3.3: Determining New Master Phase.. Thu Jul 28 17:11:31 2016 - [info] Thu Jul 28 17:11:31 2016 - [info] Finding the latest slave that has all relay logs for recovering other slaves.. Thu Jul 28 17:11:31 2016 - [info] All slaves received relay logs to the same position. No need to resync each other. Thu Jul 28 17:11:31 2016 - [info] Searching new master from slaves.. Thu Jul 28 17:11:31 2016 - [info] Candidate masters from the configuration file: Thu Jul 28 17:11:31 2016 - [info] host_2(host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled Thu Jul 28 17:11:31 2016 - [info] Replicating from host_1(host_1:3306) Thu Jul 28 17:11:31 2016 - [info] Primary candidate for the new Master (candidate_master is set) Thu Jul 28 17:11:31 2016 - [info] Non-candidate masters: Thu Jul 28 17:11:31 2016 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled Thu Jul 28 17:11:31 2016 - [info] Replicating from host_1(host_1:3306) Thu Jul 28 17:11:31 2016 - [info] Not candidate for the new Master (no_master is set) Thu Jul 28 17:11:31 2016 - [info] Searching from candidate_master slaves which have received the latest relay log events.. Thu Jul 28 17:11:31 2016 - [info] New master is host_2(host_2:3306) Thu Jul 28 17:11:31 2016 - [info] Starting master failover.. Thu Jul 28 17:11:31 2016 - [info] From: host_1(host_1:3306) (current master) +--host_2(host_2:3306) +--host_3(host_3:3306) To: host_2(host_2:3306) (new master) +--host_3(host_3:3306) Thu Jul 28 17:11:31 2016 - [info] Thu Jul 28 17:11:31 2016 - [info] * Phase 3.3: New Master Diff Log Generation Phase.. Thu Jul 28 17:11:31 2016 - [info] Thu Jul 28 17:11:31 2016 - [info] This server has all relay logs. No need to generate diff files from the latest slave. Thu Jul 28 17:11:31 2016 - [info] Thu Jul 28 17:11:31 2016 - [info] * Phase 3.4: Master Log Apply Phase.. Thu Jul 28 17:11:31 2016 - [info] Thu Jul 28 17:11:31 2016 - [info] *NOTICE: If any error happens from this phase, manual recovery is needed. Thu Jul 28 17:11:31 2016 - [info] Starting recovery on host_2(host_2:3306).. Thu Jul 28 17:11:31 2016 - [info] This server has all relay logs. Waiting all logs to be applied.. Thu Jul 28 17:11:31 2016 - [info] done. Thu Jul 28 17:11:31 2016 - [info] All relay logs were successfully applied. Thu Jul 28 17:11:31 2016 - [info] Getting new master's binlog name and position.. Thu Jul 28 17:11:31 2016 - [info] host_2_name.000002:294 Thu Jul 28 17:11:31 2016 - [info] All other slaves should start replication from here. Statement should be: CHANGE MASTER TO MASTER_HOST='host_2', MASTER_PORT=3306, MASTER_LOG_FILE='host_2_name.000002', MASTER_LOG_POS=294, MASTER_USER='repl', MASTER_PASSWORD='xxx'; Thu Jul 28 17:11:31 2016 - [info] Executing master IP activate script: Thu Jul 28 17:11:31 2016 - [info] /home/mysql/MHA/masterha/master_ip_failover --command=start --ssh_user=root --orig_master_host=host_1 --orig_master_ip=host_1 --orig_master_port=3306 --new_master_host=host_2 --new_master_ip=host_2 --new_master_port=3306 --new_master_user='dba' --new_master_password='dba' Set read_only=0 on the new master. No need to Creating app user on the new master.. Thu Jul 28 17:11:31 2016 - [info] OK. Thu Jul 28 17:11:31 2016 - [info] ** Finished master recovery successfully. Thu Jul 28 17:11:31 2016 - [info] * Phase 3: Master Recovery Phase completed. Thu Jul 28 17:11:31 2016 - [info] Thu Jul 28 17:11:31 2016 - [info] * Phase 4: Slaves Recovery Phase.. Thu Jul 28 17:11:31 2016 - [info] Thu Jul 28 17:11:31 2016 - [info] * Phase 4.1: Starting Parallel Slave Diff Log Generation Phase.. Thu Jul 28 17:11:31 2016 - [info] Thu Jul 28 17:11:31 2016 - [info] -- Slave diff file generation on host host_3(host_3:3306) started, pid: 22427. Check tmp log /var/log/masterha/app1/host_3_3306_20160728171130.log if it takes time.. Thu Jul 28 17:11:31 2016 - [info] Thu Jul 28 17:11:31 2016 - [info] Log messages from host_3 ... Thu Jul 28 17:11:31 2016 - [info] Thu Jul 28 17:11:31 2016 - [info] This server has all relay logs. No need to generate diff files from the latest slave. Thu Jul 28 17:11:31 2016 - [info] End of log messages from host_3. Thu Jul 28 17:11:31 2016 - [info] -- host_3(host_3:3306) has the latest relay log events. Thu Jul 28 17:11:31 2016 - [info] Generating relay diff files from the latest slave succeeded. Thu Jul 28 17:11:31 2016 - [info] Thu Jul 28 17:11:31 2016 - [info] * Phase 4.2: Starting Parallel Slave Log Apply Phase.. Thu Jul 28 17:11:31 2016 - [info] Thu Jul 28 17:11:31 2016 - [info] -- Slave recovery on host host_3(host_3:3306) started, pid: 22429. Check tmp log /var/log/masterha/app1/host_3_3306_20160728171130.log if it takes time.. Thu Jul 28 17:11:31 2016 - [info] Thu Jul 28 17:11:31 2016 - [info] Log messages from host_3 ... Thu Jul 28 17:11:31 2016 - [info] Thu Jul 28 17:11:31 2016 - [info] Starting recovery on host_3(host_3:3306).. Thu Jul 28 17:11:31 2016 - [info] This server has all relay logs. Waiting all logs to be applied.. Thu Jul 28 17:11:31 2016 - [info] done. Thu Jul 28 17:11:31 2016 - [info] All relay logs were successfully applied. Thu Jul 28 17:11:31 2016 - [info] Resetting slave host_3(host_3:3306) and starting replication from the new master host_2(host_2:3306).. Thu Jul 28 17:11:31 2016 - [info] Executed CHANGE MASTER. Thu Jul 28 17:11:31 2016 - [info] Slave started. Thu Jul 28 17:11:31 2016 - [info] End of log messages from host_3. Thu Jul 28 17:11:31 2016 - [info] -- Slave recovery on host host_3(host_3:3306) succeeded. Thu Jul 28 17:11:31 2016 - [info] All new slave servers recovered successfully. Thu Jul 28 17:11:31 2016 - [info] Thu Jul 28 17:11:31 2016 - [info] * Phase 5: New master cleanup phase.. Thu Jul 28 17:11:31 2016 - [info] Thu Jul 28 17:11:31 2016 - [info] Resetting slave info on the new master.. Thu Jul 28 17:11:31 2016 - [info] host_2: Resetting slave info succeeded. Thu Jul 28 17:11:31 2016 - [info] Master failover to host_2(host_2:3306) completed successfully. Thu Jul 28 17:11:31 2016 - [info] ----- Failover Report ----- app1: MySQL Master failover host_1(host_1:3306) to host_2(host_2:3306) succeeded Master host_1(host_1:3306) is down! Check MHA Manager logs at host_manager_name:/var/log/masterha/app1/app1.log for details. Started automated(non-interactive) failover. Invalidated master IP address on host_1(host_1:3306) The latest slave host_2(host_2:3306) has all relay logs for recovery. Selected host_2(host_2:3306) as a new master. host_2(host_2:3306): OK: Applying all logs succeeded. host_2(host_2:3306): OK: Activated master IP address. host_3(host_3:3306): This host has the latest relay log events. Generating relay diff files from the latest slave succeeded. host_3(host_3:3306): OK: Applying all logs succeeded. Slave started, replicating from host_2(host_2:3306) host_2(host_2:3306): Resetting slave info succeeded. Master failover to host_2(host_2:3306) completed successfully. |
4.6 [用例测试] MySQL master too many connection,无权限,响应慢
- step1: 检查各种MHA manager的各种环境
1 2 3 4 5 6 7 8 9 10 |
* 检查SSH masterha_check_ssh --conf=/etc/app1.cnf * 检查复制(如果是非GTID模式,slave如果没有设置relay_log_purge=0,那么会有警告) masterha_check_repl --conf=/etc/app1.cnf Thu Jul 28 14:03:33 2016 - [warning] relay_log_purge=0 is not set on slave host_2(host_2:3306). Thu Jul 28 14:03:33 2016 - [warning] relay_log_purge=0 is not set on slave host_3(host_3:3306). * 检查mha状态 masterha_check_status --conf=/etc/app1.cnf * 开启mha 监控 nohup masterha_manager --conf=/etc/app1.cnf & |
- step2: 手工制造master too many connection
1 2 3 4 |
mysqlslap --concurrency=4000 --iterations=10 --query='insert into tb(id_2) select (1);' --debug-info -hhost_1 -uxx -pxx --create-schema=lc mysqlslap: Error when connecting to server: 1040 Too many connections mysqlslap: Error when connecting to server: 1040 Too many connections mysqlslap: Error when connecting to server: 1040 Too many connections |
- step3: 检查是否发生failover
1 2 3 4 5 6 |
* 1. 手动开启另一个session,链接报错tmc。 但是并没有发生failover,说明master 还活着 * 2. 检查了MHA的health check机制,原来是长链接 DIGEST_TEXT: SELECT ? AS VALUE COUNT_STAR: 46 FIRST_SEEN: 2016-08-02 11:28:00 LAST_SEEN: 2016-08-02 11:30:37 |
4.7 [用例测试] MySQL master 服务器 down掉,且候选master落后的最多, 是否自动failover,知否可以成功的做日志补偿
- step1: 检查各种MHA manager的各种环境
1 2 3 4 5 6 7 8 9 10 |
* 检查SSH masterha_check_ssh --conf=/etc/app1.cnf * 检查复制(如果是非GTID模式,slave如果没有设置relay_log_purge=0,那么会有警告) masterha_check_repl --conf=/etc/app1.cnf Thu Jul 28 14:03:33 2016 - [warning] relay_log_purge=0 is not set on slave host_2(host_2:3306). Thu Jul 28 14:03:33 2016 - [warning] relay_log_purge=0 is not set on slave host_3(host_3:3306). * 检查mha状态 masterha_check_status --conf=/etc/app1.cnf * 开启mha 监控 nohup masterha_manager --conf=/etc/app1.cnf & |
- step2: 手动制造候选master落后的情况
1 |
候选master> stop slave; sleep 600s ; start slave; |
- step3: 手动制造master 服务器挂掉的情况
1 |
master > reboot |
- step4: 观察MHA 日志
1 2 3 |
Tue Aug 2 14:33:34 2016 - [warning] Slave host_2(host_2:3306) SQL Thread delays too much. Latest log file:host_1_name.000002:576864, Current log file:host_1_name.000001:39731438. This server is not selected as a new master because recovery will take long time. Tue Aug 2 14:33:34 2016 - [error][/usr/share/perl5/vendor_perl/MHA/MasterFailover.pm, ln970] None of existing slaves matches as a new master. Maybe preferred node is misconfigured or all slaves are too far behind. Tue Aug 2 14:33:34 2016 - [error][/usr/share/perl5/vendor_perl/MHA/ManagerUtil.pm, ln177] Got ERROR: at /usr/bin/masterha_master_switch line 53 |
- step5:重新修正配置文件
由于要指定特定的slave为候选master,而此slave还落后非常多
必须在每组服务器上都加上check_repl_delay=0
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
vi /etc/app1.cnf [server default] remote_workdir=/var/log/masterha/app1 manager_workdir=/var/log/masterha/app1 manager_log=/var/log/masterha/app1/app1.log [server1] hostname=host_1 candidate_master=1 check_repl_delay=0 [server2] hostname=host_2 candidate_master=1 check_repl_delay=0 [server3] hostname=host_3 no_master=1 check_repl_delay=0 |
- step6:用手工failover再来模拟
1 |
masterha_master_switch --master_state=dead --conf=/etc/app1.cnf --dead_master_host=host_1 --interactive=1 --ignore_last_failover --new_master_host=host_2 |
- step7: 观察MHA 日志
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 |
Tue Aug 2 14:38:55 2016 - [info] Reading default configuration from /etc/masterha_default.cnf.. Tue Aug 2 14:38:55 2016 - [info] Reading application default configuration from /etc/app1.cnf.. Tue Aug 2 14:38:55 2016 - [info] Reading server configuration from /etc/app1.cnf.. Tue Aug 2 14:38:55 2016 - [info] MHA::MasterFailover version 0.56. Tue Aug 2 14:38:55 2016 - [info] Starting master failover. Tue Aug 2 14:38:55 2016 - [info] Tue Aug 2 14:38:55 2016 - [info] * Phase 1: Configuration Check Phase.. Tue Aug 2 14:38:55 2016 - [info] Tue Aug 2 14:38:55 2016 - [info] GTID failover mode = 0 Tue Aug 2 14:38:55 2016 - [info] Dead Servers: Tue Aug 2 14:38:55 2016 - [info] host_1(host_1:3306) Tue Aug 2 14:38:55 2016 - [info] Checking master reachability via MySQL(double check)... Tue Aug 2 14:38:55 2016 - [info] ok. Tue Aug 2 14:38:55 2016 - [info] Alive Servers: Tue Aug 2 14:38:55 2016 - [info] host_2(host_2:3306) Tue Aug 2 14:38:55 2016 - [info] host_3(host_3:3306) Tue Aug 2 14:38:55 2016 - [info] Alive Slaves: Tue Aug 2 14:38:55 2016 - [info] host_2(host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled Tue Aug 2 14:38:55 2016 - [info] Replicating from host_1(host_1:3306) Tue Aug 2 14:38:55 2016 - [info] Primary candidate for the new Master (candidate_master is set) Tue Aug 2 14:38:55 2016 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled Tue Aug 2 14:38:55 2016 - [info] Replicating from host_1(host_1:3306) Tue Aug 2 14:38:55 2016 - [info] Not candidate for the new Master (no_master is set) Master host_1(host_1:3306) is dead. Proceed? (yes/NO): yes Tue Aug 2 14:38:57 2016 - [info] Starting Non-GTID based failover. Tue Aug 2 14:38:57 2016 - [info] Tue Aug 2 14:38:57 2016 - [info] ** Phase 1: Configuration Check Phase completed. Tue Aug 2 14:38:57 2016 - [info] Tue Aug 2 14:38:57 2016 - [info] * Phase 2: Dead Master Shutdown Phase.. Tue Aug 2 14:38:57 2016 - [info] Tue Aug 2 14:38:58 2016 - [info] HealthCheck: SSH to host_1 is reachable. Tue Aug 2 14:38:58 2016 - [info] Forcing shutdown so that applications never connect to the current master.. Tue Aug 2 14:38:58 2016 - [info] Executing master IP deactivation script: Tue Aug 2 14:38:58 2016 - [info] /home/mysql/MHA/masterha/master_ip_failover --orig_master_host=host_1 --orig_master_ip=host_1 --orig_master_port=3306 --command=stopssh --ssh_user=root Tue Aug 2 14:38:58 2016 - [info] done. Tue Aug 2 14:38:58 2016 - [warning] shutdown_script is not set. Skipping explicit shutting down of the dead master. Tue Aug 2 14:38:58 2016 - [info] * Phase 2: Dead Master Shutdown Phase completed. Tue Aug 2 14:38:58 2016 - [info] Tue Aug 2 14:38:58 2016 - [info] * Phase 3: Master Recovery Phase.. Tue Aug 2 14:38:58 2016 - [info] Tue Aug 2 14:38:58 2016 - [info] * Phase 3.1: Getting Latest Slaves Phase.. Tue Aug 2 14:38:58 2016 - [info] Tue Aug 2 14:38:58 2016 - [info] The latest binary log file/position on all slaves is host_1_name.000002:576864 Tue Aug 2 14:38:58 2016 - [info] Latest slaves (Slaves that received relay log files to the latest): Tue Aug 2 14:38:58 2016 - [info] host_3(host_3:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled Tue Aug 2 14:38:58 2016 - [info] Replicating from host_1(host_1:3306) Tue Aug 2 14:38:58 2016 - [info] Not candidate for the new Master (no_master is set) Tue Aug 2 14:38:58 2016 - [info] The oldest binary log file/position on all slaves is host_1_name.000001:39731619 Tue Aug 2 14:38:58 2016 - [info] Oldest slaves: Tue Aug 2 14:38:58 2016 - [info] host_2(host_2:3306) Version=5.7.13-log (oldest major version between slaves) log-bin:enabled Tue Aug 2 14:38:58 2016 - [info] Replicating from host_1(host_1:3306) Tue Aug 2 14:38:58 2016 - [info] Primary candidate for the new Master (candidate_master is set) Tue Aug 2 14:38:58 2016 - [info] Tue Aug 2 14:38:58 2016 - [info] * Phase 3.2: Saving Dead Master's Binlog Phase.. Tue Aug 2 14:38:58 2016 - [info] Tue Aug 2 14:38:58 2016 - [info] Fetching dead master's binary logs.. Tue Aug 2 14:38:58 2016 - [info] Executing command on the dead master host_1(host_1:3306): save_binary_logs --command=save --start_file=host_1_name.000002 --start_pos=576864 --binlog_dir=/data/mysql.bin --output_file=/var/log/masterha/app1/saved_master_binlog_from_host_1_3306_20160802143855.binlog --handle_raw_binlog=1 --disable_log_bin=0 --manager_version=0.56 Creating /var/log/masterha/app1 if not exists.. ok. Concat binary/relay logs from host_1_name.000002 pos 576864 to host_1_name.000002 EOF into /var/log/masterha/app1/saved_master_binlog_from_host_1_3306_20160802143855.binlog .. Binlog Checksum enabled Dumping binlog format description event, from position 0 to 154.. ok. Dumping effective binlog data from /data/mysql.bin/host_1_name.000002 position 576864 to tail(577435).. ok. Binlog Checksum enabled Concat succeeded. saved_master_binlog_from_host_1_3306_20160802143855.binlog 100% 725 0.7KB/s 00:00 Tue Aug 2 14:38:59 2016 - [info] scp from root@host_1:/var/log/masterha/app1/saved_master_binlog_from_host_1_3306_20160802143855.binlog to local:/var/log/masterha/app1/saved_master_binlog_from_host_1_3306_20160802143855.binlog succeeded. Tue Aug 2 14:38:59 2016 - [info] HealthCheck: SSH to host_2 is reachable. Tue Aug |