Errors caused by the slave_net_timeout setting - 蒲公英云

Recently, our production MySQL has frequently been logging Note messages like the following:

[Note] While initializing dump thread for slave with UUID <>, found a zombie dump thread with the same UUID. Master is killing the zombie dump thread(49).
[Note] Start binlog_dump to master_thread_id(50) slave_server(530094), pos(, 4)
[Note] Start semi-sync binlog_dump to slave (server_id: 530094), pos(, 4)

Reference: the blog post at http://blog.itpub.net/7728585/viewspace-2659597/ pointed to the answer.

The official MySQL documentation describes slave_net_timeout as follows:

The number of seconds to wait for more data or a heartbeat signal from the source before the replica considers the connection broken, aborts the read, and tries to reconnect. Setting this variable has no immediate effect. The state of the variable applies on all subsequent START SLAVE commands.
The first retry occurs immediately after the timeout. The interval between retries is controlled by the MASTER_CONNECT_RETRY option for the CHANGE MASTER TO statement, and the number of reconnection attempts is limited by the MASTER_RETRY_COUNT option for the CHANGE MASTER TO statement.
The heartbeat interval, which stops the connection timeout occurring in the absence of data if the connection is still good, is controlled by the MASTER_HEARTBEAT_PERIOD option for the CHANGE MASTER TO statement. The heartbeat interval defaults to half the value of slave_net_timeout, and it is recorded in the replica's connection metadata repository and shown in the replication_connection_configuration Performance Schema table. Note that a change to the value or default setting of slave_net_timeout does not automatically change the heartbeat interval, whether that has been set explicitly or is using a previously calculated default. If the connection timeout is changed, you must also issue CHANGE MASTER TO to adjust the heartbeat interval to an appropriate value so that it occurs before the connection timeout.

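In short, slave_net_timeout is how long the replica waits without receiving data or a heartbeat from the master before it treats the replication connection as broken and reconnects. The heartbeat interval is controlled by the MASTER_HEARTBEAT_PERIOD option of the CHANGE MASTER TO statement; if it is not set, it defaults to slave_net_timeout/2. If slave_net_timeout is changed dynamically, the heartbeat interval does not follow automatically: as the documentation above states, it only changes once CHANGE MASTER TO is issued again.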
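
The relationship between the two settings is simple arithmetic, so a small sketch can make it concrete. The Python below is illustrative only: the function names are hypothetical and not part of MySQL, and the values 60 and 20 are made-up examples. It encodes the rule from the documentation quoted above, namely that the default heartbeat is half of slave_net_timeout and that the heartbeat must occur before the timeout.

```python
def default_heartbeat_period(slave_net_timeout: float) -> float:
    """Default MASTER_HEARTBEAT_PERIOD when none is given: half the timeout."""
    return slave_net_timeout / 2


def heartbeat_keeps_connection_alive(heartbeat_period: float,
                                     slave_net_timeout: float) -> bool:
    """True if an idle connection gets a heartbeat before the replica times out.

    A period of 0 disables heartbeats, so it can never keep the link alive.
    """
    return 0 < heartbeat_period < slave_net_timeout


# Heartbeat computed when replication was set up with slave_net_timeout = 60.
old_heartbeat = default_heartbeat_period(60)
print(old_heartbeat)                                        # 30.0
print(heartbeat_keeps_connection_alive(old_heartbeat, 60))  # True
# The timeout is later lowered to 20, but the stored heartbeat is still 30.
print(heartbeat_keeps_connection_alive(old_heartbeat, 20))  # False
```

The last line is the situation to avoid: the stored heartbeat interval is longer than the new timeout, so an idle replica always times out first.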

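The flow is roughly as follows: when the replica establishes the replication connection, it sets the heartbeat interval on the master with a statement of the form SET @master_heartbeat_period= %s. Whenever the master has had nothing to send for one heartbeat interval, it sends a heartbeat to the replica. As long as the replica receives a heartbeat or data within slave_net_timeout, it considers the replication connection healthy.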

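In this incident, slave_net_timeout had been changed manually but replication was not re-established, so the heartbeat interval kept its old value. The replica therefore received no heartbeat within slave_net_timeout, concluded that the connection was unusable, and reconnected repeatedly. Each reconnect starts a new dump thread on the master, which kills the previous one as a zombie and logs the Note messages shown above.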
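
To see why a stale heartbeat interval turns into a reconnect loop, the idle connection can be modelled as a timeline. This is a simplified simulation, not MySQL's actual implementation: it assumes the master sends nothing but heartbeats, that every received heartbeat and every reconnect restarts the replica's timer, and the function name and the numbers (30, 20, 60, 300) are made up for illustration.

```python
def count_reconnects(heartbeat_period: float,
                     net_timeout: float,
                     idle_seconds: float) -> int:
    """Count replica reconnects while the master has no data to send."""
    if net_timeout <= 0:
        raise ValueError("net_timeout must be positive")

    # The heartbeat only helps if it arrives before the replica gives up.
    heartbeat_in_time = 0 < heartbeat_period < net_timeout

    reconnects = 0
    now = 0.0  # time of the last received packet or the last (re)connect
    while True:
        # Next event: either a heartbeat arrives or the read times out.
        now += heartbeat_period if heartbeat_in_time else net_timeout
        if now > idle_seconds:
            return reconnects
        if not heartbeat_in_time:
            reconnects += 1  # timeout fired first: replica reconnects


# Healthy setup: heartbeat every 30 s, timeout 60 s, idle for 5 minutes.
print(count_reconnects(30, 60, 300))  # 0
# Timeout lowered to 20 s while the heartbeat stays at 30 s.
print(count_reconnects(30, 20, 300))  # 15
```

In the second case the replica reconnects every 20 seconds for as long as the master is idle, which matches the steady stream of zombie dump thread notes on the master.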

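Solution: after changing slave_net_timeout manually, re-establish replication with the master (CHANGE MASTER TO), or, more reliably, set the heartbeat interval explicitly with MASTER_HEARTBEAT_PERIOD in the CHANGE MASTER TO step so that it stays below slave_net_timeout.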
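
The fix can be written down as a short sequence of statements to run on the replica. The helper below is a hypothetical convenience that only builds the SQL text and does not connect to any server. It assumes the older statement syntax used in this post (STOP SLAVE, CHANGE MASTER TO, START SLAVE) and, when no heartbeat interval is given, picks half of the timeout. Treat it as a sketch and adjust the values to your environment before running anything.

```python
from typing import List, Optional


def heartbeat_fix_statements(slave_net_timeout: int,
                             heartbeat_period: Optional[float] = None) -> List[str]:
    """Build the statements that change the timeout and realign the heartbeat."""
    if heartbeat_period is None:
        # Same rule MySQL uses for its default: half of the timeout.
        heartbeat_period = slave_net_timeout / 2
    if not 0 < heartbeat_period < slave_net_timeout:
        raise ValueError("heartbeat period must be > 0 and < slave_net_timeout")

    return [
        f"SET GLOBAL slave_net_timeout = {slave_net_timeout};",
        "STOP SLAVE;",
        # Without this step the old heartbeat interval stays in effect.
        f"CHANGE MASTER TO MASTER_HEARTBEAT_PERIOD = {heartbeat_period:g};",
        "START SLAVE;",
        # Verify the interval that is now stored for the connection.
        "SELECT HEARTBEAT_INTERVAL FROM "
        "performance_schema.replication_connection_configuration;",
    ]


for statement in heartbeat_fix_statements(10):
    print(statement)
```

Generating the statements rather than executing them keeps the example free of driver and credential assumptions; the printed SQL can be reviewed and pasted into a client session on the replica.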