刚处理完“挖矿”事件,在做最后一个MySQL
NBU备份的时候,发现从库有问题,好奇的是怎么主从状态异常没有告警呢?先不管这么多了,处理了这个问题再完善告警内容。

年后回来查看mysql运行状况与备份情况,登录mysql从库查看主从同步状态

介绍

经常会在错误日志中看到这个报错,首先我们可以从show
GLOBAL status like
‘%Aborte%’;里面看看两种错误连接的数量,以下是复制官方文档的解释。
  

 

一、错误信息

 1 mysql> show slave status\G;
 2 *************************** 1. row ***************************
 3                Slave_IO_State: 
 4                   Master_Host: 101.200.*.*
 5                   Master_User: backup
 6                   Master_Port: 3306
 7                 Connect_Retry: 60
 8               Master_Log_File: master-bin.000113
 9           Read_Master_Log_Pos: 276925387
10                Relay_Log_File: mysql-relay.000001
11                 Relay_Log_Pos: 4
12         Relay_Master_Log_File: master-bin.000113
13              Slave_IO_Running: No
14             Slave_SQL_Running: Yes
15               Replicate_Do_DB: 
16           Replicate_Ignore_DB: 
17            Replicate_Do_Table: 
18        Replicate_Ignore_Table: 
19       Replicate_Wild_Do_Table: 
20   Replicate_Wild_Ignore_Table: 
21                    Last_Errno: 0
22                    Last_Error: 
23                  Skip_Counter: 0
24           Exec_Master_Log_Pos: 276925387
25               Relay_Log_Space: 120
26               Until_Condition: None
27                Until_Log_File: 
28                 Until_Log_Pos: 0
29            Master_SSL_Allowed: No
30            Master_SSL_CA_File: 
31            Master_SSL_CA_Path: 
32               Master_SSL_Cert: 
33             Master_SSL_Cipher: 
34                Master_SSL_Key: 
35         Seconds_Behind_Master: NULL
36 Master_SSL_Verify_Server_Cert: No
37                 Last_IO_Errno: 1236
38                 Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Could not find first log file name in binary log index file'
39                Last_SQL_Errno: 0
40                Last_SQL_Error: 
41   Replicate_Ignore_Server_Ids: 
42              Master_Server_Id: 21
43                   Master_UUID: e4a43da7-5b58-11e5-a12f-00163e003632
44              Master_Info_File: /home/data/mysql/master.info
45                     SQL_Delay: 0
46           SQL_Remaining_Delay: NULL
47       Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
48            Master_Retry_Count: 86400
49                   Master_Bind: 
50       Last_IO_Error_Timestamp: 170204 10:48:06
51      Last_SQL_Error_Timestamp: 
52                Master_SSL_Crl: 
53            Master_SSL_Crlpath: 
54            Retrieved_Gtid_Set: 
55             Executed_Gtid_Set: 
56                 Auto_Position: 0
57 1 row in set (0.00 sec)
58 
59 ERROR: 
60 No query specified

Aborted_connects 

If a client is unable even to connect, the server increments
the Aborted_connects status
variable. Unsuccessful connection attempts can occur for the following
reasons:

  • A client attempts to access a database but has no privileges for it.

  • A client uses an incorrect password.

  • A connection packet does not contain the right information.

  • It takes more
    than connect_timeout seconds
    to obtain a connect packet. See Section 5.1.4, “Server System
    Variables”.

If these kinds of things happen, it might indicate that someone is
trying to break into your server! If the general query log is enabled,
messages for these types of problems are logged to it.

 

对于Aborted_connects错误的大概原因有:

1.客户端连接一个没有授权的数据库

2.密码错误

3.连接包信息错误

4.连接超时(默认10s)

从库show slave status \G看到的错误信息如下:

突然之间Slave_IO_Running: 状态变成NO了

Aborted_clients

If a client successfully connects but later disconnects improperly or is
terminated, the server increments
the Aborted_clients status
variable, and logs an Aborted connection message
to the error log. The cause can be any of the following:

  • The client program did not
    call mysql_close() before
    exiting.

  • The client had been sleeping more
    than wait_timeout or interactive_timeout seconds
    without issuing any requests to the server. See Section 5.1.4,
    “Server System
    Variables”.

  • The client program ended abruptly in the middle of a data transfer.

Other reasons for problems with aborted connections or aborted clients:

  • The max_allowed_packet variable
    value is too small or queries require more memory than you have
    allocated for mysqld.
    See Section B.5.2.10, “Packet Too
    Large”.

  • Use of Ethernet protocol with Linux, both half and full duplex. Some
    Linux Ethernet drivers have this bug. You should test for this bug
    by transferring a huge file using FTP between the client and server
    machines. If a transfer goes in burst-pause-burst-pause mode, you
    are experiencing a Linux duplex syndrome. Switch the duplex mode for
    both your network card and hub/switch to either full duplex or to
    half duplex and test the results to determine the best setting.

  • A problem with the thread library that causes interrupts on reads.

  • Badly configured TCP/IP.

  • Faulty Ethernets, hubs, switches, cables, and so forth. This can be
    diagnosed properly only by replacing hardware.

 

对于 Aborted_clients的原因大概有

1.客户端没有执行mysql_close()关闭

2.由于连接一直没有关闭导致时间超过wait_timeout or interactive_timeout这两个变量的值(这两个变量的超时时间是8小时)

3.客户端在程序执行过程中结束

4.max_allowed_packet包设的过小

5.网络原因

6.线程bug等

Slave_IO_Running: No
Slave_SQL_Running: Yes
Last_IO_Errno: 1236
Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Client requested master to start replication from impossible position; the first event 'mysql-bin.000081' at 480141113, the last event read from './mysql-bin.000081' at 4, the last byte read from './mysql-bin.000081' at 4.'

 

总结

 

 

备注:

    作者:pursuer.chen

    博客:http://www.cnblogs.com/chenmh

本站点所有随笔都是原创,欢迎大家转载;但转载时必须注明文章来源,且在文章开头明显处给明链接。

《欢迎交流讨论》

二、错误原因

解决方法。

这里看到从库的io_thread已经终止,错误编号是1236,具体是由于读取主库的binlog日志位置(the first event ‘mysql-bin.000081’ at 480141113, the
last event read from ‘./mysql-bin.000081’ at
4)不对导致主从失败建立失败。

首先在从库上执行

三、解决方案

stop slave;

1.检查从库状态以及读取、执行的binlog信息

 

mysql> show slave status \G
*************************** 1. row ***************************
               Slave_IO_State: 
                  Master_Host: xx.xx.xx.xx
                  Master_User: username
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000081
          Read_Master_Log_Pos: 480141113
               Relay_Log_File: mysql9017-relay-bin.000163
                Relay_Log_Pos: 480141259
        Relay_Master_Log_File: mysql-bin.000081
             Slave_IO_Running: No
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 480141113
              Relay_Log_Space: 480141462
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 1236
                Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Client requested master to start replication from impossible position; the first event 'mysql-bin.000081' at 480141113, the last event read from './mysql-bin.000081' at 4, the last byte read from './mysql-bin.000081' at 4.'
               Last_SQL_Errno: 0
               Last_SQL_Error: 
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 17
1 row in set (0.00 sec)

查看主库master状态

2.查看主库的binlog内容

 1 mysql> show master status\G;
 2 *************************** 1. row ***************************
 3              File: mysql-bin.000113
 4          Position: 276925387
 5      Binlog_Do_DB: 
 6  Binlog_Ignore_DB: 
 7 Executed_Gtid_Set: 
 8 1 row in set (0.00 sec)
 9 
10 ERROR: 
11 No query specified
12 
13 mysql> flush logs;
14 Query OK, 0 rows affected (0.11 sec)

[backup]# mysqlbinlog  mysql-bin.000081 >mysql-bin.log

刷新binlog日志

图片 1

flush logs;

看到主库binlog日志mysql-bin.000081最大的pos为480140557,但从库要读取的是’mysql-bin.000081′ at
480141113,显然从库要读的pos值比主库本身存在的pos值大,导致读取不到,进而失败。

刷新后的日志会+1

可通过下面语句查看binlog的pos信息和日志内容
mysql> show binlog events in  ‘mysql-bin.000081’ from 480140557 limit
10;       
Empty set (0.04 sec)
3.更改从库的同步位置,完成数据重新同步

例如上面的 File: mysql-bin.000113 会变成 File: mysql-bin.000114

 主库:

再次查看master状态

mysqlbinlog  mysql-bin.000082  |more

 1 mysql> show master status\G;
 2 *************************** 1. row ***************************
 3              File: mysql-bin.000114
 4          Position: 120
 5      Binlog_Do_DB: 
 6  Binlog_Ignore_DB: 
 7 Executed_Gtid_Set: 
 8 1 row in set (0.00 sec)
 9 
10 ERROR: 
11 No query specified

从库:

然后就不需要在操作master,切换到从库

change master to master_host='xx.xx.xx.xx',master_user='username',master_port=3306,master_password='password',master_log_file='mysql-bin.000082',master_log_pos=4;

输入CHANGE MASTER TO
MASTER_LOG_FILE=’mysql-bin.000114′,MASTER_LOG_POS=120;

start slave;

执行start slave;

show slave status \G

查看从库状态

主从同步正常

 1 mysql> show slave status\G;
 2 *************************** 1. row ***************************
 3                Slave_IO_State: Waiting for master to send event
 4                   Master_Host: 101.200.*.*
 5                   Master_User: backup
 6                   Master_Port: 3306
 7                 Connect_Retry: 60
 8               Master_Log_File: mysql-bin.000114
 9           Read_Master_Log_Pos: 11314
10                Relay_Log_File: mysql-relay.000002
11                 Relay_Log_Pos: 11477
12         Relay_Master_Log_File: mysql-bin.000114
13              Slave_IO_Running: Yes
14             Slave_SQL_Running: Yes
15               Replicate_Do_DB: 
16           Replicate_Ignore_DB: 
17            Replicate_Do_Table: 
18        Replicate_Ignore_Table: 
19       Replicate_Wild_Do_Table: 
20   Replicate_Wild_Ignore_Table: 
21                    Last_Errno: 0
22                    Last_Error: 
23                  Skip_Counter: 0
24           Exec_Master_Log_Pos: 11314
25               Relay_Log_Space: 11646
26               Until_Condition: None
27                Until_Log_File: 
28                 Until_Log_Pos: 0
29            Master_SSL_Allowed: No
30            Master_SSL_CA_File: 
31            Master_SSL_CA_Path: 
32               Master_SSL_Cert: 
33             Master_SSL_Cipher: 
34                Master_SSL_Key: 
35         Seconds_Behind_Master: 0
36 Master_SSL_Verify_Server_Cert: No
37                 Last_IO_Errno: 0
38                 Last_IO_Error: 
39                Last_SQL_Errno: 0
40                Last_SQL_Error: 
41   Replicate_Ignore_Server_Ids: 
42              Master_Server_Id: 21
43                   Master_UUID: e4a43da7-5b58-11e5-a12f-00163e003632
44              Master_Info_File: /home/data/mysql/master.info
45                     SQL_Delay: 0
46           SQL_Remaining_Delay: NULL
47       Slave_SQL_Running_State: Slave has read all relay log; waiting for the slave I/O thread to update it
48            Master_Retry_Count: 86400
49                   Master_Bind: 
50       Last_IO_Error_Timestamp: 
51      Last_SQL_Error_Timestamp: 
52                Master_SSL_Crl: 
53            Master_SSL_Crlpath: 
54            Retrieved_Gtid_Set: 
55             Executed_Gtid_Set: 
56                 Auto_Position: 0
57 1 row in set (0.00 sec)
58 
59 ERROR: 
60 No query specified

图片 2

 

4.主库参数改进

 导致这个原因很大程度上是由于主从在同步的过程中,主库异常断电,导致内存数据传输到从库但没有提交到binlog日志,即主库
sync_binlog设置可能有问题,在主库检查参数设置:

mysql> show global variables like '%sync_binlog%';                       
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| sync_binlog   | 0     |
+---------------+-------+
1 row in set (0.00 sec)

果然其值是 0,不主动同步binlog
cache的数据到磁盘,而依赖操作系统本身不定期把文件内容 flush 到磁盘。设为
1 最安全,在每个语句或事务后同步一次 binary
log,即使在崩溃时也最多丢失一个语句或事务的日志,但因此也最慢。这里设置为0,断电的情况下导致binlog
cache数据丢失没有写入主库的binlog,但binlog信息已同步至从库。这种情况容易导致主从数据不一致,所以即使恢复主从数据后,依旧要通过主从数据对比校验数据的一致性。

mysql> set global sync_binlog=1;
Query OK, 0 rows affected (0.00 sec)

更改配置文件my.cnf设置sync_binlog=1

5.主从数据校验

 pt-table-checksum h=master_ipaddr,u=username,p=’password’,P=mysql_port
–nocheck-binlog-format –recursion-method=hosts

pt-table-checksum h=master_ipaddr,u=username,p='password',P=mysql_port --nocheck-binlog-format --recursion-method=hosts  
Checking if all tables can be checksummed ...
Starting checksum ...
            TS ERRORS  DIFFS     ROWS  CHUNKS SKIPPED    TIME TABLE
08-03T17:49:29      0      0      595       1       0   0.186 user.hole

其中–recursion-method有几种方式查看从库信息,这里采用的是hosts方式,需要在从库加入如下参数,方可在主库执行show
slave hosts查看从库的信息

report_host=slave_ip

 

report_port=slave_port

METHOD       USES
===========  =============================================
processlist  SHOW PROCESSLIST
hosts        SHOW SLAVE HOSTS
cluster      SHOW STATUS LIKE 'wsrep\_incoming\_addresses'
dsn=DSN      DSNs from a table
none         Do not find slaves

 

6.innodb_flush_log_at_trx_commit参数扩展

innodb_flush_log_at_trx_commit 参数指定了 InnoDB 在事务提交后的日志写入频率。这么说其实并不严谨,且看其不同取值的意义和表现。

    当 innodb_flush_log_at_trx_commit 取值为 0 的时候,log buffer 会 每秒写入到日志文件并刷写(flush)到磁盘。但每次事务提交不会有任何影响,也就是 log buffer 的刷写操作和事务提交操作没有关系。在这种情况下,MySQL性能最好,但如果 mysqld 进程崩溃,通常会导致最后 1s 的日志丢失。
    当取值为 1 时,每次事务提交时,log buffer 会被写入到日志文件并刷写到磁盘。这也是默认值。这是最安全的配置,但由于每次事务都需要进行磁盘I/O,所以也最慢。
    当取值为 2 时,每次事务提交会写入日志文件,但并不会立即刷写到磁盘,日志文件会每秒刷写一次到磁盘。这时如果 mysqld 进程崩溃,由于日志已经写入到系统缓存,所以并不会丢失数据;在操作系统崩溃的情况下,通常会导致最后 1s 的日志丢失。

 

发表评论

电子邮件地址不会被公开。 必填项已用*标注

网站地图xml地图