TL;DR: MySQL does not deal well with long-running transactions; they will slow down the box. That’s okay as long as you are basically alone on your box, but if you aren’t, the others will hate you.
The database machine ‘somehierarchy-02’ in a general-purpose load balancer pool for somehierarchy had replication delay.
It’s a MySQL replica and receives the same write workload as all the other boxen in that pool. Yet somehierarchy-03 is fine, while somehierarchy-02 is not. Both machines have comparable hardware: -02 and -03 are both Dell M630s with 128 GB of memory and two SSDs. They should behave identically, yet one runs from memory while the other is reading 40 MB/s from disk.
Machine somehierarchy-02 reading up to 40 MB/s from disk. The sister box runs from memory.
The objective was to find the root cause for this, as it was causing replication delay and generally making people unhappy.
Together with a colleague, we established that both boxes replicate from the same master and that they receive identical binlogs. We did indeed find the same set of GTIDs and a similar number of binlog files, and some Ghetto statistics (`mysqlbinlog -v binlog... | egrep -i '(insert|update|delete)' | awk ... | sort | uniq -c`) on the active tables per time slot also turned up nothing unusual.
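The details of that pipeline are elided above; as an illustration of the counting step, here is a sketch run against made-up sample lines standing in for real `mysqlbinlog -v` output (whose row events appear as `### INSERT INTO ...` pseudo-SQL comments):

```shell
# Simulated "mysqlbinlog -v" output; in reality you would pipe the
# decoded binlog itself through the counting pipeline below.
sample='### INSERT INTO `somedb`.`MessageLog`
### UPDATE `somedb`.`MessageLog`
### INSERT INTO `somedb`.`MessageLog`
### DELETE FROM `somedb`.`Sessions`'

# Count DML statement types per table.
printf '%s\n' "$sample" \
  | egrep -i '(insert|update|delete)' \
  | awk '{print $2, $NF}' \
  | sort | uniq -c
```

This yields a count per (statement type, table) pair, which is enough to spot an unusual write pattern at a glance.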
We then did a `mysqldump --no-data schemaname` to validate that the schemas are identical. They indeed were: same tables, same table structures, same indexes.
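A sketch of such a comparison (the file names and the AUTO_INCREMENT normalization are my assumptions; in reality the two dumps come from running `mysqldump --no-data` against each box, simulated here with two tiny dump files):

```shell
# Simulated schema-only dumps from the two replicas. In production these
# would come from: mysqldump --no-data -h <host> schemaname > schema-NN.sql
cat > schema-02.sql <<'EOF'
CREATE TABLE `MessageLog` (
  `id` bigint NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=540000001;
EOF
cat > schema-03.sql <<'EOF'
CREATE TABLE `MessageLog` (
  `id` bigint NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=540000417;
EOF

# AUTO_INCREMENT counters legitimately differ between replicas, so strip
# them before diffing. No diff output means the schemas are identical.
sed 's/ AUTO_INCREMENT=[0-9]*//' schema-02.sql > schema-02.norm
sed 's/ AUTO_INCREMENT=[0-9]*//' schema-03.sql > schema-03.norm
diff schema-02.norm schema-03.norm && echo "schemas identical"
```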
It’s reads, coming from a client
We can therefore rule out anything related to replication or disk writes. The difference must be in the reads; that is, the two boxes receive vastly different workloads from clients.
Comparing the active `PROCESSLIST` between the two boxes did not yield results, though. Checking back, we found that the problem had gone away while we were looking. So whatever we were looking for would not be in processlists, only in logs.
So, do we have logs?
log_processlist is still a thing
Turns out: we do. The tool log_processlist is older than time itself; it goes back to before 2007. Simon Mudd and I once wrote it to have a flight recorder for crashed machines.
Basically, it dumps data from the operating system and the various MySQL tables into a directory, in which we have a weekly ring-buffer-like directory structure:
When a box crashes or misbehaves, as long as you have access to this subtree, you can go there and inspect the files to see what the box did just before it crashed. So we go into the subdirectory for Monday and look into some files, for example the 10:20 recordings:
In the 10_20.xz we find the MySQL processlist and a few other things, in the 10_20.unix.xz file we find the UNIX processlist, system memory, and other system information, and in the 10_20.innodb file there are various InnoDB-related table dumps from that point in time. There is also a dump of the entire sys set of views every 15 minutes, so 10_21.sys.gz has that.
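To make the layout concrete, here is a hypothetical sketch of such a ring buffer (the actual base path is site-specific and the names are taken from the examples above; this just mocks up the structure):

```shell
# Mock of the weekly ring buffer: one directory per weekday, overwritten
# the following week, holding per-minute recording files.
base=flight-recorder
mkdir -p "$base/Monday"
touch "$base/Monday/10_20.xz" \
      "$base/Monday/10_20.unix.xz" \
      "$base/Monday/10_20.innodb" \
      "$base/Monday/10_21.sys.gz"
ls "$base/Monday"
```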
Checking the processlist shows many sleeping connections. We filter these and immediately hit paydirt:
The “Connect” command only means that these are replication threads. Running queries show “Query”, and there the number is the runtime of that query in seconds. There are apparently a number of queries with a large runtime: 33242 seconds is about 9.25 hours; they have been running since 1 am.
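A sketch of that filtering step (the column layout and the values are invented; a real run would read the recorded processlist dump from the flight recorder):

```shell
# Simulated processlist dump: Id User Host db Command Time State.
sample='Id User Host db Command Time State
12 repl somehierarchy-01:3306 NULL Connect 33250 Waiting
44 hadoopuser hadoop-07:5512 somedb Query 33242 Sending_data
45 appuser app-01:4431 somedb Sleep 12 idle'

# Keep only actually running queries (Command = Query) older than 10 min,
# skipping sleepers and replication threads (Connect).
printf '%s\n' "$sample" \
  | awk 'NR > 1 && $5 == "Query" && $6 > 600 {print "thread", $1, "user", $2, "running for", $6, "s"}'
```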
At around 1:00 am, the hadoopuser connected from hadoop-07 to somehierarchy-02 and started a number of queries:
and similar. They each read around 70M rows from MessageLog, a table that itself has 540M rows. That takes time.
During that time, each of these “select” statements maintains a read-only transaction to create a stable view of the table it is reading. That means that as we change rows in MessageLog, the old versions of the rows are moved to the Undo log, and while other transactions read the new version of a row from the tablespace, these long-running transactions read the old versions of the row from the Undo log.
Or they try: they hit the tablespace first, see a new version of the row, are sent one version into the past, find that this is still too recent, go one step further into the past, and so on. We are essentially traversing a linked list on disk, without caching. MySQL is not really meant to deal with 9-hour-old transactions.
We can monitor this:
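One way to monitor this (my assumption of a suitable query; `information_schema.innodb_trx` lists open transactions with their start time) might look like the following, simulated here against a captured result set:

```shell
# In production, something like:
#   mysql -e "SELECT trx_id, trx_started,
#             TIMESTAMPDIFF(SECOND, trx_started, NOW()) AS age_s
#             FROM information_schema.innodb_trx ORDER BY trx_started"
# Here we filter a captured result for transactions older than one hour.
sample='trx_id trx_started age_s
421234 2023-05-08T01:02:11 33242
421900 2023-05-08T10:14:55 5'
printf '%s\n' "$sample" \
  | awk 'NR > 1 && $NF > 3600 {print "trx", $1, "open for", $NF, "seconds"}'
```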
and we will. Not necessarily a thing to alert on at night, but certainly a thing to check early on when debugging.