No more locks with Xtrabackup
Percona’s Xtrabackup is a great product. It performs binary backups of heavily loaded Mysql servers amazingly fast. We’ve used it here at Etsy for years and works very well for us. Mysqldump is another way of doing backups, but for servers with hundreds of gigabytes of data, it’s too slow to be useful, especially for restoring a backup. It can literally take days to restore a couple of hundred of gigabytes generated with Mysqldump. Usually when you need to restore from a backup, you’re in some sort of emergency, and waiting days is not an option.
But Xtrabackup has a significant shortcoming: it needs a global lock at the end of its procedure. Not good for a server doing hundreds or thousands of queries per second. Percona touts Xtrabackup as “non-blocking”, which is for the most part true, but not entirely.
Restoring a Mysql server, most of the time, involves installing an OS, Mysql, and any additional packages you want. Then restoring the latest Xtrabackup data and finally give the new server replication coordinates where it left off before the crash so it can get to the point where its data is consistent with its master. Here’s where the lock comes in: to get a reliable reading of replication coordinates, Xtrabackup issues a “FLUSH TABLES WITH READ LOCK” in the final stage of its process. When the lock is granted, Xtrabackup reads the “MASTER STATUS” and releases the lock. FTWRL is very disruptive to a busy server. It tells Mysql to start the process of read/write locking all tables, which in turn causes all new connections to wait for the lock to be released. Mysql then waits for all outstanding queries to finish, and then grants the lock. If there’s a long running query when FTWRL is requested, you will undoubtedly get a query pile-up which can quickly overwhelm the maximum number of connections your server is configured to accept and your application will stall.
Percona’s documentation states that there is a –no-lock option for Xtrabackup. It also states “Use this option to disable table lock with FLUSH TABLES WITH READ LOCK. Use it only if ALL your tables are InnoDB and you DO NOT CARE about the binary log position of the backup”. We don’t want any locks, but we do want the “binary log position of the backup”, aka replication coordinates.
It turns out that replication coordinates are hidden in the backup files when you run it with the –no-lock option. You just have to know how to get to them. Xtrabackup’s backup procedure involves copying Mysql’s data files to another location, knowing the exact point-in-time when the copying started, and creating an extra file, named xtrabackup_logfile, which contains all writes that occurred during the copying time. Restoring a backup with Xtrabackup requires a “prepare” phase, which is basically applying all writes from the xtrabackup_log file onto data files. When you do the prepare phase, a new file named xtrabackup_binlog_pos_innodb will appear in the restore directory. This file contains the replication coordinates that we need to reestablish replication. With or without the –no-lock option, xtrabackup_binlog_pos_innodb is created in your restore directory!
Even a 1-second stall can be disruptive for a busy server, but our locks used to last around 30 seconds. That was before we adopted the –no-lock option combined with getting replication coordinates from the xtrabackup_binlog_pos_innodb for restores.
An important thing to note is that this works if you only use Innodb tables. You shouldn’t be using MyISAM tables anyway. Use them only for the unavoidable: the `mysql` db which contains grants and other internal metadata.