OpenStack: Recover Galera Cluster

OpenStack MySQL (MariaDB Galera Cluster) Recovery

Problem: Your MySQL Secondary Database will not start because of disk space, InnoDB problems, etc.

This hit me when the Keystone token cleanup got fouled up and I ended up with 900K expired token records. At that point the database was hosed and would not recover because transaction logs were greater than 1GB (default max replication size).
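If you want a rough idea of how bad the token pile-up is before you start, a quick count of expired rows works. This is only a sketch: it assumes the stock keystone database and token table (with its expires column) and the same placeholder credentials used later in this post.

    mysql -u[user] -p[password] -h[host] -e \
        "SELECT COUNT(*) FROM keystone.token WHERE expires < UTC_TIMESTAMP();"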

Solved this by doing the following:

  • Stop the database on both primary (120) and secondary (220) lvosmysql database instances.
  • On the primary, start the database manually:
    sudo su - mysql
    /usr/bin/mysqld_safe --basedir=/usr
    

    Wait for the database to come up cleanly (review /var/log/mariadb/mariadb.log and do a test connection to verify).
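    The test connection can be as simple as tailing the log and running a trivial query from the command line; for example (placeholder credentials, as elsewhere in this post):

    tail -n 50 /var/log/mariadb/mariadb.log
    mysql -u[user] -p[password] -e "SELECT 1;"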

  • On the secondary, because InnoDB was corrupted I had to add the following to /etc/my.cnf:
    # settings to recover in emergency
    innodb_force_recovery=5
    innodb_purge_threads=0
    port=8881
    

    NB: the port change keeps clients from hammering the database while it is recovering.
    Then run the database manually as with the master:

    sudo su - mysql
    /usr/bin/mysqld_safe --basedir=/usr
    

    This process works because the secondary will first detect that it needs the entire /var/lib/mysql/ibdata1 file; this is a Good Thing because it (in effect) forces the secondary to rebuild itself from the master. You can verify this by checking for rsync in the process list (I used lsof for this):

    [root@lvosmysql220 mariadb]# lsof | grep rsync
    wsrep_sst 6711           mysql  255r      REG              253,0        8771      45942 /usr/bin/wsrep_sst_rsync
    rsync     6738           mysql  cwd       DIR              253,0        4096     814419 /var/lib/mysql
    [...]
    rsync     6754           mysql   11r      REG              253,0 18733858816    1258906 /var/lib/mysql/ibdata1
    [...]
    

    Once the ibdata1 file is transferred, the database promptly halts, because innodb_force_recovery=5 places the database in read-only recovery mode. Since the entire database has now been rebuilt from the master, that mode is no longer necessary, so comment out the emergency settings in /etc/my.cnf and manually restart the database on the secondary.
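    To confirm the secondary actually rejoined and finished syncing after the restart, the standard Galera status variables are the quickest check (run on either node; adjust credentials):

    mysql -u[user] -p[password] -e "SHOW STATUS LIKE 'wsrep_cluster_size'; SHOW STATUS LIKE 'wsrep_local_state_comment';"

    You want a cluster size of 2 and a local state of Synced before moving on.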

  • At this point, the primary and secondary database hosts should be talking to each other again and replication should report OK. The next step is to clean up whatever broke the databases in the first place so they can be fully synchronized and committed with each other; in my case this meant a painful session of deleting expired Keystone token records 1,000 at a time (to keep the transaction log / replication processes from being overloaded). That took several hours.
    In your case, you will need to troubleshoot why your secondary database host failed to start and correct as needed.
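    For the record, a batched delete like that can be driven from a small shell loop. This is only a sketch: it assumes the stock keystone.token table, the usual placeholder credentials, and an arbitrary pause between batches.

    # delete expired Keystone tokens 1000 rows at a time until none are left
    while true; do
        ROWS=$(mysql -u[user] -p[password] -h[host] -N -e \
            "DELETE FROM keystone.token WHERE expires < UTC_TIMESTAMP() LIMIT 1000; SELECT ROW_COUNT();")
        echo "deleted ${ROWS} rows"
        [ "${ROWS:-0}" -lt 1000 ] && break
        sleep 2   # brief pause so replication / flow control can keep up
    done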
  • Once the databases are finally at a good point (in my case, when all 900K expired Keystone token records had been deleted and committed on both primary and secondary), you can stop the database on each server (remember: they are running from a manual prompt):
    mysqladmin -u[user] -p[password] -h[host] shutdown
    

    Run the above as root and *wait for a clean shutdown* on each node. I recommend a full reboot of each server and careful verification that MySQL (MariaDB) starts up correctly after the reboot completes.
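    After the reboot, a quick verification pass might look like the following (assuming MariaDB is managed by systemd on your hosts; adjust credentials as usual). Both nodes should report wsrep_ready ON and a cluster size of 2.

    systemctl status mariadb
    mysql -u[user] -p[password] -e "SHOW STATUS LIKE 'wsrep_cluster_size'; SHOW STATUS LIKE 'wsrep_ready';"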

That is all.
