RestoringCloudFrontendFromBackup
Context
In July 2024, we lost our OpenNebula frontend VM (cloud2) after an attempt to reboot it. It was hosted on domz02, a basic standalone QEMU/KVM hypervisor managed with libvirt. The problem seemed to be that the system could not find the partition table in the qcow2 image. As there was no backup of the OpenNebula data and configuration files, the only way to recover them was to attach the image to a new VM and recreate the partition table inside the mounted image using the tool gpart. It eventually worked, but things would have been far easier if we had a simple backup of the directories containing the important ONE data and configuration files of our cloud system. Above all, the procedure we followed to restore the machine was improvised in an emergency, with no guarantee that it would succeed.
Data and configuration items to backup
Here is a list of the important files/directories to back up:
- /var/lib/one
- /var/lib/mysql
- /etc/one
- /etc/my.cnf
- /etc/my.cnf.d
You will find all of them in /backup_mnt/backup/BACKUPS/cloud2/
Procedure to restore the frontend
On the standalone hypervisor, create a new VM with the same hardware characteristics as the previous one (MAC address, disk size, memory, CPU, ...). The easiest way is to copy-paste the XML of the previous VM. Be aware that libvirt will complain that you are reusing the MAC address. The solution is simple: remove the NIC of the previous VM to free the MAC address. Here are some commands that might be useful for this step:
virsh list
virsh edit <machine_name>
virsh create <xml_description_of_vm>
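To free the MAC address, the NIC of the old VM can be detached with virsh; a sketch of what this could look like (the domain name, interface type and MAC below are placeholders, not the real values):

```shell
# Find the MAC held by the old (broken) VM, then detach its NIC so the
# new VM can reuse the address. Adjust domain name, type and MAC.
virsh dumpxml cloud2-old | grep 'mac address'
virsh detach-interface cloud2-old bridge --mac 52:54:00:xx:xx:xx --config
```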
Of course, you can also use the virt-manager GUI for most tasks. Be especially careful with the drivers (they should be virtio for both the NIC and the disk). Also double-check that the disk and the memory have the same sizes as on the previous VM. And of course, don't reuse the disk of the previous VM; you have to create a new one (it's easy to do from the virt-manager GUI).
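These checks can also be done from the command line; a sketch (domain and disk target names are placeholders):

```shell
# Cross-check the new VM against the old one's XML.
virsh dumpxml cloud2 | grep -E 'virtio|mac address'   # NIC and disk should use virtio
virsh domblkinfo cloud2 vda                           # compare disk capacity
virsh dominfo cloud2                                  # compare memory and CPU count
```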
Once the VM is running, you'll have to reinstall the frontend on it with Quattor and Puppet. However, the VM must initially be reinstalled with machine-type 'puppet_node', with the Puppet app set to 'servers' and the role set to 'none'. Why? Because if you directly reinstall the VM as a frontend, the initialization scripts that come with the ONE packages and with Puppet will generate new settings that might be tricky to overwrite with the backup. In other words, at the beginning of the restore process the machine must be a vanilla one. Here is the vanilla profile used to reinstall the VM:
object template cloud2.wn.iihe.ac.be;

include 'machine-types/puppet_node';

# Mounting backup
include 'config/nfs/common';
include 'config/ceph/cephfs.backup';

# software repositories (should be last)
include PKG_REPOSITORY_CONFIG;
And in the Quattor file site/puppet/database, here is the setting for the hiera app and role:
'cloud2.wn.iihe.ac.be', dict(
    'environment', 'prod',
    'app', 'servers',
    'role', 'none',
    'cloud', 'cloud2',
Once these changes have been pushed to the Quattor repository, and before running the aii-shellfe configure and install on the aii server, there are two things to do to avoid problems (note that these two steps are done when using quat -ri):
- revoke the SinDES certificate of the machine on the aii server (if you don't, no SinDES ACL will be created, since there is already a valid certificate for the machine);
- revoke the Puppet certificate on the Puppet master machine.
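For the Puppet part, assuming a recent Puppet server (6+), the revocation could look like this; older installs used `puppet cert clean` instead, and the SinDES revocation is site-specific and not shown:

```shell
# On the Puppet master: revoke and remove the old certificate so the
# reinstalled machine can request a fresh one on first run.
puppetserver ca clean --certname cloud2.wn.iihe.ac.be
```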
When the Quattor installation is finished, mount the backup share on the VM, then restore the files and directories with the command of your choice (in my case, I just used cp -ap). Of course, it is really important to preserve the permissions and ownerships of the source files.
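A minimal sketch of that restore step, assuming the backup share is already mounted and using the paths from the backup list above (the helper function name is hypothetical):

```shell
#!/bin/sh
# Hypothetical helper sketching the restore: copy each backed-up path back
# into place with cp -ap, which recurses and preserves mode, ownership and
# timestamps. BACKUP is the backup tree (e.g. /backup_mnt/backup/BACKUPS/cloud2),
# DEST is normally / on the frontend.
restore_from_backup() {
    BACKUP=$1
    DEST=$2
    for p in var/lib/one var/lib/mysql etc/one etc/my.cnf etc/my.cnf.d; do
        cp -ap "$BACKUP/$p" "$DEST/$(dirname "$p")/"
    done
}
```

Run it as root, otherwise cp cannot actually preserve the ownerships.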
With the data and configuration files restored, it is now time to turn the VM back into a good old OpenNebula frontend. Revert the profile to something like this:
object template cloud2.wn.iihe.ac.be;

include 'machine-types/puppet_node';

variable ONE_RELEASE = '6.6';
include 'features/one_frontend/light_config';
include 'features/one_frontend/one6.X/sunstone_apache_ssl';

# Mounting backup
include 'config/nfs/common';
include 'config/ceph/cephfs.backup';

# Making backup of everything that is needed
include 'components/cron/config';
'/software/components/cron/entries' = push(
    dict(
        'name', 'backup_config_db_one',
        'user', 'root',
        'frequency', '30 */2 * * *',
        'log', dict('disabled', false),
        'command', '/usr/bin/rsync -avR /var/lib/one /var/lib/mysql /etc/one /etc/my.cnf /etc/my.cnf.d /backup_mnt/backup/BACKUPS/cloud2/.'
    )
);

# software repositories (should be last)
include PKG_REPOSITORY_CONFIG;
and don't forget to update the Puppet database so that the correct hiera description is generated in the file /etc/puppetlabs/facter/facts.d/provisionning.yaml:
'cloud2.wn.iihe.ac.be', dict(
    'environment', 'prod',
    'app', 'opennebula',
    'role', 'frontend',
    'cloud', 'cloud2',