RestoringCloudFrontendFromBackup


Context

In July 2024, we lost our OpenNebula frontend VM (cloud2) after an attempt to reboot it. It was hosted on domz02, a basic standalone QEMU/KVM hypervisor managed with libvirt. The problem seemed to be that the system could not find the partition table in the qcow2 image. As there was no backup of the OpenNebula data and configuration files, the only way to recover them was to attach the image to a new VM and recreate the partition table in the mounted image with the tool gpart. It eventually worked, but things would have been far easier if we had had a simple backup of the directories containing the important ONE data and configuration files of our cloud system. Above all, the procedure we followed to restore the machine was sketched in an emergency, with no guarantee that it would succeed.

Data and configuration items to back up

Here is a list of the important files/directories to back up:

  • /var/lib/one
  • /var/lib/mysql
  • /etc/one
  • /etc/my.cnf
  • /etc/my.cnf.d
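
As an illustration, the list above could be archived with a small helper like the one below. This is only a sketch: the function name backup_one and the destination directory /backup are hypothetical, and it assumes the database service is stopped (or otherwise quiescent) while /var/lib/mysql is copied, since copying the files of a running database may give an inconsistent backup.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original procedure):
# archive the given paths into a timestamped tarball in dest_dir.
backup_one() {
    dest_dir=$1
    shift
    archive="$dest_dir/one-backup-$(date +%Y%m%d-%H%M%S).tar.gz"
    # Create a gzip-compressed archive; fail if any listed path is missing,
    # so that an incomplete backup does not go unnoticed.
    tar -czf "$archive" "$@" && printf '%s\n' "$archive"
}

# Example call, to be run as root on the frontend (destination is an assumption):
# backup_one /backup /var/lib/one /var/lib/mysql /etc/one /etc/my.cnf /etc/my.cnf.d
```

The archive should of course be stored outside the frontend VM, otherwise it is lost together with the machine.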

Procedure to restore the backup

On the standalone hypervisor, create a new VM with the same hardware characteristics as the previous one (MAC address, disk size, memory, CPU, ...). The easiest way is to copy-paste the XML of the previous VM. Be aware that libvirt will complain if you reuse a MAC address that is still assigned to the previous VM. The solution is simple: remove the NIC of the previous VM to free the MAC address. Here are some commands that might be useful for this step:

virsh list
virsh edit <machine_name>
virsh create <xml_description_of_vm>
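
The copy of the XML definition can be sketched as below. The function name clone_vm_definition and the VIRSH variable are hypothetical helpers introduced for this example; virsh dumpxml and virsh domiflist are standard virsh subcommands. Note that virsh create starts a transient domain from the XML, whereas virsh define registers a persistent one, so define followed by start may be preferable if the new VM should survive a restart of the hypervisor.

```shell
#!/bin/sh
# VIRSH can be overridden for a dry run, e.g. VIRSH="echo virsh".
VIRSH=${VIRSH:-virsh}

# Hypothetical helper: save the definition of the old VM and list its NICs.
clone_vm_definition() {
    old_vm=$1    # name of the broken VM, as shown by "virsh list --all"
    new_xml=$2   # file that will receive the XML to adapt for the new VM
    # Dump the old definition as a starting point for the new VM.
    $VIRSH dumpxml "$old_vm" > "$new_xml"
    # Show the interfaces of the old VM, including the MAC address to free.
    $VIRSH domiflist "$old_vm"
}

# After editing the XML (new name, no uuid line, path of the new disk):
# virsh define <xml_description_of_vm>
# virsh start <machine_name>
```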

Of course, you can also use the virt-manager GUI for most tasks. Be especially careful with the drivers (both the NIC and the drive should use virtio). Also double-check that the drive and the memory have the same sizes as on the previous VM. And of course, don't reuse the disk of the previous VM: you have to create a new one (this is easy to do from the virt-manager GUI).
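
The driver check can also be done on the command line. The sketch below is a coarse, hypothetical helper (check_virtio is not an existing tool) that greps a dumped domain XML for a virtio NIC model and a virtio disk bus. It only looks for the usual libvirt attributes and does not parse the XML, so treat a positive result as a hint rather than a proof.

```shell
#!/bin/sh
# Hypothetical helper: coarse check that a libvirt domain XML uses virtio
# for the NIC (<model type='virtio'/>) and the disk (<target ... bus='virtio'/>).
check_virtio() {
    xml=$1
    grep -q "<model type=['\"]virtio['\"]" "$xml" \
        || { echo "NIC is not virtio"; return 1; }
    grep -q "bus=['\"]virtio['\"]" "$xml" \
        || { echo "disk is not virtio"; return 1; }
    echo "virtio OK"
}

# Example use on the hypervisor:
# virsh dumpxml <machine_name> > /tmp/vm.xml && check_virtio /tmp/vm.xml
# A new empty disk can also be created without the GUI, for instance:
# qemu-img create -f qcow2 <path_to_new_disk> <size>
```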

Once the VM is running, you'll have to reinstall the frontend on it with Quattor and Puppet. However, the VM must initially be reinstalled with machine type 'puppet_node', with the Puppet app set to 'servers' and the role set to 'none'. Why? Because if you directly reinstall the VM as a frontend, the initialization scripts that come with the ONE packages and with Puppet will generate new settings that might be tricky to overwrite with the backup. In other words, at the beginning of the restore process, the machine must be a vanilla one. Here is the vanilla profile used to reinstall the VM:

object template cloud2.wn.iihe.ac.be;
include 'machine-types/puppet_node';
# software repositories (should be last)
include PKG_REPOSITORY_CONFIG;

And in the Quattor file site/puppet/database, here are the settings for the Hiera app and role:

    'cloud2.wn.iihe.ac.be', dict(
        'environment', 'prod',
        'app', 'servers',
        'role', 'none',
        'cloud', 'cloud2',

Once these changes have been pushed to the Quattor repo, and before running the aii-shellfe configure and install steps on the aii server, there are two things to do to avoid problems:

  • revoke the SinDES certificate of the machine on the aii server (if you don't do that, no SinDES ACL will be created since there is already a valid certificate for the machine);
  • revoke the Puppet certificate on the Puppet master machine.
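
The Puppet revocation and the subsequent AII steps might look like the sketch below. Several points are assumptions: the function names and the RUN variable are hypothetical, the puppetserver ca clean command applies to Puppet 6 and later (older versions use puppet cert clean), and the aii-shellfe options should be checked against the version installed on the aii server. The SinDES revocation is not shown, since its exact command is not documented here.

```shell
#!/bin/sh
# Set RUN=echo to print the commands instead of executing them (dry run).
RUN=${RUN:-}

# Hypothetical helper, to be run on the Puppet master machine.
revoke_puppet_cert() {
    host=$1
    # Assumes Puppet 6 or later; on Puppet 5 and earlier: puppet cert clean "$host"
    $RUN puppetserver ca clean --certname "$host"
}

# Hypothetical helper, to be run on the aii server after both revocations.
reinstall_with_aii() {
    host=$1
    $RUN aii-shellfe --configure "$host"
    $RUN aii-shellfe --install "$host"
}

# Example:
# revoke_puppet_cert cloud2.wn.iihe.ac.be
# reinstall_with_aii cloud2.wn.iihe.ac.be
```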