RADOS Block Device as Xen DomU storage

A RADOS Block Device (RBD) can be treated just like any other block device, but it also offers advanced features such as redundancy and location independence, which make it a near-perfect storage solution for Xen guest domains.

Traditionally, shared storage has to be provisioned in the host domain and provided to a guest domain as a virtual block device. However, as a RADOS block device is a type of (network) block device, there is no need for any of this: the guest domain can be made responsible for connecting to the correct RBD using its own configuration information. This is particularly advantageous when live migration of trusted guest domains between a large pool of host domains is desirable.

Preparing to boot from an RBD

If we intend to boot a Xen guest domain from an RBD, or any other network block device for that matter, we will need to build an initial ram-disk (initrd) so that the kernel modules required to access such a device can be loaded and the configuration can be parsed from the extra parameters specified in the guest configuration.

Installing the Dracut RBD module

We shall therefore need to install the Dracut RBD module, and the Dracut application if it is not already present, on whichever host is most convenient. In this example we are using a host named portage for this purpose.

portage emerge sys-boot/rbd-dracut-module
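
If the Dracut application itself is missing it can be installed first; in the Gentoo tree it is packaged as sys-kernel/dracut.

portage emerge sys-kernel/dracut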

Configuring an RBD capable kernel

Assuming that you already have a correctly configured kernel for your Xen guest domains, there are only two options which will need to be enabled before you can access a RADOS block device. The first of these is the RADOS block device (RBD) driver, which can be found in the Block devices section.

Block devices  --->
    [*] Rados block device (RBD)  (CONFIG_BLK_DEV_RBD)

The second is the Ceph distributed file system, which can be found under the Network File Systems subsection.

Network File Systems  --->
    [*] Ceph distributed file system  (CONFIG_CEPH_FS)
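
If in doubt, a quick grep of the kernel configuration will confirm that both options are enabled. The output below assumes they were built in as shown above; =m would also work, as the initrd takes care of loading modules.

portage linux grep -E "CONFIG_BLK_DEV_RBD|CONFIG_CEPH_FS" .config
CONFIG_BLK_DEV_RBD=y
CONFIG_CEPH_FS=y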

Once the correct kernel options have been enabled you will need to rebuild the kernel and modules, as shown below. You will also need to make a copy of the kernel image somewhere that is accessible from the Xen host domain.

portage linux make && make modules_install
portage src cp linux/arch/x86_64/boot/bzImage images/kernel/linux-3.10.7-gentoo-r1

Building an RBD capable initrd

Now that we have an RBD capable kernel and the Dracut application installed, we can build an RBD capable initrd. As you can see from the example below, we have excluded the mdraid, lvm, lkg and nfs dracut modules to reduce the size of the resulting initrd. If you are not using the default configuration, which includes all installed modules, you may need to ensure that the rbd module is included by specifying --add rbd on the command line.

portage src dracut images/rbdboot/linux-3.10.7-gentoo-r1.img 3.10.7-gentoo-r1 --no-hostonly --omit "mdraid lvm lkg nfs"

Once the initrd has been built you will need to make a copy of the image in the same location as the kernel image from the previous step.
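
Assuming the same paths as in the previous steps, this is just another copy operation. The lsinitrd tool shipped with Dracut can also be used to check that the rbd module really did make it into the image.

portage src cp images/rbdboot/linux-3.10.7-gentoo-r1.img images/kernel/linux-3.10.7-gentoo-r1.img
portage src lsinitrd images/kernel/linux-3.10.7-gentoo-r1.img | grep rbd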

Installing the RBD client tools

Whilst an RBD enabled kernel and initrd are sufficient to boot a Xen guest from a RADOS block device, some additional tools may be required, especially if you need to mount more than just the root volume. The sys-cluster/rbd-client-tools package provides a mount helper capable of mapping and mounting an RBD from an entry in the fstab file. Any required dependencies will, of course, be installed automatically.

guest emerge sys-cluster/rbd-client-tools

The Xen guest should also have a copy of the Ceph configuration file. The easiest method of obtaining one is to copy it from one of the nodes in the storage cluster, as shown below.

guest scp root@host1:/etc/ceph/ceph.conf /etc/ceph/ceph.conf
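
For reference, a minimal client configuration only needs to identify the cluster and its monitor(s). Something along the following lines would suffice; the fsid and monitor address here are simply those which appear in the boot output later in this article.

/etc/ceph/ceph.conf
[global]
fsid = f7693e88-148c-41f5-bd40-2fedeb00bfeb
mon host = 10.0.0.73:6789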

Preparing the guest for network boot

Before we prepare the Xen guest for network boot we should probably make a snapshot of the logical volume that is currently used to store the guest's root filesystem. This way, should anything go wrong, we can easily restore the guest to a functional state. We should temporarily stop the guest domain before we make the snapshot.

host1 xl shutdown -w guest
host1 lvcreate -s -n guest-vm-snapshot -L 2G /dev/host1_vg1/guest-vm
host1 xl create /etc/xen/vm/guest
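
Should a restore prove necessary later, the snapshot can be merged back into the origin volume with lvconvert (again with the guest domain shut down first), for example:

host1 lvconvert --merge /dev/host1_vg1/guest-vm-snapshot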

As the guest's root filesystem will be mounted over the network, it is vital that the network is never shut down. This is exactly what will happen when the Xen guest is terminated with the shutdown command unless steps are taken to prevent it.

The first step we need to take is to remove the init script for the default network interface, probably eth0.

guest rm /etc/init.d/net.eth0
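
If the script had also been added to a runlevel it is worth removing the now dangling reference as well.

guest rc-update del net.eth0 default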

Next we need to inform the init system that the network service will be provided by net.lo. This is not strictly true but, as the network related services will have been started by the initrd, something needs to be marked as providing them and there is no better candidate.

/etc/rc.conf
rc_net_lo_provide="net"

Finally, we need to modify the /etc/fstab file to reference the new paths. As you can see in the example below, the device on which the root filesystem is mounted will now be presented as /dev/root and is formatted using the ext4 filesystem. The next entry represents an example of an additional mount point which used to be presented at /dev/xvda2 but is now identified by the RBD pool and the device name, separated by a forward slash. The filesystem type has also been changed from ext4 to rbd, as the actual type of the filesystem will now be determined automatically.

/etc/fstab
# Old entries
#/dev/xvda1      /               ext4    barrier=0,noatime    0 1
#/dev/xvda2      /mnt/somewhere  ext4    barrier=0,noatime    0 2

# New entries
/dev/root        /               ext4    barrier=0,noatime    0 0
pool/device      /mnt/somewhere  rbd     barrier=0,noatime    0 1
Caution:
When specifying the RADOS block device to mount, it is critical that there is no leading forward slash before the RBD pool and device name. If you did not specify a pool name when creating the RBD it will have defaulted to rbd; this default must nevertheless be spelled out explicitly in /etc/fstab entries.
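
For example, an additional RBD created in the default pool would therefore be mounted with an entry such as the following (note rbd/, not /rbd/).

/etc/fstab
rbd/guest-vx-database    /mnt/somewhere    rbd    barrier=0,noatime    0 1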

Duplicating an LVM volume to an RBD

Of course, before we can boot the guest domain from an RBD we will first need to create such a device. Ideally this device would contain an exact copy of the filesystem as it exists on the current backing storage, presumably an LVM volume. Fortunately, as the example below shows, the rbd utility can be used to easily import the contents of an existing block device. As usual, we should shut down the guest domain before we copy its filesystem(s).

host1 xl shutdown -w guest
host1 rbd import /dev/host1_vg1/guest-vm guest-vm --image-format 2
host1 rbd import /dev/host1_vg1/guest-vx-database guest-vx-database --image-format 2

The only parameters of interest are the source path of the original block device, the new name for the RBD (the pool name will default to rbd in this example as none is specified) and the image format. Version two is usually preferred as it offers some additional features over version one; version one is, however, still the default.
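
Once the imports have completed, the rbd utility can also be used to confirm that the new images exist in the default pool.

host1 rbd ls
guest-vm
guest-vx-database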

Reconfiguring the Xen guest

The last step before we can start our Xen guest domain using a RADOS block device as its root filesystem is to modify the Xen configuration for the guest domain, providing the information required to connect to the network and mount the RBD. Whilst the example provided below looks scary, it is actually fairly simple.

The first change is to remove any nomigrate entries. After we have finished we will be able to migrate the guest domain freely to any Xen host which has network access to the current host and storage cluster.

/etc/xen/vm/guest
# General
name = "guest";
memory = "1024";
vcpus = "1";

# Disable migration of this guest (remove this entry)
#nomigrate = "1";

# Booting
kernel = "/usr/xen/kernel64/linux-3.10.7-gentoo-r1";
ramdisk = "/usr/xen/kernel64/linux-3.10.7-gentoo-r1.img";
#extra = "console=hvc0 vdso32=0 raid=noautodetect rootflags=barrier=0";
extra = "ip=10.0.1.5::10.0.1.6:255.255.255.252:guest:eth0:off nameserver=10.0.1.21 nameserver=10.0.1.109 net.ifnames=0 biosdevname=0 console=hvc0 vdso32=0 raid=noautodetect";
#root = "/dev/xvda1 rw";
root = "rbd:rbd:guest-vm:ext4:barrier=0,noatime";

# Virtual network interface(s)
vif = [ "vifname=vif.guest, ip=10.0.1.6:255.255.255.252" ];

# Virtual harddisk(s) (remove this section)
#disk = [ "phy:/dev/host1_vg1/guest-vm,xvda1,w", "phy:/dev/host1_vg1/guest-vx-database,xvda2,w" ];

The second change is the addition of a ramdisk entry which should point to the initrd we created earlier.

The third change is the addition (or modification) of an extra entry. As you can see, this variable is responsible for providing the boot-time configuration information relating to the network. Any existing extra parameters should be appended to the new entry as shown.

The fourth change is the modification of the root entry. As you can see from the above example, the RBD specification begins with the text rbd: (to indicate that we are mounting an RBD), followed by the pool name (in this case also rbd), the name of the RBD, the filesystem type and, finally, any mount options, with each field separated by a colon.
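
Expressed as a general pattern (the field names below are purely illustrative), the entry therefore takes the following form:

root = "rbd:<pool>:<image>:<fstype>:<mount options>";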

The fifth and final change is the removal of the disk section as it is no longer needed. As you can see we can remove all the existing disk definitions as we have already specified any additional devices in /etc/fstab on the guest.

Testing the guest domain

We should now be able to create the guest domain as usual. We have not provided the complete output in the example below, merely the critical lines which indicate the progress of mounting the RBD.

host1 xl create -c /etc/xen/vm/guest
Command line: root=rbd:rbd:guest-vm:ext4:barrier=0,noatime ip=10.0.1.5::10.0.1.6:255.255.255.252:guest:eth0:off nameserver=10.0.1.21 nameserver=10.0.1.109 net.ifnames=0 biosdevname=0 console=hvc0 vdso32=0 raid=noautodetect 
ceph: loaded (mds proto 32) 
rbd: loaded rbd (rados block device) 
Key type ceph registered 
libceph: loaded (mon/osd proto 15/24) 
libceph: client11213 fsid f7693e88-148c-41f5-bd40-2fedeb00bfeb 
libceph: mon2 10.0.0.73:6789 session established 
 rbd1: unknown partition table 
rbd: rbd1: added with size 0x80000000 
EXT4-fs (rbd1): mounted filesystem with ordered data mode. Opts: (null) 
dracut: Remounting /dev/root with -o barrier=0,noatime,ro 
EXT4-fs (rbd1): re-mounted. Opts: barrier=0 
dracut: Mounted root filesystem /dev/rbd1 
dracut: Switching root 

Assuming everything worked out as planned the guest domain should have started normally using a RADOS block device as its root filesystem and we should be in a position to test live migration to a different host.

host1 xl migrate guest host2

Again, assuming everything worked as expected, the guest domain should now be running on the other host as if nothing had ever happened.
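
If desired, this can be confirmed from the destination host. The output below is illustrative; the ID and times will naturally differ.

host2 xl list guest
Name                                        ID   Mem VCPUs      State   Time(s)
guest                                        2  1024     1     -b----       4.2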