RADOS Block Device as Xen DomU storage

A RADOS Block Device (RBD) can be treated just like any other block device, but it also offers advanced features such as redundancy and location independence, which make it a near-perfect storage solution for Xen guest domains.

Traditionally, shared storage has to be provisioned in the host domain and provided to a guest domain as a virtual block device. However, as a RADOS block device is a type of (network) block device, there is no need for any of this: the guest domain can be made responsible for connecting to the correct RBD using its own configuration information. This is particularly advantageous when live migration of trusted guest domains between a large pool of host domains is desirable.

Preparing to boot from an RBD

If we intend to boot a Xen guest domain from an RBD, or any other network block device for that matter, we will need to build an initial ram-disk (initrd) so that the kernel modules required to access such a device can be loaded and the configuration can be parsed from the extra parameters specified in the guest configuration.

Installing the Dracut RBD module

We shall therefore need to install the Dracut RBD module, and the Dracut application if it is not already present, on whichever host is most convenient. In this example we are using a host named portage for this purpose.

portage emerge sys-boot/rbd-dracut-module
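
If the Dracut application itself is missing it can be installed first; in the Gentoo tree it is packaged as sys-kernel/dracut.

portage emerge sys-kernel/dracut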

Configuring an RBD capable kernel

Assuming that you already have a correctly configured kernel for your Xen guest domains, there are only two options which will need to be enabled before you can access a RADOS block device. The first of these is the RADOS block device (RBD) driver, which can be found in the Block devices section.

Block devices  --->
    [*] Rados block device (RBD)  (CONFIG_BLK_DEV_RBD)

The second is the Ceph distributed file system, which can be found under the Network File Systems subsection.

Network File Systems  --->
    [*] Ceph distributed file system  (CONFIG_CEPH_FS)
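
If in doubt, a quick grep of the kernel configuration will confirm that both options are enabled. The output below assumes they were built in as shown above; =m would also work, as the initrd takes care of loading modules.

portage linux grep -E "CONFIG_BLK_DEV_RBD|CONFIG_CEPH_FS" .config
CONFIG_BLK_DEV_RBD=y
CONFIG_CEPH_FS=y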

Once the correct kernel options have been enabled you will need to rebuild the kernel and modules, as shown below. You will also need to make a copy of the kernel image somewhere that is accessible from the Xen host domain.

portage linux make && make modules_install
portage src cp linux/arch/x86_64/boot/bzImage images/kernel/linux-3.10.7-gentoo-r1

Building an RBD capable initrd

Now that we have an RBD capable kernel and the Dracut application installed, we can build an RBD capable initrd. As you can see from the example below, we have excluded the mdraid, lvm, lkg and nfs dracut modules to reduce the size of the resulting initrd. If you are not using the default configuration, which includes all installed modules, you may need to ensure that the rbd module is included by specifying --add rbd on the command line.

portage src dracut images/rbdboot/linux-3.10.7-gentoo-r1.img 3.10.7-gentoo-r1 --no-hostonly --omit "mdraid lvm lkg nfs"

Once the initrd has been built you will need to make a copy of the image in the same location as the kernel image from the previous step.
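
Assuming the same paths as in the previous steps, this is just another copy operation. The lsinitrd tool shipped with Dracut can also be used to check that the rbd module really did make it into the image.

portage src cp images/rbdboot/linux-3.10.7-gentoo-r1.img images/kernel/linux-3.10.7-gentoo-r1.img
portage src lsinitrd images/kernel/linux-3.10.7-gentoo-r1.img | grep rbd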

Installing the RBD client tools

Whilst an RBD enabled kernel and initrd are sufficient to boot a Xen guest from a RADOS block device, some additional tools may be required, especially if you need to mount more than just the root volume. The sys-cluster/rbd-client-tools package provides a mount helper capable of mapping and mounting an RBD from an entry in the fstab file. Any required dependencies will, of course, be installed automatically.

guest emerge sys-cluster/rbd-client-tools

The Xen guest should also have a copy of the Ceph configuration file. The easiest method of obtaining one is to copy it from one of the nodes in the storage cluster, as shown below.

guest scp root@host1:/etc/ceph/ceph.conf /etc/ceph/ceph.conf
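
For reference, a minimal client configuration only needs to identify the cluster and its monitor(s). Something along the following lines would suffice; the fsid and monitor address here are simply those which appear in the boot output later in this article.

/etc/ceph/ceph.conf
[global]
fsid = f7693e88-148c-41f5-bd40-2fedeb00bfeb
mon host = 10.0.0.73:6789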

Preparing the guest for network boot

Before we prepare the Xen guest for network boot we should probably make a snapshot of the logical volume that is currently used to store the guest's root filesystem. This way, should anything go wrong, we can easily restore the guest to a functional state. We should temporarily stop the guest domain before we make the snapshot.

host1 xl shutdown -w guest
host1 lvcreate -s -n guest-vm-snapshot -L 2G /dev/host1_vg1/guest-vm
host1 xl create /etc/xen/vm/guest
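
Should a restore prove necessary later, the snapshot can be merged back into the origin volume with lvconvert (again with the guest domain shut down first), for example:

host1 lvconvert --merge /dev/host1_vg1/guest-vm-snapshot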

As the guest's root filesystem will be mounted over the network, it is vital that the network is never shut down. This is exactly what will happen when the Xen guest is terminated with the shutdown command unless steps are taken to prevent it.

The first step we need to take is to remove the init script for the default network interface, probably eth0.

guest rm /etc/init.d/net.eth0
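
If the script had also been added to a runlevel it is worth removing the now dangling reference as well.

guest rc-update del net.eth0 default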

Next we need to inform the init system that the network service will be provided by net.lo. This is not strictly true but, as the network related services will have been started by the initrd, something needs to be marked as providing them and there is no better candidate.

/etc/rc.conf
rc_net_lo_provide="net"

Finally, we need to modify the /etc/fstab file to reference the new paths. As you can see in the example below, the device on which the root filesystem is mounted will now be presented as /dev/root and is formatted using the ext4 filesystem. The next entry represents an example of an additional mount point which used to be presented at /dev/xvda2 but is now identified by the RBD pool and the device name, separated by a forward slash. The filesystem type has also been changed from ext4 to rbd, as the actual type of the filesystem will now be determined automatically.

/etc/fstab
# Old entries
#/dev/xvda1      /               ext4    barrier=0,noatime    0 1
#/dev/xvda2      /mnt/somewhere  ext4    barrier=0,noatime    0 2

# New entries
/dev/root        /               ext4    barrier=0,noatime    0 0
pool/device      /mnt/somewhere  rbd     barrier=0,noatime    0 1
Caution:
When specifying the RADOS block device to mount, it is critical that there is no leading forward slash before the RBD pool and device name. If you did not specify a pool name when creating the RBD it will have defaulted to rbd; this default must nevertheless be spelled out explicitly in /etc/fstab entries.
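
For example, an additional RBD created in the default pool would therefore be mounted with an entry such as the following (note rbd/, not /rbd/).

/etc/fstab
rbd/guest-vx-database    /mnt/somewhere    rbd    barrier=0,noatime    0 1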

Duplicating an LVM volume to an RBD

Of course, before we can boot the guest domain from an RBD we will first need to create such a device. Ideally this device would contain an exact copy of the filesystem as it exists on the current backing storage, presumably an LVM volume. Fortunately, as the example below shows, the rbd utility can be used to easily import the contents of an existing block device. As usual, we should shut down the guest domain before we copy its filesystem(s).

host1 xl shutdown -w guest
host1 rbd import /dev/host1_vg1/guest-vm guest-vm --image-format 2
host1 rbd import /dev/host1_vg1/guest-vx-database guest-vx-database --image-format 2

The only parameters of interest are the source path of the original block device, the new name for the RBD (the pool name will default to rbd in this example as none is specified) and the image format. Version two is usually preferred as it offers some additional features over version one; version one is, however, still the default.
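
Once the imports have completed, the rbd utility can also be used to confirm that the new images exist in the default pool.

host1 rbd ls
guest-vm
guest-vx-database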

Reconfiguring the Xen guest

The last step before we can start our Xen guest domain using a RADOS block device as its root filesystem is to modify the Xen configuration for the guest domain, providing the information required to connect to the network and mount the RBD. Whilst the example provided below looks scary, it is actually fairly simple.

The first change is to remove any nomigrate entries. After we have finished we will be able to migrate the guest domain freely to any Xen host which has network access to the current host and storage cluster.

/etc/xen/vm/guest
# General
name = "guest";
memory = "1024";
vcpus = "1";

# Disable migration of this guest (remove this entry)
#nomigrate = "1";

# Booting
kernel = "/usr/xen/kernel64/linux-3.10.7-gentoo-r1";
ramdisk = "/usr/xen/kernel64/linux-3.10.7-gentoo-r1.img";
#extra = "console=hvc0 vdso32=0 raid=noautodetect rootflags=barrier=0";
extra = "ip=10.0.1.5::10.0.1.6:255.255.255.252:guest:eth0:off nameserver=10.0.1.21 nameserver=10.0.1.109 net.ifnames=0 biosdevname=0 console=hvc0 vdso32=0 raid=noautodetect";
#root = "/dev/xvda1 rw";
root = "rbd:rbd:guest-vm:ext4:barrier=0,noatime";

# Virtual network interface(s)
vif = [ "vifname=vif.guest, ip=10.0.1.6:255.255.255.252" ];

# Virtual harddisk(s) (remove this section)
#disk = [ "phy:/dev/host1_vg1/guest-vm,xvda1,w", "phy:/dev/host1_vg1/guest-vx-database,xvda2,w" ];

The second change is the addition of a ramdisk entry which should point to the initrd we created earlier.

The third change is the addition (or modification) of an extra entry. As you can see, this variable is responsible for providing the boot-time configuration information relating to the network. Any existing extra parameters should be appended to the new entry as shown.

The fourth change is the modification of the root entry. As you can see from the above example, the RBD specification begins with the text rbd: (to indicate that we are mounting an RBD), followed by the pool name (in this case also rbd), the name of the RBD, the filesystem type and, finally, any mount options, with each field separated by a colon.
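
Expressed as a general pattern (the field names below are purely illustrative), the entry therefore takes the following form:

root = "rbd:<pool>:<image>:<fstype>:<mount options>";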

The fifth and final change is the removal of the disk section as it is no longer needed. As you can see we can remove all the existing disk definitions as we have already specified any additional devices in /etc/fstab on the guest.

Testing the guest domain

We should now be able to create the guest domain as usual. We have not provided the complete output in the example below, merely the critical lines which indicate the progress of mounting the RBD.

host1 xl create -c /etc/xen/vm/guest
Command line: root=rbd:rbd:guest-vm:ext4:barrier=0,noatime ip=10.0.1.5::10.0.1.6:255.255.255.252:guest:eth0:off nameserver=10.0.1.21 nameserver=10.0.1.109 net.ifnames=0 biosdevname=0 console=hvc0 vdso32=0 raid=noautodetect 
ceph: loaded (mds proto 32) 
rbd: loaded rbd (rados block device) 
Key type ceph registered 
libceph: loaded (mon/osd proto 15/24) 
libceph: client11213 fsid f7693e88-148c-41f5-bd40-2fedeb00bfeb 
libceph: mon2 10.0.0.73:6789 session established 
 rbd1: unknown partition table 
rbd: rbd1: added with size 0x80000000 
EXT4-fs (rbd1): mounted filesystem with ordered data mode. Opts: (null) 
dracut: Remounting /dev/root with -o barrier=0,noatime,ro 
EXT4-fs (rbd1): re-mounted. Opts: barrier=0 
dracut: Mounted root filesystem /dev/rbd1 
dracut: Switching root 

Assuming everything worked out as planned the guest domain should have started normally using a RADOS block device as its root filesystem and we should be in a position to test live migration to a different host.

host1 xl migrate guest host2

Again, assuming everything worked as expected, the guest domain should now be running on the other host as if nothing had ever happened.
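
If desired, this can be confirmed from the destination host. The output below is illustrative; the ID and times will naturally differ.

host2 xl list guest
Name                                        ID   Mem VCPUs      State   Time(s)
guest                                        2  1024     1     -b----       4.2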