Additional Ceph nodes

Clearly, a storage cluster with no redundancy is of limited value. In this section we shall explore how to add additional OSD, MON and MDS nodes to achieve a fully redundant storage solution.

Additional OSD setup

You can see from the status output in the example below that we currently have only a single OSD configured and that the pgmap status is therefore shown as active+degraded. You can also see that there is a HEALTH_WARN flag displayed and that all 384 placement groups are marked as degraded, with recovery stuck at 50%. The 21/42 figure indicates that only one of the two expected copies of each object exists. We clearly need to configure at least one more OSD.

host1 ceph --status
  cluster f7693e88-148c-41f5-bd40-2fedeb00bfeb
   health HEALTH_WARN 384 pgs degraded; 384 pgs stuck unclean; recovery 21/42 degraded (50.000%)
   monmap e1: 1 mons at {a=10.0.0.70:6789/0}, election epoch 2, quorum 0 a
   osdmap e3: 1 osds: 1 up, 1 in
    pgmap v8: 384 pgs: 384 active+degraded; 9518 bytes data, 127 MB used, 9296 MB / 9951 MB avail; 21/42 degraded (50.000%)
   mdsmap e4: 1/1/1 up {0=a=up:active} 

Adding another OSD to the configuration is trivial. The example below shows the modifications required to add another OSD on a computer named host2.

/etc/ceph/ceph.conf
[global]
fsid = f7693e88-148c-41f5-bd40-2fedeb00bfeb
auth cluster required = none
auth service required = none
auth client required = none
keyring = /etc/ceph/keyring.admin

[mon]
keyring = /etc/ceph/keyring.$name

[mds]
keyring = /etc/ceph/keyring.$name

[osd]
osd data = /mnt/ceph
osd journal = /mnt/ceph/journal
osd journal size = 100
filestore xattr use omap = true
keyring = /etc/ceph/keyring.$name

[mon.a]
host = host1
mon addr = 10.0.0.70:6789

[mds.a]
host = host1

[osd.0]
host = host1
[osd.1]
host = host2

Of course, the configuration files need to be identical on all hosts so we need to copy the configuration file to the new host as shown below.

host1 scp /etc/ceph/ceph.conf root@host2:/etc/ceph/ceph.conf
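With more than one additional host this step is easily scripted. The sketch below simply prints the required scp commands rather than executing them; the host names are assumptions and should be replaced with your own.

```shell
#!/bin/sh
# Print the scp command needed to push the cluster configuration
# to each additional node. Remove the echo to actually copy.
for h in host2 host3 host4; do
    echo scp /etc/ceph/ceph.conf "root@$h:/etc/ceph/ceph.conf"
done
```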

We can now create a logical volume to store our ceph data files. The example below creates a 1 TB logical volume named ceph in the host2_vg1 volume group, formats it with the ext4 filesystem and mounts it at /mnt/ceph. Don't forget to add an entry to /etc/fstab so that it is mounted automatically after a reboot.

host2 lvcreate -n ceph -L 1T host2_vg1
host2 mkfs.ext4 /dev/host2_vg1/ceph
host2 mkdir /mnt/ceph
host2 mount /dev/host2_vg1/ceph /mnt/ceph
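For reference, a matching /etc/fstab entry might look like the line below. The device path and filesystem follow the example above; the mount options are a minimal assumption and can be tuned for your system.

```
/dev/host2_vg1/ceph    /mnt/ceph    ext4    defaults    0 2
```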

Once we have configured suitable storage for our ceph data files we need to prepare this storage by creating a journal, superblock and keyring. All these tasks can be accomplished using a single invocation of the ceph-osd command as shown below.

host2 ceph-osd -i 1 --mkfs --mkkey
2013-09-02 21:47:24.260347 7f9ad53d2780 -1 journal FileJournal::_open: disabling aio for non-block journal.  Use journal_force_aio to force use of aio anyway
2013-09-02 21:47:24.420838 7f9ad53d2780 -1 journal FileJournal::_open: disabling aio for non-block journal.  Use journal_force_aio to force use of aio anyway
2013-09-02 21:47:24.426420 7f9ad53d2780 -1 filestore(/mnt/ceph) could not find 23c2fcde/osd_superblock/0//-1 in index: (2) No such file or directory
2013-09-02 21:47:24.487748 7f9ad53d2780 -1 created object store /mnt/ceph journal /mnt/ceph/journal for osd.1 fsid f7693e88-148c-41f5-bd40-2fedeb00bfeb
2013-09-02 21:47:24.487825 7f9ad53d2780 -1 auth: error reading file: /etc/ceph/keyring.osd.1: can't open /etc/ceph/keyring.osd.1: (2) No such file or directory
2013-09-02 21:47:24.487994 7f9ad53d2780 -1 created new key in keyring /etc/ceph/keyring.osd.1 

Although we have disabled authentication, we have still been taking the steps required for authentication to work when it is enabled. To ensure that the new OSD has the correct permissions to access the other members of the storage cluster we need to add the key from its keyring and specify the correct ACLs. The example below demonstrates how this can be accomplished using the ceph command-line utility.

host2 ceph auth add osd.1 osd 'allow *' mon 'allow rwx' -i /etc/ceph/keyring.osd.1
added key for osd.1 

Now that the correct permissions have been set we can create the new OSD using the command below. The command prints the ID of the newly allocated OSD, which should match the ID we have been using in the previous steps (1).

host2 ceph osd create

With the new OSD configured and initialised we can start the ceph daemons.

host2 /etc/init.d/ceph start
=== osd.1 === 
create-or-move updating item name 'osd.1' weight 1.00 at location {host=host2,root=default} to crush map
Starting Ceph osd.1 on host2...
starting osd.1 at :/0 osd_data /mnt/ceph /mnt/ceph/journal 

Once the data from the existing host has been replicated to this OSD the status should change to active+clean indicating that there is a clean copy of the data on this OSD. The overall health indicator should now show HEALTH_OK, as shown in the following example.

host2 ceph -s
  cluster f7693e88-148c-41f5-bd40-2fedeb00bfeb
   health HEALTH_OK
   monmap e1: 1 mons at {a=10.0.0.70:6789/0}, election epoch 1, quorum 0 a
   osdmap e25: 2 osds: 2 up, 2 in
    pgmap v50: 384 pgs: 384 active+clean; 9518 bytes data, 254 MB used, 18591 MB / 19902 MB avail
   mdsmap e4: 1/1/1 up {0=a=up:active} 

Additional OSD - CRUSH map

Ceph uses a special algorithm called CRUSH which determines how to store and retrieve data by computing storage locations. When adding a new OSD it is usually necessary to modify the CRUSH map, which is essentially a configuration file providing the CRUSH algorithm with critical information about how the data should be arranged in your datacentre to ensure maximum redundancy.

There are two methods of modifying the CRUSH map. The first, shown below, uses the ceph command-line utility to retrieve the binary CRUSH map and the crushtool application to decompile the binary into a text file suitable for editing. Once editing is complete the crushtool application may be used to compile the new CRUSH map, and the ceph command-line utility can be used to set it on the active cluster. More information on this method of modifying the CRUSH map may be found in the official CRUSH maps documentation.

host2 ceph osd getcrushmap -o crushmap.bin
got crush map from osdmap epoch 25 
host2 crushtool -d crushmap.bin -o crushmap.txt
host2 nano -w crushmap.txt
host2 crushtool -c crushmap.txt -o crushmap.bin
host2 ceph osd setcrushmap -i crushmap.bin
set crush map for osdmap epoch 26 
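For orientation, the decompiled crushmap.txt is plain text, and the bucket and rule sections are the parts most often edited when adding a host. A fragment from a two-host map of this era might look roughly like the following; the bucket ID and weight are illustrative, not values from this cluster.

```
# buckets
host host2 {
        id -3                   # bucket id, assigned automatically
        alg straw
        hash 0                  # rjenkins1
        item osd.1 weight 1.000
}

# rules
rule data {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}
```

The "step chooseleaf firstn 0 type host" line is what places each replica on a different host, which is exactly the redundancy we are after in this section.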

Alternatively, the CRUSH map may be modified directly using the ceph command-line utility, as shown below. More documentation may be found in the Adjusting the CRUSH map section of the official documentation.

host2 ceph osd crush set osd.1 1.0 root=default rack=rack1 host=host2
set item id 1 name 'osd.1' weight 1 at location {root=default rack=rack1 host=host2} to crush map 

There are also some other parameters, referred to as CRUSH tunables; all we shall say on the subject in this guide is that the optimal profile may be selected as shown below.

host2 ceph osd crush tunables optimal
adjusted tunables profile to optimal 

Additional MON setup

Having redundant storage of data is nice, but with only a single storage cluster monitor running there is still a single point of failure. In fact, because a majority of the MON daemons must be reachable by the OSD daemons, we really need an odd number of at least three configured MON daemons to have a truly redundant network. Luckily the processing and memory requirements of a MON daemon are fairly low, so any always-on host which is well connected to the OSD servers is a good candidate.

As in the previous section, where we installed the primary MON daemon, we will need to create a logical volume to store the MON map files. This is critical to the correct operation of the MON daemon, as it will automatically shut down should this location ever become more than 95% full. Don't forget to add the new logical volume to the /etc/fstab file so it is automatically mounted.

host2 lvcreate -n ceph-mon -L 2G host2_vg1
host2 mkfs.ext4 /dev/host2_vg1/ceph-mon
host2 mkdir -p /var/lib/ceph/mon
host2 mount /dev/host2_vg1/ceph-mon /var/lib/ceph/mon
host2 mkdir /var/lib/ceph/mon/ceph-b /tmp/ceph
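The 95% threshold mentioned above can be kept an eye on with a small script such as the sketch below. The mount point default and threshold are assumptions for illustration; df -P and awk are standard POSIX tools.

```shell
#!/bin/sh
# Sketch: warn before the MON data partition reaches the 95% usage
# level at which the MON daemon will shut itself down.
# Pass the mount point as $1; defaults to / purely for illustration.
MON_DIR="${1:-/}"
THRESHOLD=95

# Extract the "Use%" column for the mount point, stripping the % sign.
pct=$(df -P "$MON_DIR" | awk 'NR==2 { sub(/%/, "", $5); print $5 }')

if [ "$pct" -lt "$THRESHOLD" ]; then
    echo "$MON_DIR is ${pct}% full: OK"
else
    echo "WARNING: $MON_DIR is ${pct}% full"
fi
```

In practice you would point this at /var/lib/ceph/mon (or wherever your MON data lives) from cron or your monitoring system.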

Clearly, for the OSD, MDS and other MON daemons to know that the new daemon exists it needs to be added to the configuration file of every node in the storage cluster. The example below details the changes required to our existing configuration file to add a new MON daemon, mon.b, on host2 with the network address 10.0.0.1. As usual you will probably need to modify these values to reflect your network.

/etc/ceph/ceph.conf
[global]
fsid = f7693e88-148c-41f5-bd40-2fedeb00bfeb
auth cluster required = none
auth service required = none
auth client required = none
keyring = /etc/ceph/keyring.admin

[mon]
keyring = /etc/ceph/keyring.$name

[mds]
keyring = /etc/ceph/keyring.$name

[osd]
osd data = /mnt/ceph
osd journal = /mnt/ceph/journal
osd journal size = 100
filestore xattr use omap = true
keyring = /etc/ceph/keyring.$name

[mon.a]
host = host1
mon addr = 10.0.0.70:6789

[mon.b]
host = host2
mon addr = 10.0.0.1:6789

[mds.a]
host = host1

[osd.0]
host = host1

[osd.1]
host = host2

Now that we have configured the new MON daemon and created somewhere for the MON map files to be stored we can get a temporary key and the current MON map from the cluster.

host2 ceph auth get mon. -o /tmp/ceph/keyring.mon
exported keyring for mon. 
host2 ceph mon getmap -o /tmp/ceph/monmap
got latest monmap 

We can then use the ceph-mon application to create a new monfs using the temporary keyring and MON map we obtained in the previous step.

host2 ceph-mon -i b --mkfs --monmap /tmp/ceph/monmap --keyring /tmp/ceph/keyring.mon
ceph-mon: set fsid to f7693e88-148c-41f5-bd40-2fedeb00bfeb
ceph-mon: created monfs at /var/lib/ceph/mon/ceph-b for mon.b 

Finally we can start the new MON daemon. As you can see in the example below we have provided the identity of the MON daemon (with -i b) and the public address and port.

host2 ceph-mon -i b --public-addr 10.0.0.1:6789

With the new MON daemon running we can add it to the list of quorum votes using the ceph command-line utility as shown below.

host2 ceph mon add b 10.0.0.1:6789

We can verify that the new MON daemon is indeed running and has become part of the quorum using the command below. As you can see, we now have 2 mons and a quorum consisting of 0,1 a,b.

host2 ceph -s
  cluster f7693e88-148c-41f5-bd40-2fedeb00bfeb
   health HEALTH_OK
   monmap e2: 2 mons at {a=10.0.0.70:6789/0,b=10.0.0.1:6789/0}, election epoch 2, quorum 0,1 a,b
   osdmap e30: 2 osds: 2 up, 2 in
    pgmap v153: 384 pgs: 384 active+clean; 2016 MB data, 4326 MB used, 1864 GB / 1968 GB avail
   mdsmap e9: 1/1/1 up {0=a=up:active} 
Warning:
To ensure that split brain is impossible, ceph requires that a quorum (in this case consisting of a majority of the eligible monitor daemons) be reachable. It is therefore critical that your cluster has an odd number of monitor daemons. Three MON daemons allow one to fail or be down for maintenance, five MON daemons allow for two to have failed or be down for maintenance, and so on.
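The quorum arithmetic can be sketched as a standalone illustration (this is not a ceph command): for n monitors, a quorum requires floor(n/2)+1 votes, so floor((n-1)/2) monitors may fail. Note that four monitors tolerate no more failures than three, which is why odd counts are preferred.

```shell
#!/bin/sh
# Illustration of MON quorum arithmetic: for n monitors a quorum
# needs floor(n/2)+1 votes, so floor((n-1)/2) failures are tolerated.
for n in 1 3 4 5 7; do
    quorum=$(( n / 2 + 1 ))
    tolerated=$(( (n - 1) / 2 ))
    echo "$n mons: quorum=$quorum, tolerated failures=$tolerated"
done
```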
 

Additional MDS setup

To achieve complete redundancy we also need to configure at least one additional MDS daemon. The example configuration below describes the changes which need to be made to our current configuration to add another MDS daemon on host2. As always, changes to the ceph configuration file should be replicated across all members of the storage cluster.

/etc/ceph/ceph.conf
[global]
fsid = f7693e88-148c-41f5-bd40-2fedeb00bfeb
auth cluster required = none
auth service required = none
auth client required = none
keyring = /etc/ceph/keyring.admin

[mon]
keyring = /etc/ceph/keyring.$name

[mds]
keyring = /etc/ceph/keyring.$name

[osd]
osd data = /mnt/ceph
osd journal = /mnt/ceph/journal
osd journal size = 100
filestore xattr use omap = true
keyring = /etc/ceph/keyring.$name

[mon.a]
host = host1
mon addr = 10.0.0.70:6789

[mon.b]
host = host2
mon addr = 10.0.0.1:6789

[mds.a]
host = host1

[mds.b]
host = host2

[osd.0]
host = host1

[osd.1]
host = host2

Once the new MDS daemon has been configured we can obtain a new key using the ceph-authtool utility, as shown below.

host2 ceph-authtool --create-keyring --gen-key -n mds.b /etc/ceph/keyring.mds.b
creating /etc/ceph/keyring.mds.b 

We can then set the correct authentication tokens and permissions so that the new MDS daemon has access to the rest of the storage cluster.

host2 ceph auth add mds.b osd 'allow *' mon 'allow rwx' mds 'allow' -i /etc/ceph/keyring.mds.b
added key for mds.b 

Finally, we can start the new MDS daemon.

host2 ceph-mds -i b

Assuming all is well and the new MDS daemon is operating correctly, the ceph status should now include an mdsmap entry containing 1 up:standby, as shown below.

host2 ceph -s
  cluster f7693e88-148c-41f5-bd40-2fedeb00bfeb
   health HEALTH_OK
   monmap e3: 3 mons at {a=10.0.0.70:6789/0,b=10.0.0.1:6789/0,c=10.0.0.73:6789/0}, election epoch 4, quorum 0,1,2 a,b,c
   osdmap e30: 2 osds: 2 up, 2 in
    pgmap v162: 384 pgs: 384 active+clean; 2016 MB data, 4326 MB used, 1864 GB / 1968 GB avail
   mdsmap e10: 1/1/1 up {0=a=up:active}, 1 up:standby