Clearly, a storage cluster with no redundancy is of limited value. In this section we shall explore how to add further OSD, MON and MDS daemons to achieve a fully redundant storage solution.
You can see from the status output in the example below that we currently only have a single OSD configured and that the pgmap status is therefore shown as active+degraded. You can also see that there is a HEALTH_WARN flag displayed and that all 384 placement groups (PGs) are marked as degraded, with recovery stuck at fifty percent. We clearly need to configure at least one more OSD.
   cluster f7693e88-148c-41f5-bd40-2fedeb00bfeb
   health HEALTH_WARN 384 pgs degraded; 384 pgs stuck unclean; recovery 21/42 degraded (50.000%)
   monmap e1: 1 mons at {a=10.0.0.70:6789/0}, election epoch 2, quorum 0 a
   osdmap e3: 1 osds: 1 up, 1 in
   pgmap v8: 384 pgs: 384 active+degraded; 9518 bytes data, 127 MB used, 9296 MB / 9951 MB avail; 21/42 degraded (50.000%)
   mdsmap e4: 1/1/1 up {0=a=up:active}
Adding another OSD to the configuration is trivial. The example below shows the modifications required to add another OSD on a computer named host2.
[global]
    fsid = f7693e88-148c-41f5-bd40-2fedeb00bfeb
    auth cluster required = none
    auth service required = none
    auth client required = none
    keyring = /etc/ceph/keyring.admin

[mon]
    keyring = /etc/ceph/keyring.$name

[mds]
    keyring = /etc/ceph/keyring.$name

[osd]
    osd data = /mnt/ceph
    osd journal = /mnt/ceph/journal
    osd journal size = 100
    filestore xattr use omap = true
    keyring = /etc/ceph/keyring.$name

[mon.a]
    host = host1
    mon addr = 10.0.0.70:6789

[mds.a]
    host = host1

[osd.0]
    host = host1
[osd.1]
    host = host2
Of course, the configuration files need to be identical on all hosts so we need to copy the configuration file to the new host as shown below.
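Assuming root SSH access between the two machines (an assumption of this sketch; any other file-transfer mechanism works just as well), the copy might be performed like this:

```shell
# Copy the updated cluster configuration from host1 to the new host.
scp /etc/ceph/ceph.conf root@host2:/etc/ceph/ceph.conf
```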
We can now create a logical volume to store our ceph data files. The example below would create a 1TB logical volume named ceph in the host2_vg1 volume group, format it with the ext4 filesystem and mount it at /mnt/ceph. Don't forget to add an entry to /etc/fstab so that it is mounted automatically after a re-boot.
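A sketch of those steps, assuming LVM is already set up with a volume group named host2_vg1, might look like the following:

```shell
# Create a 1TB logical volume named "ceph" in the host2_vg1 volume group.
lvcreate --size 1T --name ceph host2_vg1

# Format it with ext4 and mount it at /mnt/ceph.
mkfs.ext4 /dev/host2_vg1/ceph
mkdir -p /mnt/ceph
mount /dev/host2_vg1/ceph /mnt/ceph

# Add an fstab entry so the volume is mounted automatically after a re-boot.
echo '/dev/host2_vg1/ceph  /mnt/ceph  ext4  defaults  0 2' >> /etc/fstab
```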
Once we have configured suitable storage for our ceph data files we need to prepare this storage by creating a journal, superblock and keyring. All these tasks can be accomplished using a single invocation of the ceph-osd command as shown below.
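For a Ceph release of this vintage the invocation would be along these lines; the exact flags may differ slightly between versions:

```shell
# Initialise the data directory, journal, superblock and keyring for osd.1.
ceph-osd -i 1 --mkfs --mkkey
```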
2013-09-02 21:47:24.260347 7f9ad53d2780 -1 journal FileJournal::_open: disabling aio for non-block journal. Use journal_force_aio to force use of aio anyway
2013-09-02 21:47:24.420838 7f9ad53d2780 -1 journal FileJournal::_open: disabling aio for non-block journal. Use journal_force_aio to force use of aio anyway
2013-09-02 21:47:24.426420 7f9ad53d2780 -1 filestore(/mnt/ceph) could not find 23c2fcde/osd_superblock/0//-1 in index: (2) No such file or directory
2013-09-02 21:47:24.487748 7f9ad53d2780 -1 created object store /mnt/ceph journal /mnt/ceph/journal for osd.1 fsid f7693e88-148c-41f5-bd40-2fedeb00bfeb
2013-09-02 21:47:24.487825 7f9ad53d2780 -1 auth: error reading file: /etc/ceph/keyring.osd.1: can't open /etc/ceph/keyring.osd.1: (2) No such file or directory
2013-09-02 21:47:24.487994 7f9ad53d2780 -1 created new key in keyring /etc/ceph/keyring.osd.1
Although we have disabled authentication, we have still been taking the steps required to ensure that authentication will work when enabled. To ensure that the new OSD has the correct permissions to access the other members of the storage cluster we need to add the public key from its keyring and specify the correct ACLs. This can be accomplished using the ceph command-line utility.
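A plausible invocation, granting osd.1 the capabilities an OSD typically needs, is sketched below; the exact capability strings may need adjusting for your cluster:

```shell
# Register the new OSD's key with the cluster and grant it access.
ceph auth add osd.1 osd 'allow *' mon 'allow rwx' -i /etc/ceph/keyring.osd.1
```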
added key for osd.1
Now that the correct permissions have been set we can create the new OSD using the command below.
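The command prints the numeric identifier allocated to the new OSD:

```shell
# Register a new OSD with the cluster; prints the id of the new OSD.
ceph osd create
```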
1
With the new OSD configured and initialised we can start the ceph daemons.
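On a system using classic init scripts, as assumed throughout this guide, the locally configured daemons might be started with:

```shell
# Start the ceph daemons configured for this host (here, osd.1 on host2).
/etc/init.d/ceph start
```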
=== osd.1 ===
create-or-move updating item name 'osd.1' weight 1.00 at location {host=host2,root=default} to crush map
Starting Ceph osd.1 on host2...
starting osd.1 at :/0 osd_data /mnt/ceph /mnt/ceph/journal
Once the data from the existing host has been replicated to this OSD the status should change to active+clean indicating that there is a clean copy of the data on this OSD. The overall health indicator should now show HEALTH_OK, as shown in the following example.
   cluster f7693e88-148c-41f5-bd40-2fedeb00bfeb
   health HEALTH_OK
   monmap e1: 1 mons at {a=10.0.0.70:6789/0}, election epoch 1, quorum 0 a
   osdmap e25: 2 osds: 2 up, 2 in
   pgmap v50: 384 pgs: 384 active+clean; 9518 bytes data, 254 MB used, 18591 MB / 19902 MB avail
   mdsmap e4: 1/1/1 up {0=a=up:active}
Ceph uses a special algorithm called CRUSH which determines how to store and retrieve data by computing storage locations. When adding a new OSD it is usually necessary to modify the CRUSH map, which is essentially a configuration file that provides the CRUSH algorithm with critical information about how the data should be arranged in your datacentre to ensure maximum redundancy.
There are two methods of modifying the CRUSH map. The first, shown below, uses the ceph command-line utility to retrieve the binary CRUSH map and the crushtool application to decompile the binary into a text file suitable for editing. Once editing is complete the crushtool application may be used to compile the new CRUSH map and the ceph command-line utility can be used to set it on the active cluster. More information documenting this method of modifying the CRUSH map may be found in the official CRUSH maps documentation.
got crush map from osdmap epoch 25
host2 ~ # crushtool -d crushmap.bin -o crushmap.txt
set crush map for osdmap epoch 26
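Putting the whole round trip together, the sequence might look like this; the crushmap.bin, crushmap.txt and crushmap.new filenames are merely illustrative scratch files:

```shell
# Retrieve the binary CRUSH map from the cluster.
ceph osd getcrushmap -o crushmap.bin

# Decompile it to text, edit it, then recompile it.
crushtool -d crushmap.bin -o crushmap.txt
${EDITOR:-vi} crushmap.txt
crushtool -c crushmap.txt -o crushmap.new

# Upload the new CRUSH map to the active cluster.
ceph osd setcrushmap -i crushmap.new
```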
Alternatively, the CRUSH map may be modified directly using the ceph command-line utility, as shown below. More documentation may be found in the Adjusting the CRUSH map section of the official documentation.
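A command consistent with this approach, placing osd.1 with weight 1 under rack1 on host2, would be something like the following; check the syntax against your Ceph version:

```shell
# Add or move osd.1 in the CRUSH map with weight 1.0 at the given location.
ceph osd crush set osd.1 1.0 root=default rack=rack1 host=host2
```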
set item id 1 name 'osd.1' weight 1 at location {root=default rack=rack1 host=host2} to crush map
There are also some other parameters, referred to as CRUSH tunables, although all we shall say on the subject in this guide is that the optimal profile may be selected as shown below.
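The selection itself is a one-liner:

```shell
# Switch the cluster's CRUSH tunables to the optimal profile.
ceph osd crush tunables optimal
```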
adjusted tunables profile to optimal
Having redundant storage of data is nice, but with only a single storage cluster monitor running there is still a single point of failure. In fact, as a majority of the MON daemons must be contactable for the cluster to maintain quorum, we really need an odd number of at least three configured MON daemons to have a truly redundant network. Luckily the processing and memory requirements of a MON daemon are fairly low, so any always-on host which is well connected to the OSD servers is a good candidate.
As in the previous section, when we installed the primary MON daemon, we will need to create a logical volume to store the MON map files. This is critical to the correct operation of the MON daemon as it will automatically shut down should this location ever become more than 95% full. Don't forget to add the new logical volume to the /etc/fstab file so it is automatically mounted.
Clearly, for the OSD, MDS and other MON daemons to know that the new daemon exists it needs to be added to the configuration file of every node in the storage cluster. The example below details the changes required to our existing configuration file to add a new MON daemon, mon.b, on host2 with the network address 10.0.0.1. As usual you will probably need to modify these values to reflect your network.
[global]
    fsid = f7693e88-148c-41f5-bd40-2fedeb00bfeb
    auth cluster required = none
    auth service required = none
    auth client required = none
    keyring = /etc/ceph/keyring.admin

[mon]
    keyring = /etc/ceph/keyring.$name

[mds]
    keyring = /etc/ceph/keyring.$name

[osd]
    osd data = /mnt/ceph
    osd journal = /mnt/ceph/journal
    osd journal size = 100
    filestore xattr use omap = true
    keyring = /etc/ceph/keyring.$name

[mon.a]
    host = host1
    mon addr = 10.0.0.70:6789
[mon.b]
    host = host2
    mon addr = 10.0.0.1:6789
[mds.a]
    host = host1

[osd.0]
    host = host1

[osd.1]
    host = host2
Now that we have configured the new MON daemon and created somewhere for the MON map files to be stored we can get a temporary key and the current MON map from the cluster.
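The retrieval might look like this; the /tmp/ceph paths are just a convenient scratch location for this sketch:

```shell
# Export the mon. keyring and the current MON map to temporary files.
mkdir -p /tmp/ceph
ceph auth get mon. -o /tmp/ceph/keyring
ceph mon getmap -o /tmp/ceph/monmap
```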
exported keyring for mon.
host2 ~ # ceph mon getmap -o /tmp/ceph/monmap
got latest monmap
We can then use the ceph-mon application to create a new monfs using the temporary keyring and MON map we obtained in the previous step.
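Assuming the temporary files were saved under /tmp/ceph as in the previous step, the invocation would be along these lines:

```shell
# Create the monfs for mon.b using the retrieved MON map and keyring.
ceph-mon -i b --mkfs --monmap /tmp/ceph/monmap --keyring /tmp/ceph/keyring
```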
ceph-mon: set fsid to f7693e88-148c-41f5-bd40-2fedeb00bfeb
ceph-mon: created monfs at /var/lib/ceph/mon/ceph-b for mon.b
Finally we can start the new MON daemon. As you can see in the example below we have provided the identity of the MON daemon (with -i b) and the public address and port.
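For example, using the address configured for mon.b earlier:

```shell
# Start mon.b, listening on its public address and the standard MON port.
ceph-mon -i b --public-addr 10.0.0.1:6789
```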
With the new MON daemon running we can add it to the list of quorum votes using the ceph command-line utility as shown below.
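The invocation would be something like:

```shell
# Add mon.b to the MON map so it participates in quorum elections.
ceph mon add b 10.0.0.1:6789
```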
We can verify that the new MON daemon is indeed running, and has become part of the quorum, using the command below. As you can see we now have 2 mons and a quorum consisting of 0,1 a,b.
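The cluster state, including the MON quorum, can be inspected with:

```shell
# Display the overall cluster status.
ceph status
```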
   cluster f7693e88-148c-41f5-bd40-2fedeb00bfeb
   health HEALTH_OK
   monmap e2: 2 mons at {a=10.0.0.70:6789/0,b=10.0.0.1:6789/0}, election epoch 2, quorum 0,1 a,b
   osdmap e30: 2 osds: 2 up, 2 in
   pgmap v153: 384 pgs: 384 active+clean; 2016 MB data, 4326 MB used, 1864 GB / 1968 GB avail
   mdsmap e9: 1/1/1 up {0=a=up:active}
To achieve complete redundancy we also need to configure at least one additional MDS daemon. The example configuration below describes the changes which need to be made to our current configuration to add another MDS daemon on host2. As always, changes to the ceph configuration file should be replicated across all members of the storage cluster.
[global]
    fsid = f7693e88-148c-41f5-bd40-2fedeb00bfeb
    auth cluster required = none
    auth service required = none
    auth client required = none
    keyring = /etc/ceph/keyring.admin

[mon]
    keyring = /etc/ceph/keyring.$name

[mds]
    keyring = /etc/ceph/keyring.$name

[osd]
    osd data = /mnt/ceph
    osd journal = /mnt/ceph/journal
    osd journal size = 100
    filestore xattr use omap = true
    keyring = /etc/ceph/keyring.$name

[mon.a]
    host = host1
    mon addr = 10.0.0.70:6789

[mon.b]
    host = host2
    mon addr = 10.0.0.1:6789

[mds.a]
    host = host1
[mds.b]
    host = host2
[osd.0]
    host = host1

[osd.1]
    host = host2
Once the new MDS daemon has been configured we can obtain a new key using the ceph-authtool utility, as shown below.
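An invocation consistent with the message shown in the output would be:

```shell
# Create a new keyring for mds.b and generate a key inside it.
ceph-authtool --create-keyring --gen-key -n mds.b /etc/ceph/keyring.mds.b
```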
creating /etc/ceph/keyring.mds.b
We can then set the correct authentication tokens and permissions so that the new MDS daemon has access to the rest of the storage cluster.
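One plausible invocation, granting mds.b the capabilities an MDS typically needs, is sketched below; adjust the capability strings for your cluster:

```shell
# Register mds.b's key and grant it the capabilities an MDS requires.
ceph auth add mds.b osd 'allow *' mon 'allow rwx' mds 'allow' -i /etc/ceph/keyring.mds.b
```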
added key for mds.b
Finally, we can start the new MDS daemon.
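On a system using classic init scripts, as assumed in this guide, this might be:

```shell
# Start the newly configured MDS daemon on host2.
/etc/init.d/ceph start mds.b
```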
Assuming all is well and the new MDS daemon is operating correctly the ceph status should now include an mdsmap entry similar to 1 up:standby, as shown below.
   cluster f7693e88-148c-41f5-bd40-2fedeb00bfeb
   health HEALTH_OK
   monmap e3: 3 mons at {a=10.0.0.70:6789/0,b=10.0.0.1:6789/0,c=10.0.0.73:6789/0}, election epoch 4, quorum 0,1,2 a,b,c
   osdmap e30: 2 osds: 2 up, 2 in
   pgmap v162: 384 pgs: 384 active+clean; 2016 MB data, 4326 MB used, 1864 GB / 1968 GB avail
   mdsmap e10: 1/1/1 up {0=a=up:active}, 1 up:standby