Reliability

Reliability is, as all competent system administrators know, one of the most critical features of any production system. This section is devoted to techniques which can be used to increase the reliability of a computer system at both the hardware and software levels.

The documents in this section explore a wide variety of methods for increasing overall system reliability. These range from monitoring individual components of a system, such as hard disks, for signs of imminent failure, to implementing high-availability clusters capable of withstanding the catastrophic failure of major system components or even entire machines.

Backup system

No matter how reliable the hardware, how well written the software, how well configured or how physically secure a computer system is, it will eventually fail, usually in an unexpected manner. As it is impossible to protect completely against this kind of failure, all we can do to mitigate the resulting losses is to protect the valuable data stored on the system.

This documentation aims to be a thorough discussion of backup systems and strategies, including the use of a variety of technologies to back up files and databases in the most common usage scenarios. We shall also discuss the use of the Internet and the "Cloud" as storage media for off-site backups, and cover how to install and configure some common production software to operate in a high-availability environment.

Hardware monitoring

Whilst all computer hardware will eventually fail, there are many measures which can be taken to reduce the likelihood of such a failure or to predict when it is imminent. Most modern computer hardware, such as motherboards, CPUs, memory modules, graphics adapters and hard disks, has some built-in capability to monitor its own health. These capabilities vary, but usually consist of measuring some combination of voltage, temperature, vibration or rotation speed to predict when a device requires routine maintenance or is reaching the end of its life and needs replacing.

In this section we shall look at the hardware monitoring capabilities of the Linux kernel, and install and configure software to monitor and record voltages, temperatures and fan speeds. We shall also examine the use of applications to monitor and log SMART data from hard disk drives. Finally, we shall explain how to configure applications to use the Simple Network Management Protocol (SNMP) to make this information available over a network, as well as how to monitor and log it.
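As a rough illustration of what the kernel exposes, the sketch below reads raw sensor values from the hwmon sysfs tree. It assumes the conventional layout under /sys/class/hwmon, in which temperatures are reported in millidegrees Celsius, voltages in millivolts and fan speeds in RPM. The function name read_hwmon and its root parameter are hypothetical, chosen only for this example; dedicated monitoring software does considerably more, such as applying per-chip labels and limits.

```python
import re
from pathlib import Path

# Divisors for the raw integer values exposed by the hwmon sysfs interface:
# temp*_input is in millidegrees Celsius, in*_input in millivolts,
# fan*_input in revolutions per minute.
SCALES = {"temp": (1000, "C"), "in": (1000, "V"), "fan": (1, "RPM")}
SENSOR_RE = re.compile(r"(temp|in|fan)(\d+)_input")


def read_hwmon(root="/sys/class/hwmon"):
    """Return a list of (chip, sensor, value, unit) tuples.

    'root' is a parameter only so that the function can be pointed at a
    test directory; on a real system the default is the usual location.
    """
    readings = []
    for chip in sorted(Path(root).glob("hwmon*")):
        name_file = chip / "name"
        chip_name = name_file.read_text().strip() if name_file.is_file() else chip.name
        for sensor in sorted(chip.iterdir()):
            match = SENSOR_RE.fullmatch(sensor.name)
            if match is None:
                continue  # not a voltage, temperature or fan input
            kind, index = match.groups()
            try:
                raw = int(sensor.read_text().strip())
            except (OSError, ValueError):
                continue  # skip sensors that cannot be read or parsed
            divisor, unit = SCALES[kind]
            readings.append((chip_name, f"{kind}{index}", raw / divisor, unit))
    return readings


if __name__ == "__main__":
    for chip, sensor, value, unit in read_hwmon():
        print(f"{chip:12} {sensor:8} {value:8.1f} {unit}")
```

Run periodically and appended to a file, output of this kind gives a simple record of voltages, temperatures and fan speeds over time.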

System logging with syslog-ng

When operating a production installation, the availability and reliability of services, especially those deemed mission-critical, are of primary importance. Many failures, when analysed after the event, could have been prevented had adequate system logging and reporting been in place to ensure that the conditions leading up to the failure were noted and remedial action taken. Automated log analysis tools can only operate effectively if they have a reliable and consistent set of log files to work from.

This document details how to install and configure the syslog-ng system logging daemon to fulfil a variety of requirements. These range from simply collecting the messages generated by a single system and recording them in a text file, to receiving the logs from dozens of different machines and storing them in a database for further analysis, while sending email notifications of serious events.
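To make "serious events" a little more concrete: every syslog message carries a priority value which encodes a facility and a severity as facility × 8 + severity, with severity 0 (emerg) the most urgent and 7 (debug) the least. The sketch below decodes that value from a raw message and tests it against a threshold. It is only an illustration of the encoding, not syslog-ng's own implementation; syslog-ng would normally perform this selection in its own configuration. The function names decode_pri and is_serious are hypothetical.

```python
import re

# Severity names in numeric order, 0 (most urgent) to 7 (least urgent).
SEVERITIES = ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"]
PRI_RE = re.compile(r"<(\d{1,3})>")


def decode_pri(line):
    """Return (facility number, severity name) from a raw syslog line."""
    match = PRI_RE.match(line)
    if match is None:
        raise ValueError("line does not start with a <PRI> header")
    # The priority value is facility * 8 + severity.
    facility, severity = divmod(int(match.group(1)), 8)
    return facility, SEVERITIES[severity]


def is_serious(line, threshold="err"):
    """True if the message is at least as severe as 'threshold'."""
    _, severity = decode_pri(line)
    return SEVERITIES.index(severity) <= SEVERITIES.index(threshold)


if __name__ == "__main__":
    sample = "<34>Oct 11 22:14:15 mymachine su: 'su root' failed"
    print(decode_pri(sample), is_serious(sample))
```

A notification script could use a test of this kind to decide which messages warrant an email and which should simply be stored.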