Introduction

What data should you back up?

When deciding what data should be protected by a backup plan, and how much should be invested in such a plan, it is often advisable to categorise all the data stored on a network or system using the categories and guidelines below. In my professional experience, all data can be placed into one of these categories, and almost all organisations will find that they have data from all of these groups. This is a valuable exercise to undertake even in an organisation with a backup strategy already in place, as new data is always being added and incorporated into existing systems.

  • Easily replaced
  • Replaced with some effort / time
  • Hard to replace
  • Mission critical

Easily replaced data generally consists of the software packages used to build the system and any other data which is distributed from a central repository not maintained by the organisation in question. This type of data is generally downloaded from some other source, usually the Internet, and therefore falls under someone else's backup strategy. The contents of the distfiles directory on a Gentoo installation are a good example of easily replaced data. Data in this category should not be part of a backup plan, as including it is probably a waste of storage.
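
In practice this usually means excluding such data from backups explicitly. A minimal sketch using tar, assuming the portage tree lives in its default location (adjust the paths for your installation):

  # Archive /usr but exclude the easily replaced distfiles
  tar --exclude='/usr/portage/distfiles' -czpf /backup/usr.tar.gz /usr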

Some data can be replaced with minimal effort but may take longer than replacing data in the previous category. On a Gentoo Linux installation a good example of this would be any binary packages which have been built. Whilst such packages could be rebuilt, the time taken to do so may be prohibitive if a quick return to productive work is desired. Data in this category should possibly be part of a backup plan, but it is worth calculating the costs of the required data storage and comparing them to the costs of the down-time incurred while such data is recreated.

Hard to replace data consists of anything which is not mission critical but does not fall into the above two categories. This kind of data usually consists of user configuration files, simple document templates, browser favourites, etc. It could all be recreated or replaced with enough effort, but the time taken to do so would almost certainly be prohibitive. This type of data should definitely be part of a backup strategy, probably including off-site storage and disaster recovery planning.

The final category of data is that which is deemed to be critical to the survival of the business. This should include all customer records, accounting information, databases, source code repositories, document storage / management systems, etc. Data in this category should be protected at all costs and needs to be available within a very short space of time following a disaster or data loss. Ideally such data should always be available regardless of any physical or technological disaster which may take place. A traditional backup and recovery strategy is probably not sufficient for this type of data, and special high-availability systems should be investigated.

What type of backup system should you use?

There are essentially two categories of data stored on modern computer systems: files and databases. Each requires a different approach to backups, as they are inherently different in nature. Files can usually be backed up using traditional file-management tools such as cp and tar, while databases usually require special care to ensure that the backup preserves relational integrity and the consistency of the data at the application level.
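
For example, a simple one-off file backup can be made with tar alone. A minimal sketch, with both paths as illustrative assumptions:

  # Create a compressed archive of /home, preserving permissions
  tar -czpf /backup/home-$(date +%Y%m%d).tar.gz /home

  # Restore it later, relative to the filesystem root
  # (substitute the name of the archive created above)
  tar -xzpf /backup/home-20240101.tar.gz -C /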

Backing up files

There are many ways of backing up the files we use on our computers. In this guide we shall examine how to back up files using a variety of methods suitable for use in a production environment.

rsnapshot

rsnapshot is a utility for making snapshots of filesystems on local or remote computers using rsync. It uses hard links to make multiple full snapshots available whilst only requiring the storage for a single complete snapshot plus any incremental changes. As all data is transferred using rsync, only files which have changed since the last snapshot was taken will be copied, making rsnapshot extremely bandwidth-efficient.
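
A minimal sketch of an rsnapshot configuration follows; the snapshot root, retention counts and backup points are assumptions to adapt. Note that rsnapshot requires tabs, not spaces, between the elements of each line:

  # /etc/rsnapshot.conf (excerpt) -- fields must be tab-separated
  snapshot_root   /var/backup/snapshots/
  interval        hourly  6
  interval        daily   7
  backup          /home/                  localhost/
  backup          root@server:/etc/       server/

The snapshot rotation is then driven from cron, for example:

  # Run rsnapshot every four hours and once daily
  0 */4 * * *    /usr/bin/rsnapshot hourly
  30 3 * * *     /usr/bin/rsnapshot daily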

Whilst rsnapshot is excellent at making snapshots of filesystems it does have some shortcomings which may make it unsuitable for some uses. The main limitation is caused, ironically, by the main strength of the rsnapshot application: its use of rsync. Although rsync transfers only the changed portions of a file over the network, it stores a complete new copy of any modified file at the destination, so rsnapshot will keep a full copy of every changed file in every snapshot. This is usually not a problem, as most files which change often are fairly small. If you regularly work with very large files, however, the storage requirements for keeping complete snapshots using rsnapshot may become prohibitive.

rdiff-backup

rdiff-backup is a utility similar to the rsnapshot application described above. Unlike rsnapshot, however, rdiff-backup stores its incremental backups as reverse diffs, which enables it to be much more efficient when storing large files. This storage efficiency comes at a price: for most files, creating and applying the reverse diffs takes much longer than simply copying the file. For this reason a combination of rsnapshot to back up "normal" files and rdiff-backup to back up "large" files is often used.
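
A minimal sketch of the rdiff-backup side of such a setup, with all paths and host names as illustrative assumptions:

  # Back up a directory of large files to a remote host over SSH
  rdiff-backup /srv/large-files backuphost::/var/backup/large-files

  # List the increments stored at the destination
  rdiff-backup --list-increments backuphost::/var/backup/large-files

  # Restore the directory as it was three days ago
  rdiff-backup -r 3D backuphost::/var/backup/large-files /srv/restored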

Backing up databases

When performing a backup of a database server there are three main approaches, all of which will be explored in this guide. We shall investigate the relative merits and weaknesses of each, and show how to perform a database backup by making an SQL dump, by performing filesystem-level backups using LVM snapshots, and by using on-line backups with Point In Time Recovery, all in the context of a PostgreSQL database cluster in a production environment.

SQL dump

The SQL dump method of database backup aims to produce a file, or files, containing the SQL commands required to recreate a database, or databases, complete with users, stored procedures, triggers, table layouts and data, and any other information the database server sees fit to store. The file(s) can then be executed against an empty database server and should result in a complete recreation of the original database.

The PostgreSQL database provides tools capable of making an SQL dump of an existing database as part of the default installation. In this guide we shall describe how to perform a database backup using the pg_dump and pg_dumpall utilities, as well as how to store a number of these backups using the minimum of space with rdiff-backup.
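
For example, a whole cluster or a single database can be dumped as follows; the user name, database name and paths are assumptions:

  # Dump the entire cluster (all databases, roles and tablespaces)
  pg_dumpall -U postgres > /var/backup/cluster.sql

  # Dump a single database
  pg_dump -U postgres mydb > /var/backup/mydb.sql

  # Restore a dump into an empty server with psql
  psql -U postgres -f /var/backup/cluster.sql postgres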

Filesystem snapshots

As mentioned previously, another common technique when performing a database backup is to use the volume snapshot feature provided by LVM to make a backup of the database files in a consistent manner.

This guide will provide a description of how to perform a backup of a PostgreSQL database using the volume snapshot feature provided by LVM and will also investigate the data consistency implications of backing up a running database server.
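
A minimal sketch of the procedure, assuming the database files live on a logical volume named /dev/vg0/pgdata (the volume names, sizes and mount points are illustrative):

  # Create a snapshot of the volume holding the database files
  lvcreate --size 1G --snapshot --name pgsnap /dev/vg0/pgdata

  # Mount the snapshot read-only and archive its contents
  mount -o ro /dev/vg0/pgsnap /mnt/pgsnap
  tar -czpf /var/backup/pgdata.tar.gz -C /mnt/pgsnap .

  # Release the snapshot once the archive is complete
  umount /mnt/pgsnap
  lvremove -f /dev/vg0/pgsnap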

On-line backup and Point In Time Recovery

The third, and by far the most elegant, method of performing a database backup uses an additional database server, or servers, configured to perform continuous on-line backups. These backups can also be used to provide Point In Time Recovery, which is the database equivalent of the snapshots we created using rsnapshot when backing up files.

In this guide we shall document how to configure the PostgreSQL server to perform on-line backup and Point In Time Recovery (PITR) in an efficient and reliable manner.
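
As a sketch, continuous WAL archiving is enabled in postgresql.conf along the following lines; the archive directory is an assumption, and parameter names vary between PostgreSQL versions:

  # postgresql.conf (excerpt) -- enable WAL archiving for PITR
  wal_level = replica       # 'archive' on older releases
  archive_mode = on
  archive_command = 'test ! -f /mnt/wal_archive/%f && cp %p /mnt/wal_archive/%f'

A base backup is then taken to pair with the archived WAL segments, for example with pg_basebackup on recent releases:

  # Take a compressed, tar-format base backup of the cluster
  pg_basebackup -U postgres -D /var/backup/base -Ft -z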