Monitoring hard disk SMART data

Installing the package

Before installing any packages, we should ensure that the correct USE flags are configured so that all required functionality is available and unnecessary functionality is left out. The smartmontools package and its dependencies provide a variety of USE flags, only some of which are discussed here. As usual, feel free to add and remove USE flags at will; the minimum set required to follow this guide in its entirety is shown below.

lisa emerge -pv smartmontools
 
These are the packages that would be merged, in order:

Calculating dependencies... done!
[ebuild      ] sys-apps/smartmontools-5.38  USE="-minimal -static"

Once you are confident that the correct USE flags are set for the smartmontools package and any dependencies it may require, you can proceed with the installation by issuing the emerge command shown below.

lisa emerge smartmontools

Manual verification of SMART data

Now that we have the tools required to initiate SMART tests and view their results, we can begin by checking that our hard disk drives are SMART capable (they will be unless they are very old indeed) and that SMART support is enabled for them.

The command shown below uses the smartctl application to interrogate the first SATA, SCSI or SAS hard disk. The output in our example gives an idea of what to expect, although the details will differ unless you are using the same make and model of drive.

lisa smartctl -i /dev/sda
 
smartctl version 5.38 [x86_64-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen 
Home page is http://smartmontools.sourceforge.net/ 
 
=== START OF INFORMATION SECTION === 
Model Family:     Seagate Barracuda ES 
Device Model:     ST3250620NS 
Serial Number:    5QE53KR5 
Firmware Version: 3.AEK 
User Capacity:    250,059,350,016 bytes 
Device is:        In smartctl database [for details use: -P show] 
ATA Version is:   7 
ATA Standard is:  Exact ATA specification draft version not indicated 
Local Time is:    Mon Mar 22 15:23:28 2010 CET 
SMART support is: Available - device has SMART capability. 
SMART support is: Enabled 

The two most significant lines of output, as far as we are concerned, are the last two. The above example shows output from a system where SMART support is both available and enabled. If the last line indicates that SMART support is disabled, but the line above it indicates that SMART support is available, then the command below should be issued to enable SMART support. If, on the other hand, SMART support is unavailable, you should check the settings in the BIOS and enable SMART support there before performing the above test again.

lisa smartctl -s on /dev/sda
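
The decision described above can also be made by a script. The following is a minimal sketch, assuming the last two lines of the information section are worded as in the example output; the function name smart_state is hypothetical and not part of smartmontools.

```shell
#!/bin/sh
# smart_state (hypothetical helper): read `smartctl -i` output on stdin and
# print "enabled", "disabled" or "unavailable".
# Assumption: the status lines begin with "SMART support is:" as shown above.
smart_state() {
    awk '
        /^SMART support is: +Enabled/  { state = "enabled" }
        /^SMART support is: +Disabled/ { state = "disabled" }
        END { print (state == "" ? "unavailable" : state) }
    '
}

# Example use (not run here):
#   smartctl -i /dev/sda | smart_state
```

If the helper prints "disabled", the smartctl -s on command shown above is the next step; if it prints "unavailable", the BIOS settings are the place to look.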

Assuming that our target hard disk either already had SMART support enabled, or that we have successfully enabled it in the above step, we can check what capabilities are offered by our particular make and model of device. The example command below shows how.

lisa smartctl -c /dev/sda
 
smartctl version 5.38 [x86_64-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen 
Home page is http://smartmontools.sourceforge.net/ 
 
=== START OF READ SMART DATA SECTION === 
General SMART Values: 
Offline data collection status:  (0x82)	Offline data collection activity 
                                        was completed without error. 
                                        Auto Offline Data Collection: Enabled. 
Self-test execution status:      (   0)	The previous self-test routine completed 
                                        without error or no self-test has ever  
                                        been run. 
Total time to complete Offline  
data collection:                 ( 430) seconds. 
Offline data collection 
capabilities:                    (0x5b) SMART execute Offline immediate. 
                                        Auto Offline data collection on/off support. 
                                        Suspend Offline collection upon new 
                                        command. 
                                        Offline surface scan supported. 
                                        Self-test supported. 
                                        No Conveyance Self-test supported. 
                                        Selective Self-test supported. 
SMART capabilities:            (0x0003) Saves SMART data before entering 
                                        power-saving mode. 
                                        Supports SMART auto save timer. 
Error logging capability:        (0x01)	Error logging supported. 
                                        General Purpose Logging supported. 
Short self-test routine  
recommended polling time:        (   1) minutes. 
Extended self-test routine 
recommended polling time:        (  92) minutes. 
SCT capabilities:              (0x003d) SCT Status supported. 
                                        SCT Feature Control supported. 
                                        SCT Data Table supported. 

As you can see, the output from the above command is extensive. We shall not explain every line here, as plenty of resources are available on the Internet should a detailed explanation be of interest. Instead, we shall simply direct your attention to the last few lines of output, which allow us to determine what tests we can perform on our drives and what data we can expect from those tests.

The lines of most interest to us, at least within the scope of this guide, are those indicating the recommended polling time for the short and extended (long) self-test routines and whether the drive exposes any SCT capabilities.
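
For scripted use, the recommended polling times can be pulled out of the capabilities listing. This is only a sketch, assuming the two-line layout seen in the example output, where the test name is on one line and the value in parentheses on the next; the function name polling_times is hypothetical.

```shell
#!/bin/sh
# polling_times (hypothetical helper): read `smartctl -c` output on stdin and
# print "short N" and "long N", where N is the recommended polling time in
# minutes. Assumption: the layout matches the example output above.
polling_times() {
    awk '
        /^Short self-test routine/    { kind = "short" }
        /^Extended self-test routine/ { kind = "long" }
        /^recommended polling time:/ && kind != "" {
            value = $0
            gsub(/[^0-9]/, "", value)   # keep only the digits from "(  92)"
            print kind, value
            kind = ""
        }
    '
}

# Example use (not run here):
#   smartctl -c /dev/sda | polling_times
```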

Now that we know that a short self-test routine is available, and that the recommended polling time for that test is one minute, we can initiate a short self-test using the command shown below. The smartctl application also gives an estimate of how long the test will take to complete: in our case 430 seconds, or about 7 minutes.

lisa smartctl -t short /dev/sda
 
smartctl version 5.38 [x86_64-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen 
Home page is http://smartmontools.sourceforge.net/ 
 
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION === 
Sending command: "Execute SMART Short self-test routine immediately in off-line mode". 
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful. 
Testing has begun. 
Please wait 430 seconds for test to complete. 
Test will complete after Mon Mar 22 15:32:09 2010 
 
Use smartctl -X to abort test. 

As we have just initiated a short self-test routine, we will want to see its result once it has finished. The command below displays the self-test log maintained by the drive.

lisa smartctl -l selftest /dev/sda
 
smartctl version 5.38 [x86_64-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen 
Home page is http://smartmontools.sourceforge.net/ 
 
=== START OF READ SMART DATA SECTION === 
SMART Self-test log structure revision number 1 
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error 
# 1  Short Offline       Completed without error       00%        37         - 

Hopefully you will see something similar to that shown above indicating that the short test routine has completed without error.

Warning:
If any errors are shown then continue reading to find out how to retrieve more detailed information. Now might also be a good time to make a backup of the contents of that drive if you do not already have one!
 

Assuming that the short self-test routine completed without error, we can initiate a more thorough self-test routine with the command shown below. As you can see, this test takes considerably longer than the short self-test in the previous example.

lisa smartctl -t long /dev/sda
 
smartctl version 5.38 [x86_64-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen 
Home page is http://smartmontools.sourceforge.net/ 
 
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION === 
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode". 
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful. 
Testing has begun. 
Please wait 92 minutes for test to complete. 
Test will complete after Mon Mar 22 17:04:29 2010 
 
Use smartctl -X to abort test. 

Given that the long self-test takes so long, you may want to check that it is still running and whether any errors have been found so far. The same command as before can be issued while a test is running and will provide output similar to that below. In this example the test is in progress with 90% remaining, and no errors have been found.

lisa smartctl -l selftest /dev/sda
 
smartctl version 5.38 [x86_64-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen 
Home page is http://smartmontools.sourceforge.net/ 
 
=== START OF READ SMART DATA SECTION === 
SMART Self-test log structure revision number 1 
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error 
# 1  Short Offline       Completed without error       00%        37         - 
# 2  Extended offline    Self-test routine in progress 90%        38         - 
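
The progress figure in the log above can be extracted for use in a script, for example to poll until a test finishes. The sketch below assumes the in-progress entry is worded as in the example output; the function name selftest_remaining is hypothetical.

```shell
#!/bin/sh
# selftest_remaining (hypothetical helper): read `smartctl -l selftest` output
# on stdin. If an entry is marked "Self-test routine in progress", print the
# percentage remaining and return 0; otherwise print nothing and return 1.
selftest_remaining() {
    awk '
        /^#/ && /Self-test routine in progress/ {
            if (match($0, /[0-9]+%/)) {
                # Drop the trailing "%" and print the number on its own.
                print substr($0, RSTART, RLENGTH - 1) + 0
                found = 1
                exit
            }
        }
        END { if (!found) exit 1 }
    '
}

# Example use (not run here): check every five minutes until the test is done.
#   while smartctl -l selftest /dev/sda | selftest_remaining; do sleep 300; done
```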

Once the test has completed, its log entry moves to the top of the list, as shown in the example below.

lisa smartctl -l selftest /dev/sda
 
smartctl version 5.38 [x86_64-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen 
Home page is http://smartmontools.sourceforge.net/ 
 
=== START OF READ SMART DATA SECTION === 
SMART Self-test log structure revision number 1 
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error 
# 1  Extended offline    Completed without error       00%        38         - 
# 2  Short Offline       Completed without error       00%        37         - 
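
A script can apply the same pass or fail reading to the self-test log. The following sketch treats every entry that is neither "Completed without error" nor still in progress as a problem; that rule is an assumption made for illustration, and the function name selftest_failures is hypothetical.

```shell
#!/bin/sh
# selftest_failures (hypothetical helper): read `smartctl -l selftest` output
# on stdin, print every log entry that did not pass, and return 1 if any such
# entry was found. Entries still in progress are not counted as failures.
selftest_failures() {
    awk '
        /^#/ && !/Completed without error/ && !/in progress/ {
            print
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    '
}

# Example use (not run here):
#   smartctl -l selftest /dev/sda | selftest_failures || echo "check /dev/sda"
```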

Now that we have performed both a short and a long self-test routine, there is one more command which may be of interest. The command shown below displays all the SMART parameters recorded by the drive in tabular form, providing much more detailed information than the simple pass or fail given in the self-test logs. The example below shows information from the same drive as above after it has been in use for some time.

lisa smartctl -A /dev/sda
 
smartctl version 5.38 [x86_64-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen 
Home page is http://smartmontools.sourceforge.net/ 
 
=== START OF READ SMART DATA SECTION === 
SMART Attributes Data Structure revision number: 10 
Vendor Specific SMART Attributes with Thresholds: 
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE 
  1 Raw_Read_Error_Rate     0x000f   111   097   006    Pre-fail  Always       -       155501215 
  3 Spin_Up_Time            0x0003   095   095   000    Pre-fail  Always       -       0 
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       14 
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0 
  7 Seek_Error_Rate         0x000f   085   060   030    Pre-fail  Always       -       382279888 
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       7461 
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0 
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       14 
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0 
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0 
190 Airflow_Temperature_Cel 0x0022   055   047   045    Old_age   Always       -       45 (Lifetime Min/Max 36/48) 
194 Temperature_Celsius     0x0022   045   053   000    Old_age   Always       -       45 (0 19 0 0) 
195 Hardware_ECC_Recovered  0x001a   061   057   000    Old_age   Always       -       168854605 
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0 
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0 
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0 
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0 
202 TA_Increase_Count       0x0032   100   253   000    Old_age   Always       -       0 

As with the output of other smartctl commands, we shall not go into great detail about the meaning of every line. In our experience the most important indicators are: the Current_Pending_Sector and Offline_Uncorrectable counts, which if non-zero usually indicate imminent drive failure; the UDMA_CRC_Error_Count, which usually indicates a faulty cable or power supply; and Temperature_Celsius, the current temperature of the drive, which can often be used to detect a failed fan or a failing disk bearing.
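
The counters singled out above can be checked automatically. This is a minimal sketch, assuming the ten-column table layout of the example output with the raw value in the last column; the function name critical_attributes is hypothetical.

```shell
#!/bin/sh
# critical_attributes (hypothetical helper): read `smartctl -A` output on
# stdin, print NAME=RAW_VALUE for each watched counter whose raw value is
# non-zero, and return 1 if any were printed.
# Assumption: column 2 is the attribute name and column 10 the raw value.
critical_attributes() {
    awk '
        $2 == "Current_Pending_Sector" ||
        $2 == "Offline_Uncorrectable"  ||
        $2 == "UDMA_CRC_Error_Count" {
            if ($10 + 0 != 0) {
                print $2 "=" $10
                bad = 1
            }
        }
        END { exit (bad ? 1 : 0) }
    '
}

# Example use (not run here):
#   smartctl -A /dev/sda | critical_attributes || echo "investigate /dev/sda"
```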

Configuring automatic SMART monitoring

Now that we have tested our hard disk(s) by hand and verified that they are currently free from errors, and that the indicators mentioned above do not already point to imminent failure, we can use the smartd daemon to automatically perform self-tests, log the results and email alerts should any of those values change or exceed a predefined threshold.

Our first task is to move aside the default configuration installed by the smartmontools package, as it is more suitable as an example than as a working configuration.

lisa mv /etc/smartd.conf /etc/smartd.conf.example

We can now create a new configuration file for the smartd daemon based on the example given below.

/etc/smartd.conf
/dev/sda -l error -l selftest   \   # Check error and selftest logs.
-H \ # Check SMART health status
-f \ # Check for "failure" of any usage attributes.
-p -u \ # Report pre-fail and usage attribute changes...
-I 190 -I 194 -I 9 \ # ...ignoring temperatures and power-on hours.
-W 0,1,50 \ # Log temperatures and email if over 50 degrees.
-s (S/../.././01|L/../../7/02) \ # Short self-test daily after 01:00, long self-test on Sunday after 02:00
-m root \ # Send alert emails to root
-M test # Send a test email on smartd startup.
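
The -s scheduling expressions in the configuration above are regular expressions matched against a string of the form T/MM/DD/d/HH, where T is the test type, d is the day of the week (1 for Monday to 7 for Sunday) and HH is the hour. The sketch below imitates that matching with grep so that an expression can be tried out before smartd is restarted. It assumes smartd requires the whole string to match, and the function name schedule_matches is hypothetical.

```shell
#!/bin/sh
# schedule_matches (hypothetical helper): succeed if a smartd -s expression
# would select the given test type at the given time.
#   $1 = regular expression as written after -s
#   $2 = test type letter (S = short, L = long)
#   $3 = timestamp as MM/DD/d/HH; defaults to the current time
# Assumption: the expression must match the whole "T/MM/DD/d/HH" string.
schedule_matches() {
    stamp=${3:-$(date +%m/%d/%u/%H)}
    printf '%s/%s\n' "$2" "$stamp" | grep -Eq "^($1)\$"
}

# Example use (not run here): would a short test be due right now?
#   schedule_matches 'S/../.././01' S && echo "short self-test due"
```

Several alternatives can be joined with | inside parentheses, so an expression such as (S/../.././01|L/../../7/02) can be checked for both test types with the same helper.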