Before installing any packages, ensure that the correct USE flags are configured so that all required functionality is available and unnecessary functionality is left out. The smartmontools package and its dependencies provide a variety of USE flags, only some of which are discussed here. As usual, feel free to add and remove USE flags at will, although the minimum set required to follow this guide in its entirety is shown below.
Once you are confident that the correct use-flags are set for the smartmontools package, and any dependencies it may require, you can proceed with the installation by issuing the emerge command shown below.
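A minimal sketch of the installation step, assuming the package is installed from the main Gentoo repository under the atom sys-apps/smartmontools. The function name install_smartmontools is a hypothetical wrapper used only for illustration; --ask and --verbose make emerge list the packages and USE flags it would merge and wait for confirmation before proceeding.

```shell
# install_smartmontools is a hypothetical wrapper, not part of Portage.
install_smartmontools() {
    # --ask: show what would be merged and prompt before proceeding.
    # --verbose: include the USE flags of each package in that listing.
    emerge --ask --verbose sys-apps/smartmontools
}

# Usage (as root): install_smartmontools
```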
Now that we have the tools required to initiate SMART tests and view their results, we can begin by checking that our hard disk drives are SMART capable (which they will be unless they are very old indeed) and that SMART support is enabled for them.
The command shown below uses the smartctl application to interrogate the first SATA, SCSI or SAS hard disk. The information displayed in our example gives an idea of what to expect, although the details will differ unless you are using the same make and model of drive.
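A minimal sketch of such a query, assuming the first disk appears as /dev/sda (adjust the device node for your system). The -i option asks smartctl to print identity information only; show_disk_info is a hypothetical helper name, not part of smartmontools.

```shell
# show_disk_info is a hypothetical helper; /dev/sda is an assumed default.
show_disk_info() {
    disk="${1:-/dev/sda}"
    # -i prints the information section, including the "SMART support is:" lines.
    smartctl -i "$disk"
}

# Usage (as root): show_disk_info /dev/sda
```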
smartctl version 5.38 [x86_64-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda ES
Device Model: ST3250620NS
Serial Number: 5QE53KR5
Firmware Version: 3.AEK
User Capacity: 250,059,350,016 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 7
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Mon Mar 22 15:23:28 2010 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
The two most significant lines of output, as far as we are concerned, are the last two. The above example shows output from a system where SMART support is both available and enabled. If the last line indicates that SMART support is disabled, but the line above indicates that SMART support is available, then the command below should be issued to enable SMART support. If, on the other hand, SMART support is unavailable, check the BIOS settings and enable SMART support there before performing the above test again.
Assuming that our target hard disk either already had SMART support enabled, or we have successfully enabled it in the above step, we can check what capabilities are offered by our particular make and model of device. The example command below shows how.
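A minimal sketch of enabling SMART, assuming the disk is /dev/sda. The plain command is smartctl -s on followed by the device node; the hypothetical helper below wraps it so that it only runs when the information section reports SMART as disabled.

```shell
# enable_smart_if_disabled is a hypothetical helper; /dev/sda is an assumed default.
enable_smart_if_disabled() {
    disk="${1:-/dev/sda}"
    # Only act when the "SMART support is:" line reports Disabled.
    if smartctl -i "$disk" | grep -q '^SMART support is:.*Disabled'; then
        # -s on switches SMART support on for the device.
        smartctl -s on "$disk"
    fi
}

# Usage (as root): enable_smart_if_disabled /dev/sda
```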
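A minimal sketch of the capabilities query, again assuming /dev/sda. The -c option makes smartctl print the drive's general SMART capabilities; show_capabilities is a hypothetical helper name.

```shell
# show_capabilities is a hypothetical helper; /dev/sda is an assumed default.
show_capabilities() {
    disk="${1:-/dev/sda}"
    # -c lists supported tests, polling times and SCT capabilities.
    smartctl -c "$disk"
}

# Usage (as root): show_capabilities /dev/sda
```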
smartctl version 5.38 [x86_64-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF READ SMART DATA SECTION ===
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 430) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 92) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
As you can see, the output from the above command is extensive. We shall not explain every line here, as there are plenty of resources available on the Internet should a detailed explanation be of interest. Instead we shall simply direct your attention to the last few lines of output, which allow us to determine what tests we can perform on our drives and what data we can expect from those tests.
The lines of most interest to us, at least within the scope of this guide, are those indicating the recommended polling time for both the short and the extended (long) self-test routines and whether the drive exposes any SCT capabilities.
Now that we know a short self-test routine is available, and that its recommended polling time is one minute, we can initiate one using the command shown below. As the output shows, smartctl also prints an estimate of how long the test will take to complete, in our case 430 seconds or about 7 minutes.
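The polling times can also be pulled out of the capabilities listing mechanically. The sketch below assumes the output layout shown above, where each "self-test routine" line is followed by a "recommended polling time" line; polling_times is a hypothetical helper that reads that output on standard input.

```shell
# polling_times is a hypothetical helper that reads `smartctl -c` output on stdin.
polling_times() {
    awk '
        # Remember which routine the next polling-time line belongs to.
        /^(Short|Extended|Conveyance) self-test routine/ { name = $1 }
        /recommended polling time:/ {
            gsub(/[^0-9]/, "", $0)      # keep only the digits of the value
            print name ": " $0 " min"
        }
    '
}

# Usage (as root): smartctl -c /dev/sda | polling_times
```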
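A minimal sketch of starting the short test, assuming the disk is /dev/sda. The -t short option asks the drive to begin its short self-test; start_short_test is a hypothetical helper name.

```shell
# start_short_test is a hypothetical helper; /dev/sda is an assumed default.
start_short_test() {
    disk="${1:-/dev/sda}"
    # The test runs inside the drive; smartctl returns immediately.
    smartctl -t short "$disk"
}

# Usage (as root): start_short_test /dev/sda
```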
smartctl version 5.38 [x86_64-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 430 seconds for test to complete.
Test will complete after Mon Mar 22 15:32:09 2010
Use smartctl -X to abort test.
As we have just initiated a short self-test routine we probably want to see what the result of that test was. The command below will display the self-test log maintained by the drive.
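A minimal sketch of reading that log, assuming the disk is /dev/sda. The -l selftest option prints the drive's self-test log; show_selftest_log is a hypothetical helper name.

```shell
# show_selftest_log is a hypothetical helper; /dev/sda is an assumed default.
show_selftest_log() {
    disk="${1:-/dev/sda}"
    # -l selftest prints one row per self-test recorded by the drive.
    smartctl -l selftest "$disk"
}

# Usage (as root): show_selftest_log /dev/sda
```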
smartctl version 5.38 [x86_64-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short Offline Completed without error 00% 37 -
Hopefully you will see something similar to the output above, indicating that the short test routine has completed without error.
Assuming that the short self-test routine completed without error, we can initiate a more thorough self-test routine with the command shown below. As you can see, this more thorough test will take considerably longer than the short self-test in the previous example.
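A minimal sketch of starting the extended test, assuming the disk is /dev/sda. The extended test is requested with -t long; the hypothetical helper start_self_test below accepts either test type and rejects anything else.

```shell
# start_self_test is a hypothetical helper; /dev/sda is an assumed default.
start_self_test() {
    test_type="$1"
    disk="${2:-/dev/sda}"
    case "$test_type" in
        # "long" selects the extended self-test routine.
        short|long) smartctl -t "$test_type" "$disk" ;;
        *) echo "usage: start_self_test short|long [disk]" >&2; return 2 ;;
    esac
}

# Usage (as root): start_self_test long /dev/sda
```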
smartctl version 5.38 [x86_64-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 92 minutes for test to complete.
Test will complete after Mon Mar 22 17:04:29 2010
Use smartctl -X to abort test.
Because the long self-test takes so long, you may want to check that it is still running and whether any errors have been found so far. The same log command as before can be issued while a test is running and will produce output similar to that below. In this example the test is in progress, about 90% of it remains and no errors have been found.
smartctl version 5.38 [x86_64-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short Offline Completed without error 00% 37 -
# 2 Extended offline Self-test routine in progress 90% 38 -
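The remaining percentage can be extracted from the log for use in a script. The sketch below assumes the log layout shown above, with the text "Self-test routine in progress" followed by a percentage; selftest_progress is a hypothetical helper that reads the log on standard input.

```shell
# selftest_progress is a hypothetical helper that reads `smartctl -l selftest` output on stdin.
selftest_progress() {
    awk '
        /Self-test routine in progress/ {
            # The remaining percentage is the only field of the form NN%.
            for (i = 1; i <= NF; i++)
                if ($i ~ /^[0-9]+%$/) { print $i " remaining"; found = 1 }
        }
        END { if (!found) print "no self-test in progress" }
    '
}

# Usage (as root): smartctl -l selftest /dev/sda | selftest_progress
```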
Once the test has completed the new log entry will be moved to the top of the list, as shown in the example below.
smartctl version 5.38 [x86_64-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 38 -
# 2 Short Offline Completed without error 00% 37 -
Now that we have performed both a short and a long self-test routine, there is one more command which may be of interest. The command shown below displays all the SMART attributes recorded by the drive in tabular form, providing much more detailed information than the simple pass or fail given in the self-test logs. The example below shows information from the same drive as above after it has been in use for some time.
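A minimal sketch of that query, assuming the disk is /dev/sda. The -A option prints the vendor-specific attribute table; show_attributes is a hypothetical helper name.

```shell
# show_attributes is a hypothetical helper; /dev/sda is an assumed default.
show_attributes() {
    disk="${1:-/dev/sda}"
    # -A prints the attribute table with VALUE, WORST, THRESH and RAW_VALUE columns.
    smartctl -A "$disk"
}

# Usage (as root): show_attributes /dev/sda
```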
smartctl version 5.38 [x86_64-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 111 097 006 Pre-fail Always - 155501215
3 Spin_Up_Time 0x0003 095 095 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 14
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 085 060 030 Pre-fail Always - 382279888
9 Power_On_Hours 0x0032 092 092 000 Old_age Always - 7461
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 14
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 055 047 045 Old_age Always - 45 (Lifetime Min/Max 36/48)
194 Temperature_Celsius 0x0022 045 053 000 Old_age Always - 45 (0 19 0 0)
195 Hardware_ECC_Recovered 0x001a 061 057 000 Old_age Always - 168854605
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0000 100 253 000 Old_age Offline - 0
202 TA_Increase_Count 0x0032 100 253 000 Old_age Always - 0
As with the output of other smartctl commands, we shall not go into great detail about the meaning of every line. In our experience the most important indicators are: Current_Pending_Sector and Offline_Uncorrectable, which if non-zero usually indicate imminent drive failure; UDMA_CRC_Error_Count, which usually indicates a faulty cable or power supply; and Temperature_Celsius, the current temperature of the drive, which can often be used to detect a failed fan or a failing disk bearing.
Now that we have tested our hard disks by hand, and verified that they are currently free from errors and that the indicators mentioned above do not already point to imminent failure, we can use the smartd daemon to perform self-tests automatically, log the results and email alerts should any of the values mentioned above change or exceed a predefined threshold.
Our first task is to move the default configuration installed by the smartmontools package out of the way, as it is more suitable as an example than as a working configuration.
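The counters named above can be checked with a short filter. The sketch below assumes the ten-column attribute table shown above, with the raw value in the last column; check_key_attributes is a hypothetical helper that reads the table on standard input, prints any of the three counters whose raw value is non-zero and returns a non-zero exit status if it finds one.

```shell
# check_key_attributes is a hypothetical helper that reads `smartctl -A` output on stdin.
check_key_attributes() {
    awk '
        BEGIN { bad = 0 }
        $2 == "Current_Pending_Sector" ||
        $2 == "Offline_Uncorrectable" ||
        $2 == "UDMA_CRC_Error_Count" {
            # Column 10 is RAW_VALUE; anything other than zero deserves attention.
            if ($10 + 0 != 0) { print $2 " raw value is " $10; bad = 1 }
        }
        END { exit bad }
    '
}

# Usage (as root): smartctl -A /dev/sda | check_key_attributes
```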
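A minimal sketch of setting the packaged file aside, assuming it lives at /etc/smartd.conf. The ".example" suffix and the helper name set_aside_smartd_conf are assumptions chosen for illustration; any name that smartd does not read will do.

```shell
# set_aside_smartd_conf is a hypothetical helper; /etc/smartd.conf is an assumed default.
set_aside_smartd_conf() {
    conf="${1:-/etc/smartd.conf}"
    # Keep the packaged file for reference under an assumed ".example" suffix.
    mv -- "$conf" "$conf.example"
}

# Usage (as root): set_aside_smartd_conf
```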
We can now create a new configuration file for the smartd daemon based on the example given below.
/dev/sda -l error -l selftest \ # Check error and selftest logs.
-H \ # Check SMART health status
-f \ # Check for "failure" of any usage attributes.
-p -u \ # Report pre-fail and usage attribute changes...
-I 190 -I 194 -I 9 \ # ...ignoring temperatures and power-on hours.
-W 0,1,50 \ # Log temperatures and email if over 50 degrees.
-s (S/../.././01|L/../../7/02) \ # Short self-test daily after 01:00, long self-test on Sunday after 02:00 (one -s directive, as smartd honours only the last)
-m root \ # Send alert emails to root
-M test # Send a test email on smartd startup.