How do you know if your NVMe SSD is healthy? You can check the health reported by different OS and SSD utilities but most of these vaguely define the health they report, if they define it at all. For example, several define health as a function of SMART attributes but leave out the details.
This post discusses how the NVMe Tools package defines NVMe health in painstaking detail. If you are not interested in the details, just know the NVMe Tools package defines health as three possible states:
Good indicates the drive is operating normally and no action is required
Suspect indicates the drive MAY NOT be operating normally and action MAY be required
Critical indicates the drive is NOT operating normally and action SHOULD be taken
The NVMe View command (viewnvme)
The NVMe Tools package has a command, viewnvme, that displays NVMe information as a web page. The screenshot below is example output of this command. It shows an NVMe drive in critical health because it has operated above the critical temperature threshold for 134 minutes. The NVMe specification states operating above this critical threshold risks failure and permanent damage. Therefore action should be taken to prevent the NVMe from operating above this critical temperature.
The above screenshot also shows the PCIe bandwidth, speed, and errors are suspect. Each of these will be discussed below.
How is health determined?
Health is determined by checking the values of several NVMe parameters. These are summarized in the table at the end of the post. Each of these "health parameter values" is classified as Critical, Suspect or Good. For example, when Percentage Used exceeds 100% it is classified as Suspect. The overall health is defined as:
Health is Critical if any Critical Parameter Value is true
Health is Suspect if any Suspect and no Critical Parameter Values are true
Health is Good if no Suspect and no Critical Health Parameter Values are true
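The three rules above amount to "report the worst state found". Here is a minimal sketch of that rule in Python. The Health enum and the overall_health function are hypothetical names used for illustration, not the actual NVMe Tools code.

```python
from enum import IntEnum


class Health(IntEnum):
    """Health states ordered from best to worst (hypothetical names)."""
    GOOD = 0
    SUSPECT = 1
    CRITICAL = 2


def overall_health(parameter_states):
    """Return the worst state among the individual parameter classifications.

    An empty list means nothing was flagged, so the result is GOOD.
    """
    return max(parameter_states, default=Health.GOOD)


# One Suspect parameter and no Critical parameters gives Suspect overall.
print(overall_health([Health.GOOD, Health.SUSPECT, Health.GOOD]).name)
```

Ordering the states as integers keeps the rule to a single max() call, so adding a new health parameter never requires touching the aggregation logic.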
The health parameters are organized into the groups shown above in the screenshot. Let's look at each of these groups and their parameters in detail. This is the painful part.
Usage
The more NVMe drives are used the closer they are to wearing out. This Usage group checks two SMART attributes that indicate the amount of NVMe use.
The Percentage Used SMART attribute is an estimate of the percentage of life used. This value can exceed 100% because it is based on an estimate of how long the drive will last. When Percentage Used exceeds 100%, NVMe Tools classifies it as Suspect so the end-user can decide whether to replace the drive.
The Available Spare SMART attribute dropping below the Available Spare Threshold SMART attribute is classified as Critical. This indicates the NVMe is wearing out, has used up most of its spare memory, and is at risk of running out of spare memory.
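The two Usage checks can be sketched as a small function. This is an illustration of the rules described above, not the NVMe Tools implementation. The function name and its parameters are assumptions, and all three inputs are taken to be the percentage values read from the SMART/health log.

```python
def classify_usage(percentage_used, available_spare, available_spare_threshold):
    """Classify the Usage group from three SMART percentages (hypothetical helper)."""
    # Spare capacity below the drive's own threshold is the more serious finding.
    if available_spare < available_spare_threshold:
        return "Critical"
    # Percentage Used is an estimate and may legitimately exceed 100.
    if percentage_used > 100:
        return "Suspect"
    return "Good"


print(classify_usage(percentage_used=104, available_spare=100, available_spare_threshold=10))
```

The spare check runs first so that a worn drive that is also short on spare memory is reported as Critical rather than Suspect.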
SMART Errors
Each of the errors below, taken from the SMART/health log, indicates a major component of the NVMe has either failed or become unreliable. NVMe Tools classifies all of these as Critical.
NVM subsystem unreliable
Persistent memory unreliable (optional beginning with standard 1.4.0)
Media in read-only (data cannot be written to the drive)
Volatile memory backup failure
Any unrecoverable data errors (greater than 0)
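In the SMART/health log the first four errors are reported as bits in the Critical Warning byte, and unrecoverable data errors are reported in the Media and Data Integrity Errors counter. The sketch below decodes those two values. The bit positions follow the NVMe Base Specification. The function name, and the assumption that the raw values have already been read from the log, are mine and not taken from NVMe Tools.

```python
# Critical Warning bit positions from the SMART / Health Information log.
# Bits 0 (available spare) and 1 (temperature) are handled by other groups.
CRITICAL_WARNING_BITS = {
    2: "NVM subsystem unreliable",
    3: "Media in read-only",
    4: "Volatile memory backup failure",
    5: "Persistent memory unreliable",
}


def smart_errors(critical_warning, media_errors):
    """Return the names of the Critical SMART errors that are present."""
    found = [
        name
        for bit, name in CRITICAL_WARNING_BITS.items()
        if critical_warning & (1 << bit)
    ]
    if media_errors > 0:
        found.append("Unrecoverable data errors")
    return found


# Bit 3 set: the media has been placed in read-only mode.
print(smart_errors(critical_warning=0b0000_1000, media_errors=0))
```

An empty list means the SMART Errors group is Good. Any entry in the list makes the group, and therefore the overall health, Critical.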
Diagnostic Self-tests
The diagnostic self-test is an optional feature for running a self-test on the drive. The NVMe Base Specification defines it as “… a diagnostic testing sequence that tests the integrity and functionality of the controller and may include testing of the media associated with namespaces.”
The diagnostic self-test is run by the NVMe owner. The results of these self-tests, up to the last 20, are reported in the Device Self-Test Log (Log Page 6). Since most NVMe owners don't know about these self-tests, it's common for the log to have no results. If there are results, any self-test failure is classified as Critical.
The Check NVMe command (checknvme)
The NVMe Tools checknvme command works the same as viewnvme except that it runs the diagnostic self-test before reading the NVMe information.
Temperature
If the composite temperature exceeds the lowest throttle threshold it is classified as Suspect. The lowest threshold can be one of the two host-controlled thresholds (TMT1, TMT2) or the Warning Composite Temperature Threshold (WCTEMP).
If the composite temperature exceeds the Critical Composite Temperature Threshold (CCTEMP) it is classified as Critical. The NVMe Base Specification states that operating above CCTEMP “... indicates a critical overheating condition (e.g., may prevent continued normal operation, possibility of data loss, automatic device shutdown, extreme performance throttling, or permanent damage)”.
At this time, NVMe Tools does not check the individual Temperature Sensor values against their over/under thresholds.
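The composite temperature checks can be sketched as follows. This is an illustration under stated assumptions, not the NVMe Tools code: the function name is hypothetical, all temperatures are in Kelvin (the unit NVMe uses for these fields), and a threshold of 0 is treated as "not set" and ignored.

```python
def classify_temperature(composite_k, cctemp_k, throttle_thresholds_k):
    """Classify the composite temperature against CCTEMP and the throttle thresholds.

    throttle_thresholds_k holds TMT1, TMT2 and WCTEMP in any order.
    Assumption: a value of 0 means the threshold is not set.
    """
    if cctemp_k > 0 and composite_k > cctemp_k:
        return "Critical"
    active = [t for t in throttle_thresholds_k if t > 0]
    # Only the lowest active throttle threshold matters for the Suspect check.
    if active and composite_k > min(active):
        return "Suspect"
    return "Good"


# 350 K is above WCTEMP (343 K) but below CCTEMP (358 K); TMT1 and TMT2 are unset.
print(classify_temperature(350, cctemp_k=358, throttle_thresholds_k=[0, 0, 343]))
```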
Time Throttled
When an NVMe drive exceeds a predetermined temperature threshold it throttles performance in an attempt to reduce the temperature. Typically, there are multiple thresholds where exceeding each subsequent threshold results in lower performance. Most systems, especially laptops, are designed to throttle under heavy IO workloads.
Since some throttling is expected, the question becomes how much throttling is acceptable. Only the NVMe owner can answer this. That said, the NVMe Tools package sets the following limits. If the NVMe is throttled more than 1% of the time it is classified as Suspect. If it is throttled more than 10% of the time it is classified as Critical.
If the amount of time operating above the critical threshold exceeds 1 minute it is classified as Suspect. If it exceeds 10 minutes it is classified as Critical.
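Here is a minimal sketch of these four limits. The function and parameter names are hypothetical, and the percentage is computed against the drive's total powered-on time, which is my assumption because the post does not say which time base NVMe Tools uses.

```python
def classify_time_throttled(throttled_minutes, total_minutes, critical_minutes):
    """Classify the Time Throttled group (hypothetical helper).

    throttled_minutes: total time spent throttled
    total_minutes:     time base for the percentage (assumed: powered-on time)
    critical_minutes:  time spent above the critical temperature threshold
    """
    percent = 100.0 * throttled_minutes / total_minutes if total_minutes > 0 else 0.0
    if percent > 10 or critical_minutes > 10:
        return "Critical"
    if percent > 1 or critical_minutes > 1:
        return "Suspect"
    return "Good"


# 134 minutes above the critical threshold is Critical even with no other throttling.
print(classify_time_throttled(throttled_minutes=0, total_minutes=6000, critical_minutes=134))
```

With these limits, the drive from the earlier screenshot (134 minutes above the critical threshold) is Critical regardless of its throttle percentage.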
PCI Express Bandwidth
PCIe bandwidth is the product of the PCIe speed and width. Running at a lower speed or width than rated results in lower PCIe bandwidth than expected. There are multiple reasons why this could happen. Some indicate a serious health problem while others indicate no health problem.
One common reason is that the NVMe drive is installed in a slower slot. For example, installing a PCIe Gen4 NVMe drive into a PCIe Gen3 slot results in the link running at PCIe Gen3 speed. This is the reason for the Suspect PCIe Speed and Bandwidth shown in the screenshot above. Since this was done intentionally by the owner (me) there is no need to take action.
Some platforms reduce PCIe speed and width to save power during periods of low IO traffic. This is done intentionally and does not indicate a health problem.
Less common reasons that do indicate a health problem include a poor electrical connection or a functional bug in the host or the NVMe drive.
In summary, a PCIe link running at lower bandwidth than rated is classified as Suspect because only the NVMe owner can determine if this is intentional.
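To put numbers on this, the sketch below estimates link bandwidth from the PCIe generation and lane count and flags a link that is running below its rating. The per-lane transfer rates and line encodings are the standard PCIe figures. The function names are hypothetical, and accounting for encoding overhead is my refinement of the simple speed-times-width product described above.

```python
# Raw per-lane transfer rate in GT/s for each PCIe generation.
GT_PER_S = {1: 2.5, 2: 5.0, 3: 8.0, 4: 16.0, 5: 32.0}


def link_bandwidth_gbytes(gen, width):
    """Approximate usable link bandwidth in GB/s for one direction."""
    # Gen1/Gen2 use 8b/10b encoding; Gen3 and later use 128b/130b.
    efficiency = 8 / 10 if gen <= 2 else 128 / 130
    return GT_PER_S[gen] * efficiency / 8 * width


def classify_link(current_gen, current_width, rated_gen, rated_width):
    """Suspect if the link runs at a lower speed or width than the drive is rated for."""
    if current_gen < rated_gen or current_width < rated_width:
        return "Suspect"
    return "Good"


# A Gen4 x4 drive in a Gen3 x4 slot gets roughly half its rated bandwidth.
print(round(link_bandwidth_gbytes(4, 4), 2), round(link_bandwidth_gbytes(3, 4), 2))
print(classify_link(current_gen=3, current_width=4, rated_gen=4, rated_width=4))
```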
Persistent Events
The Persistent Event Log is an optional log introduced in 1.4.0. This log contains timestamped events, some of which may indicate a health problem. The events below indicate a major NVMe component has failed or become unreliable, so they are all classified as Critical.
NVM subsystem unreliable
Persistent memory unreliable (optional beginning with standard 1.4.0)
Media in read-only
Volatile memory backup failure
Controller fatal status
Media and data integrity errors
PCIe errors (except correctable errors)
The following errors are classified as Suspect:
PCIe correctable errors
Correctable PCIe errors do not result in data loss and are allowed by the specification at a surprisingly frequent rate. That said, there is a small performance penalty to the host when the error handler runs. It is also possible that several PCIe correctable errors of a certain type indicate uncorrectable PCIe errors are likely to occur.
PCIe correctable errors are classified as Suspect because the end-user must decide if the rate and type of the errors warrant action. For example, in the screenshot above, the PCIe correctable errors were found to be unsupported requests, which points to a host software problem rather than a drive issue.
OS Errors
So, there are a lot of NVMe parameters to check when determining the overall health. But if you only look at these parameters you might be missing a serious health problem. At a minimum I recommend checking the OS logs (Windows System Events and Linux dmesg) for the following:
Driver timeouts (controller resets)
Disk IO errors
PCIe errors on the root port
PCIe errors (if drive doesn't support the Persistent Event Log)
At this time, NVMe Tools does not check the OS logs, leaving this up to the NVMe owner.
Actions
When the health indicates action should be taken, the following steps are recommended:
Immediately back up data and continue regular backups until the issue is resolved
Update system BIOS
Update NVMe firmware
For PCI Express errors…
Clean the gold contact fingers, reinstall the drive, and verify it is mechanically secured
Disable ASPM
For excessive or critical throttling…
Inspect system for blocked air flow and/or dislodged ducting
Replace broken fans (note broken fans can still spin)
If the above actions don’t resolve the issue it may be time to replace the drive or the host.
Health Parameter Values
PARAMETER VALUES | HEALTH | GROUP |
Percentage Used > 100% | Suspect | Usage |
Available Spare < Available Spare Threshold | Critical | Usage |
NVM Subsystem Unreliable | Critical | SMART |
Persistent Memory Unreliable | Critical | SMART |
Media Read-only | Critical | SMART |
Volatile Memory Backup Failure | Critical | SMART |
Unrecoverable Data Errors | Critical | SMART |
Self-test Fails > 0 | Critical | Self-test |
Percent throttled > 10% | Critical | Time Throttled |
Percent throttled > 1% | Suspect | Time Throttled |
Time above critical threshold > 10 minutes | Critical | Time Throttled |
Time above critical threshold > 1 minute | Suspect | Time Throttled |
PCIe speed less than rated | Suspect | PCIe Bandwidth |
PCIe width less than rated | Suspect | PCIe Bandwidth |
Composite Temperature above WCTEMP/TMT1/TMT2 | Suspect | Temperature |
Composite Temperature above CCTEMP | Critical | Temperature |
SMART Errors | Critical | Persistent Events |
PCIe Correctable Errors | Suspect | Persistent Events |
PCIe Errors (except correctable) | Critical | Persistent Events |
Media and Data Errors | Critical | Persistent Events |
Fatal Controller Errors | Critical | Persistent Events |
Driver Timeout (Controller Reset) | Critical | OS Errors |
Disk Errors | Critical | OS Errors |
Root Port or NVMe Uncorrectable Errors | Critical | OS Errors |
Root Port or NVMe PCIe Correctable Errors | Suspect | OS Errors |