Joseph Jones

NVMe SSD health

Updated: Mar 5, 2023

How do you know if your NVMe SSD is healthy? You can check the health reported by different OS and SSD utilities but most of these vaguely define the health they report, if they define it at all. For example, several define health as a function of SMART attributes but leave out the details.


This post discusses how the NVMe Tools package defines NVMe health in painstaking detail. If you are not interested in the details, just know the NVMe Tools package defines health as three possible states:

  • Good indicates the drive is operating normally and no action is required

  • Suspect indicates the drive MAY NOT be operating normally and action MAY be required

  • Critical indicates the drive is NOT operating normally and action SHOULD be taken

The NVMe View command (viewnvme)

The NVMe Tools package has a command, viewnvme, that displays NVMe information as a web page. The screenshot below is example output of this command. It shows an NVMe drive in critical health because it has operated above the critical temperature threshold for 134 minutes. The NVMe specification states operating above this critical threshold risks failure and permanent damage. Therefore action should be taken to prevent the NVMe from operating above this critical temperature.



The above screenshot also shows the PCIe bandwidth, speed, and errors are suspect. Each of these will be discussed below.


How is health determined?

Health is determined by checking the values of several NVMe parameters. These are summarized in the table at the end of the post. Each of these "health parameter values" is classified as Critical, Suspect or Good. For example, when Percentage Used exceeds 100% it is classified as Suspect. The overall health is defined as:

  • Health is Critical if any Critical health parameter value is true

  • Health is Suspect if any Suspect and no Critical health parameter values are true

  • Health is Good if no Suspect and no Critical health parameter values are true
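The three rules above amount to "report the worst state found". Here is a minimal sketch of that roll-up, not the NVMe Tools implementation; the `Health` enum and `overall_health` function are hypothetical names used only for illustration.

```python
from enum import IntEnum


class Health(IntEnum):
    """Hypothetical states, ordered so that a larger value means worse health."""
    GOOD = 0
    SUSPECT = 1
    CRITICAL = 2


def overall_health(parameter_states):
    """Return the worst state among the classified health parameter values."""
    # An empty input means nothing was flagged, so the result is Good.
    return max(parameter_states, default=Health.GOOD)


states = {
    "Percentage Used > 100%": Health.SUSPECT,
    "Composite Temperature above CCTEMP": Health.CRITICAL,
    "Self-test Fails > 0": Health.GOOD,
}
print(overall_health(states.values()).name)  # CRITICAL
```

Ordering the states numerically keeps the roll-up to a single `max` call, so a Critical value always outranks a Suspect one.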

The health parameters are organized into the groups shown above in the screenshot. Let's look at each of these groups and their parameters in detail. This is the painful part.


Usage

The more NVMe drives are used the closer they are to wearing out. This Usage group checks two SMART attributes that indicate the amount of NVMe use.


The Percentage Used SMART attribute is an estimate of the percentage of life used. This value can exceed 100% because it is based on an estimate of how long the drive will last. When Percentage Used exceeds 100%, NVMe Tools classifies it as Suspect so the end-user can decide whether to replace the drive.


The Available Spare SMART attribute dropping below the Available Spare Threshold SMART attribute is classified as Critical. This indicates the NVMe is wearing out, has used up most of its spare memory, and is at risk of running out of spare memory.
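As a rough sketch of the two Usage checks described above (the function name and argument names are hypothetical, and all three inputs are assumed to be percentages as reported in the SMART/health log):

```python
def classify_usage(percentage_used, available_spare, available_spare_threshold):
    """Classify the Usage group as 'Good', 'Suspect', or 'Critical'."""
    # Spare memory below its threshold is the more serious condition, so check it first.
    if available_spare < available_spare_threshold:
        return "Critical"
    # Percentage Used is an estimate and may legitimately exceed 100.
    if percentage_used > 100:
        return "Suspect"
    return "Good"


print(classify_usage(percentage_used=104, available_spare=100, available_spare_threshold=10))
```

With the values shown, the drive has outlived its estimated life but still has all of its spare memory, so the sketch reports Suspect.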


SMART Errors

Each of the errors below, taken from the SMART/health log, indicates a major component of the NVMe has either failed or become unreliable. NVMe Tools classifies all of these as Critical.

  • NVM subsystem unreliable

  • Persistent memory unreliable (optional beginning with standard 1.4.0)

  • Media in read-only (data cannot be written to the drive)

  • Volatile memory backup failure

  • Any unrecoverable data errors (greater than 0)
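The first four errors in this list correspond to bits of the Critical Warning byte in the SMART/health log, and the last to its Media and Data Integrity Errors counter. The sketch below decodes them, assuming the bit positions given in the NVMe Base Specification 1.4 (bit 2 through bit 5). The function and dictionary names are hypothetical, and this is not necessarily how NVMe Tools reads the log.

```python
# Assumed bit positions in the Critical Warning byte (NVMe Base Specification 1.4).
CRITICAL_WARNING_BITS = {
    2: "NVM subsystem unreliable",
    3: "Media in read-only",
    4: "Volatile memory backup failure",
    5: "Persistent memory unreliable",
}


def smart_errors(critical_warning, media_errors):
    """Return the names of the SMART errors that are present."""
    found = [
        name
        for bit, name in CRITICAL_WARNING_BITS.items()
        if critical_warning & (1 << bit)
    ]
    # Any non-zero count of unrecoverable data errors is reported as an error.
    if media_errors > 0:
        found.append("Unrecoverable data errors")
    return found


print(smart_errors(critical_warning=0b00001000, media_errors=0))
```

Any non-empty result would make the SMART Errors group Critical under the rules in this post.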

Diagnostic Self-tests

The diagnostic self-test is an optional feature for running a self-test on the drive. The NVMe Base Specification defines it as “… a diagnostic testing sequence that tests the integrity and functionality of the controller and may include testing of the media associated with namespaces.”


The diagnostic self-test is run by the NVMe owner. The results of these self-tests, up to the last 20, are reported in the Device Self-Test Log (Log Page 6). Since most NVMe owners don't know about these self-tests, it's common for the log to have no results. If there are results, any self-test failure is classified as Critical.


The Check NVMe command (checknvme)

The NVMe Tools checknvme command works the same as viewnvme except that it runs the diagnostic self-test before reading the NVMe information.


Temperature

If the composite temperature exceeds the lowest throttle threshold it is classified as Suspect. The lowest threshold can be one of the two host-controlled thresholds (TMT1, TMT2) or the Warning Composite Temperature Threshold (WCTEMP).


If the composite temperature exceeds the Critical Composite Temperature Threshold (CCTEMP) it is classified as Critical. The NVMe Base Specification says that operating above CCTEMP “... indicates a critical overheating condition (e.g., may prevent continued normal operation, possibility of data loss, automatic device shutdown, extreme performance throttling, or permanent damage)”.


At this time, NVMe Tools does not check the individual Temperature Sensor values against their over/under thresholds.
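The composite temperature checks can be sketched as follows. This is an illustration under stated assumptions, not the NVMe Tools code: the function name is hypothetical, all temperatures are assumed to be in the same unit (NVMe reports Kelvin), and a threshold of 0 or None is assumed to mean "not reported or not set".

```python
def classify_temperature(composite, cctemp, wctemp=None, tmt1=None, tmt2=None):
    """Classify the composite temperature as 'Good', 'Suspect', or 'Critical'."""
    if composite > cctemp:
        return "Critical"
    # Keep only the throttle thresholds that are actually set (assumption: 0/None = unset).
    throttle_thresholds = [t for t in (wctemp, tmt1, tmt2) if t]
    if throttle_thresholds and composite > min(throttle_thresholds):
        return "Suspect"
    return "Good"


# 346 K is above TMT1 (345 K) but below WCTEMP (348 K) and CCTEMP (358 K).
print(classify_temperature(346, cctemp=358, wctemp=348, tmt1=345))
```

Taking the minimum of the thresholds that are set matches the "lowest throttle threshold" wording above.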


Time Throttled

When an NVMe drive exceeds a predetermined temperature threshold it throttles performance in an attempt to reduce the temperature. Typically, there are multiple thresholds where exceeding each subsequent threshold results in lower performance. Most systems, especially laptops, are designed to throttle under heavy IO workloads.


Since some throttling is expected, the question becomes how much throttling is acceptable. Only the NVMe owner can answer this. That said, the NVMe Tools package sets the following limits. If the NVMe is throttled more than 1% of the time it is classified as Suspect. If it is throttled more than 10% of the time it is classified as Critical.


If the amount of time operating above the critical threshold exceeds 1 minute it is classified as Suspect. If it exceeds 10 minutes it is classified as Critical.
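A small sketch of these limits follows. The post does not say what the throttled percentage is measured against, so the sketch assumes it is throttled time divided by total powered-on time; the function and argument names are hypothetical.

```python
def classify_time_throttled(throttled_minutes, total_minutes, critical_minutes):
    """Classify the Time Throttled group as 'Good', 'Suspect', or 'Critical'."""
    # Assumption: percentage is throttled time over total powered-on time.
    percent = 100.0 * throttled_minutes / total_minutes if total_minutes else 0.0
    if percent > 10 or critical_minutes > 10:
        return "Critical"
    if percent > 1 or critical_minutes > 1:
        return "Suspect"
    return "Good"


# The drive in the screenshot spent 134 minutes above the critical threshold.
print(classify_time_throttled(throttled_minutes=0, total_minutes=6000, critical_minutes=134))
```

The 134 minutes from the screenshot example exceeds the 10 minute limit, so the sketch reports Critical regardless of the throttled percentage.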


PCI Express Bandwidth

PCIe bandwidth is the product of the PCIe speed and width. Running at a lower speed or width than rated results in lower PCIe bandwidth than expected. There are multiple reasons why this could happen. Some indicate a serious health problem while others indicate no health problem.


One common reason is the NVMe drive is installed in a lower-speed slot. For example, installing a PCIe Gen4 NVMe drive into a PCIe Gen3 slot results in the link running at PCIe Gen3 speed. This is the reason for the Suspect PCIe Speed and Bandwidth shown in the screenshot above. Since this was done intentionally by the owner (me) there is no need to take action.


Some platforms reduce PCIe speed and width to save power during periods of low IO traffic. This is done intentionally and does not indicate a health problem.


Less common reasons that indicate a health problem include a poor electrical connection or a functional bug in the host or the NVMe.


In summary, a PCIe link running at lower bandwidth than rated is classified as Suspect because only the NVMe owner can determine if this is intentional.
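To make "speed times width" concrete, the sketch below computes raw link bandwidth from the per-lane transfer rate and line encoding of each PCIe generation, then flags a link running below its rating. The names are hypothetical, and the figures are raw link bandwidth before protocol overhead, so real throughput is lower.

```python
# Per-lane transfer rate (GT/s) and line-encoding efficiency by PCIe generation.
PCIE_GEN = {
    1: (2.5, 8 / 10),      # 8b/10b encoding
    2: (5.0, 8 / 10),      # 8b/10b encoding
    3: (8.0, 128 / 130),   # 128b/130b encoding
    4: (16.0, 128 / 130),  # 128b/130b encoding
}


def bandwidth_gbytes(gen, width):
    """Raw link bandwidth in GB/s for a PCIe generation and lane count."""
    rate, efficiency = PCIE_GEN[gen]
    return rate * efficiency * width / 8  # divide by 8 to convert bits to bytes


def classify_link(rated_gen, rated_width, current_gen, current_width):
    """Suspect if the link runs at a lower speed or width than rated."""
    if current_gen < rated_gen or current_width < rated_width:
        return "Suspect"
    return "Good"


# A Gen4 x4 drive installed in a Gen3 x4 slot, as in the example above.
print(round(bandwidth_gbytes(4, 4), 2), round(bandwidth_gbytes(3, 4), 2))
print(classify_link(rated_gen=4, rated_width=4, current_gen=3, current_width=4))
```

In this example the Gen3 slot halves the raw bandwidth of the Gen4 x4 drive, which is why both speed and bandwidth show as Suspect.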


Persistent Events

The Persistent Event Log is an optional log introduced in 1.4.0. This log contains timestamped events, some of which may indicate a health problem. The events below indicate a major NVMe component has failed or become unreliable. They are all classified as Critical.

  • NVM subsystem unreliable

  • Persistent memory unreliable (optional beginning with standard 1.4.0)

  • Media in read-only

  • Volatile memory backup failure

  • Controller fatal status

  • Media and data integrity errors

  • PCIe errors (except correctable errors)

The following errors are classified as Suspect:

  • PCIe correctable errors

Correctable PCIe errors do not result in data loss and are allowed by the specification at a surprisingly frequent rate. That said, there is a small performance penalty to the host when the error handler runs. It is also possible that several PCIe correctable errors of a certain type indicate that uncorrectable PCIe errors are likely to occur.


PCIe correctable errors are classified as Suspect because the end-user must decide if the rate and type of the errors warrant action. For example, in the screenshot above, the PCIe correctable errors were found to be unsupported requests, which points to a host software problem rather than a drive issue.


OS Errors

So, there are a lot of NVMe parameters to check when determining the overall health. But if you only look at these parameters you might be missing a serious health problem. At a minimum I recommend checking the OS logs (Windows System Events and Linux dmesg) for the following:

  • Driver timeouts (controller resets)

  • Disk IO errors

  • PCIe errors on the root port

  • PCIe errors (if drive doesn't support the Persistent Event Log)

At this time, NVMe Tools does not check the OS logs, leaving it up to the NVMe owner.
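For owners who want to automate this check on Linux, here is a rough sketch that scans log lines (for example, the output of dmesg) for the kinds of messages listed above. The patterns are illustrative assumptions only: the exact wording of kernel messages varies by kernel and driver version, so they should be tuned against your own logs. The names `PATTERNS` and `scan_log` are hypothetical.

```python
import re

# Illustrative patterns only; kernel message wording varies between versions.
PATTERNS = {
    "Driver timeout": re.compile(r"nvme.*timeout", re.IGNORECASE),
    "Disk IO error": re.compile(r"I/O error.*nvme", re.IGNORECASE),
    "PCIe error": re.compile(r"\bAER\b"),
}


def scan_log(lines):
    """Group the log lines that match each pattern under the pattern's name."""
    hits = {name: [] for name in PATTERNS}
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits[name].append(line.rstrip("\n"))
    return hits
```

Passing the lines of a saved log file, or of captured dmesg output, returns a dictionary keyed by error type. An empty list for every key means none of the patterns matched, which is not proof that the log is clean.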


Actions

When the health indicates action should be taken the following steps are recommended:

  • Immediately back up data and continue regular backups until the issue is resolved

  • Update system BIOS

  • Update NVMe firmware

For PCI Express errors…

  • Clean the gold fingers, re-install the drive, and verify it is mechanically secured

  • Disable ASPM

For excessive or critical throttling…

  • Inspect system for blocked air flow and/or dislodged ducting

  • Replace broken fans (note broken fans can still spin)

If the above actions don’t resolve the issue, it may be time to replace the drive or the host.


Health Parameter Values

| PARAMETER VALUES | HEALTH | GROUP |
|---|---|---|
| Percentage Used > 100% | Suspect | Usage |
| Available Spare < Available Spare Threshold | Critical | Usage |
| NVM Subsystem Unreliable | Critical | SMART |
| Persistent Memory Unreliable | Critical | SMART |
| Media Read-only | Critical | SMART |
| Volatile Memory Backup Failure | Critical | SMART |
| Unrecoverable Data Errors | Critical | SMART |
| Self-test Fails > 0 | Critical | Self-test |
| Percent throttled > 10% | Critical | Time Throttled |
| Percent throttled > 1% | Suspect | Time Throttled |
| Time above critical threshold > 10 minutes | Critical | Time Throttled |
| Time above critical threshold > 1 minute | Suspect | Time Throttled |
| PCIe speed less than rated | Suspect | PCIe Bandwidth |
| PCIe width less than rated | Suspect | PCIe Bandwidth |
| Composite Temperature above WCTEMP/TMT1/TMT2 | Suspect | Temperature |
| Composite Temperature above CCTEMP | Critical | Temperature |
| SMART Errors | Critical | Persistent Events |
| PCIe Correctable Errors | Suspect | Persistent Events |
| PCIe Errors (except correctable) | Critical | Persistent Events |
| Media and Data Errors | Critical | Persistent Events |
| Fatal Controller Errors | Critical | Persistent Events |
| Driver Timeout (Controller Reset) | Critical | OS Errors |
| Disk Errors | Critical | OS Errors |
| Root Port or NVMe Uncorrectable Errors | Critical | OS Errors |
| Root Port or NVMe PCIe Correctable Errors | Suspect | OS Errors |




