NVIDIA DGX A100 Service Manual
$ sudo mdadm -D /dev/md0
Normally, the output would show both drives (nvme0 and nvme1) in an active sync state. The
following example output shows only nvme1 in active sync, indicating that nvme0n1 is the failed
drive.
Number   Major   Minor   RaidDevice   State
   0      259      2         0        active sync   /dev/nvme1n1p2
   -        0      0         1        removed
3. Make a note of the device name for the failed drive (nvme0 or nvme1) and the device name for
the good drive (nvme0 or nvme1).
You will need this information when rebuilding the RAID 1 array after replacing the drive.
4. Obtain the replacement from NVIDIA Enterprise Support.
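The check in the steps above can be sketched as a small filter over the `mdadm -D` output. This is only an illustration: `active_members` is a hypothetical helper name, and the sample text is copied from the example output above rather than from a live system. It prints the member devices reported as "active sync"; any expected member missing from that list is the one to replace.

```shell
#!/bin/sh
# Sketch: print the member devices that mdadm reports as "active sync".
# Reads `mdadm -D` output on stdin, so it can be tried on saved output
# without touching the array. active_members is a hypothetical helper.
active_members() {
    # The device path is the last field on each "active sync" row.
    awk '/active sync/ { print $NF }'
}

# Sample rows from the example output above (nvme0n1 has failed).
sample='Number   Major   Minor   RaidDevice   State
   0      259      2         0        active sync   /dev/nvme1n1p2
   -        0      0         1        removed'

printf '%s\n' "$sample" | active_members
```

On a live system the same filter could be fed directly, for example `sudo mdadm -D /dev/md0 | active_members`.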
10.3. Replacing the M.2 NVMe Drive
Before attempting to replace one of the M.2 NVMe drives, make sure that you have done the following:
▶ Determined the location ID of the faulty M.2 NVMe drive.
▶ Obtained the replacement M.2 NVMe drive and saved the packaging for use when returning
the faulty drive.
M.2 NVMe Drives:
▶ 40GB model
▶ PCIe Bus: 22 -> /dev/nvme1
▶ PCIe Bus: 23 -> /dev/nvme2
▶ 80GB model
▶ PCIe Bus: 22 -> /dev/nvme2
▶ PCIe Bus: 23 -> /dev/nvme3
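To cross-check the mapping above on a running system, the following sketch lists each NVMe controller with the PCI address it is attached to, using the standard Linux sysfs layout. `list_nvme_pci` is a hypothetical helper, and its optional directory argument exists only so that the loop can be tried against a test directory. It assumes that the bus numbers in the list above use the same notation as the bus field (the second colon-separated field) of the PCI address.

```shell
#!/bin/sh
# Sketch: print "<controller> <PCI address>" for every NVMe controller.
# list_nvme_pci is a hypothetical helper; the optional argument overrides
# the sysfs directory and is intended for testing only.
list_nvme_pci() {
    root=${1:-/sys/class/nvme}
    for ctrl in "$root"/nvme*; do
        # Skip the unexpanded glob when no controllers are present.
        [ -e "$ctrl/device" ] || continue
        # The "device" symlink resolves to the PCI device directory, whose
        # name has the form domain:bus:device.function.
        pci=$(basename "$(readlink -f "$ctrl/device")")
        printf '%s %s\n' "$(basename "$ctrl")" "$pci"
    done
}

list_nvme_pci
```

The output includes every NVMe controller in the system, not only the M.2 boot drives, so compare only the rows whose bus field matches the list above.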
Caution: Static-Sensitive Devices. Be sure to observe best practices for electrostatic discharge
(ESD) protection. This includes making sure personnel and equipment are connected to a common
ground, such as by wearing a wrist strap connected to the chassis ground, and placing components
on static-free work surfaces.
1. Back up any critical data to a network shared volume or some other means of backup.
2. If not already done, mark the drive as failed, then remove the failed drive from the array by issuing
the following commands (replacing <X> with the failed drive identifier, 0 or 1).
$ sudo mdadm --manage /dev/md0 --fail /dev/nvme<X>n1
$ sudo mdadm --manage /dev/md0 --remove /dev/nvme<X>n1
3. Power down the system.
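Because the fail and remove commands in step 2 act on whichever device is named, it can help to print them for review before running them. The sketch below does only that: `print_fail_remove` is a hypothetical helper that substitutes the drive identifier into the two commands exactly as written in step 2 and rejects anything other than 0 or 1. It does not run `mdadm` itself.

```shell
#!/bin/sh
# Sketch: print (do not run) the two mdadm commands from step 2 for a
# given drive identifier. print_fail_remove is a hypothetical helper.
print_fail_remove() {
    case $1 in
        0|1) ;;
        *) echo "drive identifier must be 0 or 1" >&2; return 1 ;;
    esac
    dev="/dev/nvme${1}n1"
    echo "sudo mdadm --manage /dev/md0 --fail $dev"
    echo "sudo mdadm --manage /dev/md0 --remove $dev"
}

# Example: show the commands for a failed nvme0.
print_fail_remove 0
```

Check the printed device name against the failed drive noted earlier, then run the two commands by hand.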