Chapter 12. DIMM Replacement
12.1. DIMM Replacement Overview
This is a high-level overview of the procedure to replace a dual inline memory module (DIMM) on the
DGX A100 system.
1. Use the nvsm health commands to identify the failed DIMM.
2. Get a replacement DIMM from NVIDIA Enterprise Support.
3. Shut down the system.
4. Label all motherboard tray cables and unplug them.
5. Remove the motherboard tray and place it on a solid, flat surface.
6. Remove the motherboard tray lid.
7. Use the reference diagram on the lid of the motherboard tray to identify the failed DIMM.
8. Replace the failed DIMM with the replacement DIMM.
9. Close the lid on the motherboard tray.
10. Insert the motherboard tray into the system.
11. Plug in all cables using the labels as a reference.
12. Power on the system.
13. Verify that all DIMMs are now healthy with nvsm.
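Steps 1 and 13 both depend on reading the memory alert list that nvsm prints. The following is a minimal POSIX shell sketch, not an NVIDIA-provided tool. `list_alert_targets` and `check_memory_alerts` are hypothetical helper names. The parser assumes that `nvsm show /systems/localhost/memory/alerts` lists one alert name (such as alert0) per line under a `Targets:` header, and that the next section header ends in a colon.

```shell
#!/bin/sh
# Hypothetical helper (not part of nvsm): reads the text printed by
#   sudo nvsm show /systems/localhost/memory/alerts
# on stdin and prints one alert name per line.
# ASSUMPTION: alert names appear one per line under a "Targets:" header,
# and the following section header (for example "Properties:") ends in ":".
list_alert_targets() {
    awk '
        # Start collecting after the "Targets:" header line.
        /^[ \t]*Targets:[ \t]*$/ { in_targets = 1; next }
        # Any other line ending in ":" starts a new section; stop collecting.
        in_targets && /:[ \t]*$/ { in_targets = 0; next }
        # Print the first field of each non-empty line inside Targets.
        in_targets && NF > 0     { print $1 }
    '
}

# Hypothetical wrapper: requires sudo and the nvsm CLI on the DGX system.
# Returns 0 when no memory alerts are listed, 1 otherwise.
check_memory_alerts() {
    alerts=$(sudo nvsm show /systems/localhost/memory/alerts | list_alert_targets)
    if [ -z "$alerts" ]; then
        echo "No memory alerts listed."
        return 0
    fi
    echo "Memory alerts present:"
    echo "$alerts"
    return 1
}
```

For example, run `check_memory_alerts` before step 3 to record which alerts exist, and again after step 12. A return status of 0 means the assumed `Targets:` section listed no alerts.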
12.2. Identifying the Failed DIMM
1. From the console, run the following nvsm command to identify memory alerts.
$ sudo nvsm show /systems/localhost/memory/alerts
Alerts appear under the Targets section. For example:
Targets:
alert0
2. Get specific information about the memory alert.
The following example obtains information for alert0.