Nvidia DGX H100 Service Manual

Nvidia DGX H100
146 pages
Identifying the Failed NVMe from the Console
To identify the failed data drive, you can use the nvsm command:
sudo nvsm show health
Review the command output and look for drive alerts to identify the failed drive.
Alternatively, you can use the BMC web user interface to access the Sensor screen, the IPMI event
log, and the System log to identify issues with the U.2 drives.
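
As a rough aid for scanning long health output, the sketch below keeps only lines that mention drives. It assumes that relevant lines contain the text "nvme" or "drive"; the source does not show the actual format of the health output, so treat the pattern as a starting point. The function name filter_drive_lines is hypothetical.

```shell
# Keep only lines that mention NVMe devices or drives (case-insensitive).
# Assumption: drive alerts contain "nvme" or "drive"; adjust the pattern
# to match what your system actually prints.
filter_drive_lines() {
    # grep exits non-zero when nothing matches; "|| true" keeps the
    # function usable in scripts that run with "set -e".
    grep -iE 'nvme|drive' || true
}
```

On a DGX system this could be used as: sudo nvsm show health | filter_drive_lines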
6.3. Identifying the NVMe Manufacturer and Model
Use the nvsm command to display the drive information:
sudo nvsm show /systems/localhost/storage/drives/nvmeXn1
Replace X in the preceding command with the number that corresponds to the Linux device
name for the failed drive.
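
The substitution described above can be sketched as a small helper that builds the nvsm target path from the drive number. The function name nvsm_drive_target is hypothetical, and the sketch assumes the n1 namespace suffix used in the command above.

```shell
# Build the nvsm target path for a given NVMe device number.
# Example: the number 5 corresponds to the Linux device name nvme5n1.
nvsm_drive_target() {
    case "$1" in
        # Reject empty input and anything that is not a plain number.
        ''|*[!0-9]*)
            echo "usage: nvsm_drive_target <number>" >&2
            return 1
            ;;
    esac
    printf '/systems/localhost/storage/drives/nvme%sn1\n' "$1"
}
```

On a DGX system this could be used as: sudo nvsm show "$(nvsm_drive_target 5)"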
Example Output
/systems/localhost/storage/drives/nvme5n1
Properties:
    PhysicalLocation_Info = SlotU.2_Slot3
    BlockSizeBytes = 512
    SerialNumber = 22L0A01WT2N8
    Model = KCM6DRUL3T84
    Revision = 0107
    Manufacturer = KIOXIA Corporation
    Status_State = Enabled
    Status_Health = OK
    Name = nvme5n1
    MediaType = SSD
    EncryptionStatus = Unlocked
    CapacityBytes = 3840755982336
    Id = nvme5n1
Targets:
Verbs:
    cd
    set
    show
Refer to the Manufacturer and Model fields in the output. Request a replacement NVMe from
NVIDIA Enterprise Support, specifying this information.
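
To pull just the two fields needed for the support request, a minimal sketch is to pipe the command output through awk. It assumes the "Key = Value" layout shown in the example output; the function name extract_drive_fields is hypothetical.

```shell
# Read nvsm drive output on stdin and print the Manufacturer and Model.
# Assumption: each property is printed as "Key = Value", optionally
# indented, as in the example output.
extract_drive_fields() {
    awk -F' = ' '
        $1 ~ /^[[:space:]]*Manufacturer$/ { manufacturer = $2 }
        $1 ~ /^[[:space:]]*Model$/        { model = $2 }
        END {
            printf "Manufacturer: %s\nModel: %s\n", manufacturer, model
        }
    '
}
```

On a DGX system this could be used as: sudo nvsm show /systems/localhost/storage/drives/nvme5n1 | extract_drive_fields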
Chapter 6. U.2 NVMe Cache Drive Replacement

Nvidia DGX H100 Specifications

General
GPU: 8x NVIDIA H100 Tensor Core GPUs
GPU Memory: 640 GB HBM3 (80 GB per GPU)
System Memory: 2 TB DDR5
Form Factor: 8U rackmount
Storage: 30 TB NVMe SSD
Power Supply: 10 kW
Interconnect: NVLink 4.0
