Nvidia DGX-1 User Manual

Maintaining and Servicing the NVIDIA DGX-1
www.nvidia.com
NVIDIA DGX-1 DU-08033-001 _v13.1|96
The output should be a list of ib_ and mlx_ driver components.
Example:
ib_ucm 20480 0
ib_ipoib 131072 0
ib_cm 45056 3 rdma_cm,ib_ucm,ib_ipoib
ib_uverbs 73728 2 ib_ucm,rdma_ucm
ib_umad 24576 0
mlx5_ib 192512 0
mlx4_ib 192512 0
ib_sa 36864 5 rdma_cm,ib_cm,mlx4_ib,rdma_ucm,ib_ipoib
ib_mad 57344 4 ib_cm,ib_sa,mlx4_ib,ib_umad
ib_core 143360 13 rdma_cm,ib_cm,ib_sa,iw_cm,nv_peer_mem,mlx4_ib,mlx5_ib,ib_mad,ib_ucm,ib_umad,ib_uverbs,rdma_ucm,ib_ipoib
ib_addr 20480 3 rdma_cm,ib_core,rdma_ucm
ib_netlink 16384 3 rdma_cm,iw_cm,ib_addr
mlx4_core 344064 2 mlx4_en,mlx4_ib
mlx5_core 524288 1 mlx5_ib
mlx_compat 16384 18 rdma_cm,ib_cm,ib_sa,iw_cm,mlx4_en,mlx4_ib,mlx5_ib,ib_mad,ib_ucm,ib_netlink,ib_addr,ib_core,ib_umad,ib_uverbs,mlx4_core,mlx5_core,rdma_ucm,ib_ipoib
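Checking a long module listing by eye is error-prone, so the comparison can be scripted. The following is a minimal sketch, not part of the NVIDIA procedure: the function name `missing_ib_modules` is hypothetical, and the default module list is an assumption taken from the example listing above, so adjust it to the card and OFED release actually installed.

```shell
# Sketch: read lsmod-style output on stdin and print every expected module
# that is NOT present. Prints nothing when all expected modules are loaded.
# The default list below is an assumption based on the example listing.
missing_ib_modules() {
    ib_required="${1:-ib_core ib_ipoib ib_umad ib_uverbs mlx5_core mlx5_ib}"
    awk -v required="$ib_required" '
        { loaded[$1] = 1 }                  # column 1 is the module name
        END {
            n = split(required, want, " ")
            for (i = 1; i <= n; i++)
                if (!(want[i] in loaded))
                    print want[i]           # expected but not loaded
        }'
}
```

Typical use would be `lsmod | missing_ib_modules`. Any module name it prints is one to investigate before continuing.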
3. Verify that the OFED software was installed correctly.
$ modinfo mlx5_core | grep -i version | head -1
Example output:
Version : 3.4-1.0.0
DGX-1 OS release 1.0 should have OFED software 3.2.
DGX-1 OS release 2.0 should have OFED software 3.4.
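The release-to-OFED pairing above can be checked mechanically. This is a sketch under stated assumptions: the function names `expected_ofed` and `ofed_matches_release` are hypothetical, only the two pairings given above are encoded, and it assumes the driver version string begins with the OFED major.minor number, as in the example output.

```shell
# Print the OFED major.minor expected for a DGX-1 OS release, using only the
# two pairings stated in this step. Unknown releases return status 1.
expected_ofed() {
    case "$1" in
        1.0) echo "3.2" ;;
        2.0) echo "3.4" ;;
        *)   return 1 ;;
    esac
}

# Succeed (status 0) when a driver version such as "3.4-1.0.0" starts with
# the OFED major.minor expected for the given OS release.
# Status 1 means mismatch; status 2 means the release is not in the table.
ofed_matches_release() {
    ofed_want="$(expected_ofed "$1")" || return 2
    case "$2" in
        "$ofed_want"|"$ofed_want"-*|"$ofed_want".*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example, `ofed_matches_release 2.0 "$(modinfo mlx5_core | grep -i version | head -1 | awk '{print $NF}')"` would succeed on a release 2.0 system that reports 3.4-1.0.0.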
4. Restart the InfiniBand services so that the new card is recognized.
a) Restart the InfiniBand service.
$ sudo service openibd restart
b) Restart the Subnet Manager (OpenSM) service.
$ sudo service opensmd restart
c) Verify that both services have started.
$ service openibd status
openibd start/running
$ service opensmd status
OpenSM is running...
d) If the services do not start, verify the following:
- The drivers are loaded according to step 3.
- The associated cables are connected to the InfiniBand ports.
- The port state reported by ibstat is as expected (refer to step 7).
- No errors are reported in /var/log/syslog.
If these steps do not indicate a problem and yet the services still do not start,
contact NVIDIA Enterprise Support and obtain an RMA for the card.
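The syslog check in step 4d can be narrowed to InfiniBand-related lines. The following is only a sketch: the function name `ib_syslog_errors` is hypothetical, and both keyword patterns are assumptions about how relevant messages are worded, so treat an empty result as a hint rather than proof that nothing was logged.

```shell
# Sketch: read log text on stdin and print lines that mention an InfiniBand
# component AND contain "error" or "fail" (case-insensitive).
# Both patterns are assumptions; widen them if messages are being missed.
ib_syslog_errors() {
    grep -iE '(mlx[45]_|openibd|opensm|infiniband)' | grep -iE '(error|fail)'
}
```

Typical use would be `ib_syslog_errors < /var/log/syslog`, which may require sudo depending on the file permissions.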
5. Verify the firmware version.
$ cat /sys/class/infiniband/mlx5*/fw_ver
Example output:


Nvidia DGX-1 Specifications

General
RAM: 512 GB DDR4
Form Factor: 4U Rackmount
CPU: 2 x Intel Xeon E5-2698 v4
Storage: 4 x 1.92 TB NVMe SSD
Network: Dual 10 GbE
Interconnect: NVIDIA NVLink
Power Supply: 3200W (redundant)
Operating System: Ubuntu Linux with NVIDIA DGX software stack
Dimensions: 19 in
GPU: 8x NVIDIA Tesla P100 GPUs

Summary

Introduction to the NVIDIA DGX-1 Deep Learning System

Hardware Specifications

Detailed specifications of the DGX-1's hardware components.

Installation and Setup

Installing the DGX-1 Into a Rack

Procedures for mounting the DGX-1 into a server rack.

Connecting the Network Cables

Guide for connecting Ethernet and IPMI network cables.

Setting Up the DGX-1

Initial system configuration process after powering on the DGX-1.

Preparing for Using Docker Containers

Installing Docker and NVIDIA Docker on DGX OS Server Software 2.x or Earlier

Installing Docker and NVIDIA Docker for older DGX OS versions.

Configuring Docker IP Addresses

Configuring Docker network IP addresses to prevent conflicts.

Configuring and Managing the DGX-1

Using the BMC

Introduction to the Baseboard Management Controller (BMC) for system management.

Configuring a Static IP Address for the BMC

Setting a static IP address for the BMC.

Configuring Static IP Addresses for the Network Ports

Setting static IP addresses for network interfaces.

Maintaining and Servicing the NVIDIA DGX-1

Restoring the DGX-1 Software Image

Procedures to restore the DGX-1 software image to factory settings.

Updating the System BIOS

Remotely updating the system BIOS via the BMC.

Updating the BMC

Remotely updating the BMC firmware using the IPMI port.

Replacing the System and Components

Overview of replaceable components and RMA process.

Installing Software on Air-Gapped NVIDIA DGX-1 Systems

Safety

Safety Warnings and Cautions

General safety warnings and cautions for DGX-1 operation.
