Maintaining and Servicing the NVIDIA DGX-1
www.nvidia.com
NVIDIA DGX-1 DU-08033-001_v13.1
The output should be a list of ib_ and mlx_ driver components.
Example:
ib_ucm 20480 0
ib_ipoib 131072 0
ib_cm 45056 3 rdma_cm,ib_ucm,ib_ipoib
ib_uverbs 73728 2 ib_ucm,rdma_ucm
ib_umad 24576 0
mlx5_ib 192512 0
mlx4_ib 192512 0
ib_sa 36864 5 rdma_cm,ib_cm,mlx4_ib,rdma_ucm,ib_ipoib
ib_mad 57344 4 ib_cm,ib_sa,mlx4_ib,ib_umad
ib_core 143360 13 rdma_cm,ib_cm,ib_sa,iw_cm,nv_peer_mem,mlx4_ib,mlx5_ib,ib_mad,ib_ucm,ib_umad,ib_uverbs,rdma_ucm,ib_ipoib
ib_addr 20480 3 rdma_cm,ib_core,rdma_ucm
ib_netlink 16384 3 rdma_cm,iw_cm,ib_addr
mlx4_core 344064 2 mlx4_en,mlx4_ib
mlx5_core 524288 1 mlx5_ib
mlx_compat 16384 18 rdma_cm,ib_cm,ib_sa,iw_cm,mlx4_en,mlx4_ib,mlx5_ib,ib_mad,ib_ucm,ib_netlink,ib_addr,ib_core,ib_umad,ib_uverbs,mlx4_core,mlx5_core,rdma_ucm,ib_ipoib
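The module check above can also be scripted. The following is a minimal sketch, assuming a POSIX shell with awk available; the function name list_ib_mlx_modules is hypothetical and is not part of any NVIDIA or Mellanox tool. It reads lsmod-style text on standard input, prints the names of modules that begin with ib_ or mlx, and returns a non-zero status if none are found.

```shell
# Hypothetical helper (not an NVIDIA or Mellanox tool): filter lsmod-style
# text on stdin for InfiniBand-related modules.
# Prints matching module names, one per line; returns non-zero if none match.
list_ib_mlx_modules() {
    awk '$1 ~ /^(ib_|mlx)/ { print $1; found = 1 } END { exit found ? 0 : 1 }'
}

# Typical use on the system (shown as a comment, not executed here):
#   lsmod | list_ib_mlx_modules || echo "no ib_/mlx modules loaded" >&2
```

Reading from standard input rather than calling lsmod inside the function keeps the helper usable against saved output as well as a live system.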
3. Verify that the OFED software was installed correctly.
$ modinfo mlx5_core | grep -i version | head -1
Example output:
Version : 3.4-1.0.0
DGX-1 OS release 1.0 should have OFED software 3.2.
DGX-1 OS release 2.0 should have OFED software 3.4.
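The release-to-OFED pairing above can be checked with a short script. This is a sketch under stated assumptions: both function names are hypothetical, the mapping covers only the two releases listed above, and the version string is assumed to begin with the OFED major.minor number as in the example output.

```shell
# Hypothetical helper: map a DGX-1 OS release to the OFED major.minor version
# listed above (1.0 -> 3.2, 2.0 -> 3.4). Other releases are not covered.
expected_ofed_for_release() {
    case "$1" in
        1.0) echo "3.2" ;;
        2.0) echo "3.4" ;;
        *)   return 1 ;;
    esac
}

# Hypothetical helper: succeed if a version string such as "3.4-1.0.0"
# begins with the major.minor expected for the given OS release.
# Returns 2 if the release is unknown, 1 on a mismatch.
ofed_matches_release() {
    release=$1
    version=$2
    expected=$(expected_ofed_for_release "$release") || return 2
    case "$version" in
        "$expected"|"$expected"-*|"$expected".*) return 0 ;;
        *) return 1 ;;
    esac
}

# Typical use on the system (shown as a comment, not executed here):
#   ofed_matches_release 2.0 "$(modinfo -F version mlx5_core)" \
#       || echo "OFED version does not match the OS release" >&2
```

The usage comment relies on modinfo -F version, which prints only the version field and avoids parsing the full modinfo listing.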
4. Restart the InfiniBand services so that the new card is recognized.
a) Restart the InfiniBand service.
$ sudo service openibd restart
b) Restart the Subnet Manager (OpenSM) service.
$ sudo service opensmd restart
c) Verify that both services have started.
$ service openibd status
openibd start/running
$ service opensmd status
OpenSM is running...
d) If the services do not start, verify:
‣ That the drivers are loaded according to step 3.
‣ That the associated cables are connected to the InfiniBand ports.
‣ The state of ibstat (refer to step 7).
‣ Whether errors are reported in /var/log/syslog.
If these steps do not indicate a problem and yet the services still do not start,
contact NVIDIA Enterprise Support and obtain an RMA for the card.
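The status check in step 4c can be sketched as a small helper. This is an illustration under stated assumptions: the function name status_is_running is hypothetical, and the matching rules are based only on the example status strings shown above ("openibd start/running", "OpenSM is running..."), so status text on other releases may need different patterns.

```shell
# Hypothetical helper: succeed if status text on stdin reports a running
# service. The patterns are assumptions drawn from the example output above;
# any line containing "not running" or "stop" is treated as a failure.
status_is_running() {
    awk '
        { line = tolower($0) }
        line ~ /not running|stop/ { bad = 1 }
        line ~ /running/          { ok = 1 }
        END { exit (ok && !bad) ? 0 : 1 }
    '
}

# Typical use on the system (shown as a comment, not executed here):
#   for svc in openibd opensmd; do
#       service "$svc" status | status_is_running \
#           || echo "$svc is not running; check drivers, cabling, ibstat, and /var/log/syslog" >&2
#   done
```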
5. Verify the firmware version.
$ cat /sys/class/infiniband/mlx5*/fw_ver
Example output: