Nvidia DGX-2 - Page 93
147 pages
gpu_index

Number of GPUs    Allowed values for gpu_index
1                 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
2                 0, 2, 4, 6, 8, 10, 12, 14
4                 0, 4, 8, 12
8                 0, 8
16                0
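The allowed values in the table above appear to follow a simple pattern: for a given number of GPUs, each allowed gpu_index is a multiple of that number, up to the 16 GPUs covered by the table. The sketch below regenerates the table under that assumption. The pattern is inferred from the listed values and is not stated in the manual, and the names TOTAL_GPUS and allowed_gpu_indexes are hypothetical helpers, not part of nvidia-vm.

```python
# Assumption: the table covers 16 GPUs and lists only these GPU counts.
TOTAL_GPUS = 16
SUPPORTED_COUNTS = (1, 2, 4, 8, 16)


def allowed_gpu_indexes(gpu_count):
    """Return the gpu_index values the table lists for gpu_count GPUs."""
    if gpu_count not in SUPPORTED_COUNTS:
        raise ValueError("GPU count not listed in the table: %r" % (gpu_count,))
    # Inferred rule: indexes are multiples of the GPU count below TOTAL_GPUS.
    return list(range(0, TOTAL_GPUS, gpu_count))


if __name__ == "__main__":
    for count in SUPPORTED_COUNTS:
        values = ",".join(str(i) for i in allowed_gpu_indexes(count))
        print("%-3d %s" % (count, values))
```

A GPU count that the table does not list (for example 3) raises ValueError rather than guessing at a value.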
--image        Managing the Images
--user-data    Using cloud-init to Initialize the Guest VM
--meta-data    Using cloud-init to Initialize the Guest VM
[options]
Command Help:
[sudo] nvidia-vm create --help
Command Examples:
sudo nvidia-vm create --gpu-count 4 --gpu-index 12
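As an illustration of how the example above could be assembled in a script, the following sketch builds the argument list for nvidia-vm create and rejects a GPU count and gpu_index pair that the table of allowed values does not list. It uses only the --gpu-count and --gpu-index options shown in the example. The function name build_create_command and the validation rule (multiples of the GPU count below 16) are assumptions made for this sketch, not part of the tool. The script only prints the command and does not run it.

```python
def build_create_command(gpu_count, gpu_index, use_sudo=True):
    """Build the argv list for 'nvidia-vm create' (hypothetical helper)."""
    if gpu_count not in (1, 2, 4, 8, 16):
        raise ValueError("unsupported GPU count: %r" % (gpu_count,))
    # Assumed rule, inferred from the allowed-values table.
    if gpu_index not in range(0, 16, gpu_count):
        raise ValueError(
            "gpu_index %r is not allowed for %d GPUs" % (gpu_index, gpu_count)
        )
    argv = [
        "nvidia-vm", "create",
        "--gpu-count", str(gpu_count),
        "--gpu-index", str(gpu_index),
    ]
    return ["sudo"] + argv if use_sudo else argv


if __name__ == "__main__":
    # Reproduces the documented example as a single command line.
    print(" ".join(build_create_command(4, 12)))
```

The returned list can be passed to subprocess.run on a KVM host. Checking the pair first catches a mismatched gpu_index before the command is issued.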
Table of Contents
Chapter 1. Introduction to the NVIDIA DGX-2/2H System
9
1.1 About this Document
10
1.2 Hardware Overview
10
1.2.1 Major Components
10
1.2.2 Other Components not in Exploded View
11
1.2.3 Mechanical Specifications
12
1.2.4 Power Specifications
12
1.2.4.1 Support for Degraded Power
12
1.2.4.2 DGX-2 Power Cord
12
1.2.5 Environmental Specifications
13
1.2.6 Front Panel Connections and Controls
14
1.2.7 Rear Panel Connections and Controls
15
1.2.7.1 With EMI Shield Installed
15
1.2.7.2 With EMI Shield Removed
15
1.2.8 Motherboard Tray Ports and Controls
16
1.3 Network Ports
17
1.4 InfiniBand Cables
18
1.5 Recommended Ports to Use for External Storage
19
1.6 DGX OS Software
20
1.7 Additional Documentation
20
1.8 Customer Support
21
1.8.1 NVIDIA Enterprise Support Portal
21
1.8.2 NVIDIA Enterprise Support Email
21
1.8.3 NVIDIA Enterprise Support - Local Time Zone Phone Numbers
21
Chapter 2. Connecting to the DGX-2 Console
22
2.1 Direct Connection
23
2.2 Remote Connection through the BMC
24
2.3 SSH Connection
26
Chapter 3. Setting Up the DGX-2 System
27
Chapter 4. Quick Start Instructions
30
4.1 Registration
30
4.2 Installation and Configuration
31
4.3 Obtaining an NVIDIA GPU Cloud Account
31
4.4 Turning the DGX-2 On and OFF
31
4.4.1 Startup Considerations
31
4.4.2 Shutdown Considerations
32
4.5 Verifying Basic Functionality
32
4.6 Running NGC Containers with GPU Support
33
4.6.1 Using Native GPU Support
33
4.6.2 Using the NVIDIA Container Runtime for Docker
34
Chapter 5. Network Configuration
36
5.1 BMC Security
36
5.2 Configuring Network Proxies
36
5.2.1 For the OS and Most Applications
36
5.2.2 For apt
37
5.2.3 For Docker
37
5.3 Configuring Docker IP Addresses
37
5.4 Opening Ports
38
5.5 Connectivity Requirements
39
5.6 Configuring Static IP Address for the BMC
39
5.6.1 Configuring a BMC Static IP Address Using ipmitool
40
5.6.2 Configuring a BMC Static IP Address Using the System BIOS
40
5.6.3 Configuring a BMC Static IP Address Using the BMC Dashboard
43
5.7 Configuring Static IP Addresses for the Network Ports
44
5.8 Switching Between InfiniBand and Ethernet
46
5.8.1 Starting the Mellanox Software Tools
46
5.8.2 Determining the Current Port Configuration
48
5.8.3 Switching the Port from InfiniBand to Ethernet
49
5.8.4 Switching the Port from Ethernet to InfiniBand
50
Chapter 6. Configuring Storage - NFS Mount and Cache
52
Chapter 7. Special Features and Configurations
54
7.1 Setting MaxQ/MaxP
54
7.1.1 MaxQ
54
7.1.2 MaxP
55
7.1.3 Determining GPU Power Mode
55
7.2 Managing the DGX Crash Dump Feature
56
7.2.1 Using the Script
56
7.2.2 Connecting to Serial Over LAN
56
7.3 Using PCIe Access Control Services
57
7.4 Managing CPU Mitigations
57
7.4.1 Determining the CPU Mitigation State of the System
57
7.4.2 Disabling CPU mitigations
58
7.4.3 Re-enabling CPU Mitigations
59
7.5 Using the DGX System in GPU Degraded Mode
59
Chapter 8. Restoring the DGX-2 Software Image
61
8.1 Obtaining the DGX-2 Software ISO Image and Checksum File
62
8.2 Re-Imaging the System Remotely
62
8.3 Creating a Bootable Installation Medium
63
8.3.1 Creating a Bootable USB Flash Drive by Using the dd Command
63
8.3.2 Creating a Bootable USB Flash Drive by Using Akeo Rufus
65
8.4 Re-Imaging the System From a USB Flash Drive
66
8.5 Retaining the RAID Partition While Installing the OS
67
Chapter 9. Updating the DGX OS Software
68
9.1 Connectivity Requirements For Software Updates
68
9.2 Update Instructions
69
Chapter 10. Updating Firmware
70
10.1 General Firmware Update Guidelines
70
10.2 Obtaining the Firmware Update Container
71
10.3 Querying the Firmware Manifest
72
10.4 Querying the Currently Installed Firmware Versions
72
10.5 Updating the Firmware
73
10.5.1 Command Syntax
73
10.5.2 Updating All Firmware components
73
10.5.3 Updating Specific Firmware Components
75
10.6 Additional Options
76
10.6.1 Forcing the Firmware Update
76
10.6.2 Updating the Firmware Non-interactively
76
10.7 Command Summary
76
10.8 Removing the Container
77
10.9 Using the .run File
77
10.10 Updating Secondary Firmware Images
78
10.10.1 Updating the Secondary BMC
78
10.10.2 Updating the Secondary SBIOS
78
10.11 Troubleshooting
79
10.11.1 Redundant PSU fails to update
79
Chapter 11. Using the BMC
80
11.1 Connecting to the BMC
80
11.2 Overview of BMC Controls
80
11.2.1 QuickLinks …
81
11.2.2 Sensor
82
11.2.3 FRU Information
82
11.2.4 Logs & Reports
82
11.2.5 Settings
82
11.2.6 Remote Control
82
11.2.7 Power Control
82
11.2.8 Maintenance
83
11.3 Creating a Unique BMC Password
84
11.4 Updating the SBIOS
85
Chapter 12. Using DGX-2 System in KVM Mode
87
12.1 Overview
87
12.1.1 About NVIDIA KVM
87
12.1.2 About the Guest GPU VM (Features and Limitations)
89
12.1.3 About nvidia-vm
89
12.2 Converting the DGX-2 System to a DGX-2 KVM Host
90
12.2.1 Getting Updated KVM Packages
91
12.2.2 Restoring to Bare Metal
91
12.3 Launching a Guest GPU VM Instance
91
12.3.1 Determining the Guest GPU VMs on the DGX-2 System
92
12.3.2 Creating a VM Using Available GPUs
92
12.3.2.1 Using cloud-init to Initialize the Guest VM
94
12.4 Stopping, Restarting, and Deleting a Guest GPU VM
95
12.4.1 Shutting Down a VM
95
12.4.2 Starting an Inactive VM
95
12.4.3 Deleting a VM
96
12.4.4 Stopping a VM
96
12.4.5 Rebooting a VM
97
12.5 Connecting to Your Guest GPU VM
97
12.5.1 Determining IP Addresses
97
12.5.2 Connecting to the Guest GPU VM
97
12.6 Making Your VM More Secure
98
12.6.1 Changing Login Credentials
98
Using cloud-init
98
12.6.2 Adding SSH Keys
99
Using cloud-init
99
12.7 Managing Images
99
12.7.1 Installing Images
99
12.7.2 Viewing a List of Installed Images
100
12.7.3 Viewing Image Usage
100
12.7.4 Uninstalling Images
101
12.8 Using the Guest OS Drives and Data Drives
101
12.8.1 Guest OS Drive
102
12.8.2 Data Drives
102
12.8.3 Storage Pool Examples
102
12.9 Network Configuration
104
Macvtap (bridged mode)
104
Macvtap (VEPA mode)
104
Private Network
104
12.10 Updating the Software
105
12.10.1 Updating the Host OS
105
12.10.2 Updating the Guest VM OS
106
12.11 Supplemental Information
107
12.11.1 Resource Allocations
107
12.11.2 Resource Management
107
12.11.3 NVIDIA KVM Security Considerations
108
12.11.4 Launching VMs in Degraded Mode
108
12.11.4.1 When the DGX-2 is Put in Degraded Mode
108
12.11.4.2 Performing a GPU Health Check
108
12.11.4.3 Getting GPU Health Information from Within the VM
109
12.11.4.4 Creating VMs with the DGX-2 System in Degraded Mode
110
12.11.4.5 Restarting a VM After the System or VM Crashes
111
12.11.4.6 Restoring a System from Degraded Mode
111
12.12 Troubleshooting Tools
111
12.12.1 Reporting Issues and Collecting Information
112
12.12.2 How to Detect Guest Launch/Shutdown Issues
112
12.12.2.1.1 virt-install.log
112
12.12.2.1.2 Guest VM log file
112
12.12.3 How to Detect Guest Networking Issues
113
12.12.3.1.1 Verify Enabled Networks
113
12.12.3.1.2 Verify Guest VM IP Address
113
12.12.3.1.3 Configuring VMs for Host-to-VM Network Connectivity
113
12.12.4 How to Detect GPU Issues in a Guest VM
114
12.12.5 How to Detect Storage Issues in a Guest VM or on the KVM Host
114
12.12.5.1 How to Check Storage Health in the KVM Host
114
12.12.5.1.1 To check health of SSDs:
114
12.12.5.1.2 To check health of the RAID Array:
115
12.12.5.1.3 To check health of the KVM Storage Pools:
116
12.12.5.2 How to Check Storage Health in KVM Guests
116
12.12.6 Using libguestfs-tools to Gather Log Files and Perform Console Scraping
117
12.12.6.1.1 Partial List of tools for checking log files
117
12.12.6.1.2 Partial List of tools for checking block devices and filesystems
117
12.12.6.1.3 Examples
117
12.12.7 Known Issues
118
12.12.8 Reference Resources
118
Chapter 13. Replacing Components
119
Chapter 14. Security
120
Chapter 15. Secure Data Deletion of SSDs
121
Appendix A. Installing Software on Air-gapped DGX-2 Systems
123
A.1 Installing NVIDIA DGX-2 Software
123
A.2 Re-Imaging the System
123
A.3 Creating a Local Mirror of the NVIDIA and Canonical Repositories
124
A.3.1 Create the Mirror in a DGX OS 4 System
124
A.3.2 Configure the Target System
126
A.4 Installing Docker Containers
129
Appendix B. Supplemental KVM Information
131
B.1 Using Cloud-init
131
Setting Up the Cloud-Config File
131
Setting Up Instance-Data
132
Appendix C. Safety
133
C.1 Safety Information
133
C.2 Safety Warnings and Cautions
134
C.3 Intended Application Uses
135
C.4 Site Selection
135
C.5 Equipment Handling Practices
135
C.6 Electrical Precautions
136
C.6.1 Power and Electrical Warnings
136
C.6.2 Power Cord Warnings
136
C.7 System Access Warnings
137
C.8 Rack Mount Warnings
138
C.9 Electrostatic Discharge (ESD)
138
C.10 Other Hazards
139
C.10.1 CALIFORNIA DEPARTMENT OF TOXIC SUBSTANCES CONTROL:
139
C.10.2 NICKEL
139
C.10.3 Battery Replacement
139
C.10.4 Cooling and Airflow
140
Appendix D. Compliance
141
D.1 United States
141
D.2 United States / Canada
141
D.3 Canada
142
D.4 CE
142
D.5 Japan
143
D.6 Australia and New Zealand
145
D.7 China
146