Monitoring
Thermal Sensors
The DPU incorporates the DPU SoC, which operates at temperatures between 0°C and
105°C.
Three thermal threshold definitions impact the overall system operation state:
Warning – 105°C (managed systems only): When the device crosses the 105°C
threshold, the management software issues a Warning Threshold message to notify
the system administrator that the card has crossed the warning threshold. This
threshold neither requires nor triggers any hardware action (such as DPU
shutdown).
Critical – 115°C: When the device crosses this temperature, the firmware
automatically shuts down the device.
Emergency – 130°C: If the firmware fails to shut down the device upon crossing the
critical threshold, the device automatically shuts down upon crossing the
emergency (130°C) threshold.
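The three thresholds above can be expressed as a simple classification of a temperature reading. The following is a minimal sketch for host-side monitoring scripts, not part of the DPU firmware or management software; the names `ThermalState` and `classify_temperature` are hypothetical, and treating a reading exactly equal to a threshold as having crossed it is an assumption.

```python
from enum import Enum

# Threshold values taken from the text above, in degrees Celsius.
WARNING_C = 105
CRITICAL_C = 115
EMERGENCY_C = 130


class ThermalState(Enum):
    NORMAL = "normal"
    WARNING = "warning"      # message only, no hardware action
    CRITICAL = "critical"    # firmware shuts down the device
    EMERGENCY = "emergency"  # device shuts down if firmware did not


def classify_temperature(temp_c: float) -> ThermalState:
    """Map an SoC temperature reading to the highest threshold it has reached."""
    # Assumption: a reading equal to a threshold counts as having crossed it.
    if temp_c >= EMERGENCY_C:
        return ThermalState.EMERGENCY
    if temp_c >= CRITICAL_C:
        return ThermalState.CRITICAL
    if temp_c >= WARNING_C:
        return ThermalState.WARNING
    return ThermalState.NORMAL
```

For example, `classify_temperature(110.0)` returns `ThermalState.WARNING`, the state in which a managed system would be expected to report a Warning Threshold message.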
The DPU's thermal sensors can be read through the system's SMBus. The user can read
these sensors and adjust the system airflow according to the readouts so that the
SoC thermal requirements described above are met.
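One way a host might adapt airflow to the sensor readouts is a linear fan ramp that reaches full speed at the 105°C warning threshold. The sketch below is illustrative only: the function `fan_duty_percent`, the 70°C ramp start, and the 30% minimum duty are assumptions rather than NVIDIA-specified values, and the SMBus read itself is omitted because the sensor addresses and registers are not given here.

```python
def fan_duty_percent(
    temp_c: float,
    ramp_start_c: float = 70.0,   # assumed temperature at which the ramp begins
    ramp_end_c: float = 105.0,    # warning threshold from the text above
    min_duty: float = 30.0,       # assumed idle fan duty cycle, in percent
    max_duty: float = 100.0,
) -> float:
    """Return a fan duty cycle (percent) for a given SoC temperature reading."""
    if ramp_end_c <= ramp_start_c:
        raise ValueError("ramp_end_c must be greater than ramp_start_c")
    if temp_c <= ramp_start_c:
        return min_duty
    if temp_c >= ramp_end_c:
        return max_duty
    # Linear interpolation between the two ends of the ramp.
    fraction = (temp_c - ramp_start_c) / (ramp_end_c - ramp_start_c)
    return min_duty + fraction * (max_duty - min_duty)
```

With these assumed defaults, a reading halfway up the ramp (87.5°C) yields a 65% duty cycle. The required airflow for a specific card should still be taken from the vendor's airflow specification rather than from this heuristic.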
Heatsink
The heatsink, which dissipates heat from the SoC, is attached to the DPU by three screws.
The DPU SoC has a thermal shutdown safety mechanism that automatically shuts down
the DPU in cases of high-temperature events, improper thermal coupling, or heatsink
removal.
Refer to the table below for heatsink details per card configuration. For the required
airflow (LFM) per OPN, refer to the NVIDIA BlueField-2 DPUs Power and Airflow
Specifications document, available on NVOnline after login.