A DGX server reports degraded performance and storage alerts. How would you use NVSM and nvidia-smi to troubleshoot both system and GPU issues?
During server maintenance, a system administrator wants to ensure that the NVIDIA DGX server has sufficient disk space for operational activities. The administrator is scripting an alert system that will notify the team if disk space falls below a threshold. Which command could be included in the maintenance script to check the available disk space on the server?
You are a network administrator responsible for configuring an East-West (E/W) Spectrum-X fabric using SuperNIC. The Bluefield-3 devices in your network should be set to NIC mode with RoCE enabled to optimize data flow between servers. You have access to the Spectrum-X management tools and the necessary documentation. You need to use specific configuration commands to achieve this setup. Which of the following steps and commands are necessary to configure the Bluefield-3 devices in NIC mode for the E/W Spectrum-X fabric using SuperNIC? (Pick the 2 correct responses below)
Refer to the output:
~ $ sudo nvsm show healthinfo
—Timestamp: Sat Dec 16 16:26:32 2017 -0800
Version: 17.12-5
Checks—BIOS Revision [5.11].........................
DGX Serial Number [YSY72800016)..................
Verify installed DIMM memory sticks........................Healthy
...[output truncated)
Verify Ethernet controllers...........................Healthy
Verify installed GPU ' s..............................Unhealthy
Checking output of ' lspci ' for expected GPU ' s
Missing GPU at PCI address ' 07:00.0 '
Verify installed InfiniBand controllers....................Healthy
Verify PCIe switches..................................Healthy
...[output truncated)
What insights can a system administrator gain regarding the DGX system ' s health?
A system administrator wants to configure MIG for seven slices on an H100 GPU in an NVIDIA HGX system. Which command should be used?
An engineer needs to completely remove NVIDIA GPU drivers from an Ubuntu 22.04 system to troubleshoot conflicts. Which command sequence ensures all driver components are purged?
A 24-hour HPL burn-in fails with " illegal value " errors during the first iteration. Which initial troubleshooting step resolves this without compromising burn-in validity?
You are following the official steps to install the NVIDIA Container Toolkit using a package manager on Ubuntu. After importing the NVIDIA package repository and GPG key, what is the next action?
After upgrading to HPL-AI 2.0 on a DGX A100 cluster, a 2x performance gain is observed. Which optimization is primarily responsible for this improvement?
During cluster validation, the Cable Validation Tool (CVT) reports " Underperforming (BER) " for an InfiniBand link. Which BER thresholds indicate a critical signal quality issue requiring cable replacement?