The system administrator of a high-performance computing (HPC) cluster that uses an InfiniBand fabric for high-speed interconnects between nodes has received reports from researchers of unusually slow data transfer rates between two specific compute nodes. The administrator needs to verify that the path between these two nodes is optimal.
Which command should be used?
A system administrator is troubleshooting a Docker container that is repeatedly failing to start. They want to gather more detailed information about the issue by generating debugging logs.
Why would generating debugging logs be an important step in resolving this issue?
A Slurm user frequently finds that a job gets stuck in the “PENDING” state and is unable to progress to the “RUNNING” state.
Which Slurm command can help the user identify the reason for the job’s pending status?
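
As an illustration of the scenario above, here is a minimal Python sketch of one way to read a job's pending reason. It assumes a Slurm client is installed and uses `squeue` with the format specifiers `%i` (job ID), `%T` (state), and `%r` (reason). The helper names `build_squeue_command`, `parse_squeue_line`, and `pending_reason` are hypothetical and not part of Slurm, and the sketch is not presented as the expected answer to the question.

```python
import subprocess

# squeue format string: job ID, extended state, reason, pipe-separated.
SQUEUE_FORMAT = "%i|%T|%r"


def build_squeue_command(job_id):
    """Return the argv list that asks squeue for one job's state and reason."""
    return [
        "squeue",
        "--noheader",
        f"--jobs={job_id}",
        f"--format={SQUEUE_FORMAT}",
    ]


def parse_squeue_line(line):
    """Split one 'id|state|reason' output line into a dict."""
    job_id, state, reason = line.strip().split("|", 2)
    return {"job_id": job_id, "state": state, "reason": reason}


def pending_reason(job_id):
    """Run squeue (needs a Slurm client on PATH) and return the parsed record.

    Returns None when squeue prints nothing, e.g. the job has already left
    the queue.
    """
    result = subprocess.run(
        build_squeue_command(job_id),
        capture_output=True,
        text=True,
        check=True,
    )
    lines = result.stdout.splitlines()
    return parse_squeue_line(lines[0]) if lines else None
```

On a pending job the reason field typically holds values such as `Resources` or `Priority`. The same information also appears in the `Reason=` field printed by `scontrol show job`.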
You are managing a Slurm cluster with multiple GPU nodes, each equipped with different types of GPUs. Some jobs are being allocated GPUs that should be reserved for other purposes, such as display rendering.
How would you ensure that only the intended GPUs are allocated to jobs?
What steps should an administrator take if they encounter errors related to RDMA (Remote Direct Memory Access) when using Magnum IO?
A system administrator needs to lower latency for an AI application by utilizing GPUDirect Storage.
Which two bottlenecks are avoided with this approach? (Choose two.)
A Slurm user needs to display real-time information about the running processes and resource usage of a Slurm job.
Which command should be used?
You are tasked with deploying a deep learning framework container from NVIDIA NGC on a stand-alone GPU-enabled server.
What must you complete before pulling the container? (Choose two.)