Using GPUs
Here is a detailed guide on how to leverage the GPUs on the NBI cluster.
Preparation work: making sure that your software is GPU-aware
Before proceeding, it is recommended to have a local installation of Anaconda or Miniconda with a recent Python version (preferably 3.10+).
Installing CUDA
Normally, you would use the system-wide CUDA installation to make sure that it is compatible with the GPUs. In fact, there are environment modules for CUDA pre-installed on the system (e.g. cuda/11.2; note that you will need to load the astro module first).
Here we take a different route -- we install our own (and a newer version of) CUDA for greater control. Usually you would want to install the latest CUDA that your GPUs support, but as of the time of this writing, torch lacks support for the latest CUDA version (12.x), so we opt for an earlier release (11.8).
To install CUDA via conda, do
conda install cuda -c nvidia/label/cuda-11.8.0
You should check that your installation works by running
nvcc --version
The reported version should match the one you just installed.
(NOTE: This number can differ from the version reported by nvidia-smi, since nvidia-smi shows the latest CUDA version supported by the installed driver. In other words, make sure that the CUDA you install is 'at most' that version.)
Installing torch
Once you have CUDA properly installed, everything else should be a breeze. To install torch with CUDA awareness, simply do
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
Test your installation with the following simple code snippet
import torch
torch.zeros(100).cuda()
If there is no error message, you have successfully installed torch with CUDA support.
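For a slightly more thorough sanity check, here is a minimal sketch (assuming at least one CUDA device is visible to your session) that prints the detected device and runs a small computation on it:
import torch

# Confirm that the CUDA build of torch was picked up
print(torch.cuda.is_available())        # should print True
print(torch.cuda.get_device_name(0))    # name of the device assigned to you

# Run a small matrix multiplication on the GPU
x = torch.randn(1000, 1000, device="cuda")
y = x @ x
print(y.sum().item())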
Installing cupy
Again, if you have CUDA installed, the installation of cupy is very straightforward. Simply run
conda install -c conda-forge cupy cudnn cutensor nccl
Note that conda will intelligently (and hopefully) detect the proper versions to install for your current CUDA installation.
Test your installation with the following simple code snippet
import cupy
cupy.random.rand(100).device
It should say something like <CUDA Device 0>.
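As a further check, a small sketch like the one below runs a reduction on the GPU and copies the result back to the host (nothing here is specific to the cluster):
import cupy as cp

# Allocate an array on the GPU and run a small reduction there
x = cp.random.rand(1000)
total = (x ** 2).sum()

# Copy the result back to the host as a NumPy scalar
print(cp.asnumpy(total))

# Number of CUDA devices cupy can see
print(cp.cuda.runtime.getDeviceCount())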
Installing jax
Installation of jax with CUDA is also simple. Run
pip install --upgrade "jax[cuda11_pip]" -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
Test your installation with
import jax
jax.devices()
It should say something like [cuda(id=0)]
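To confirm that computations actually run on the GPU backend, a minimal sketch along these lines can be used (the array size is arbitrary):
import jax
import jax.numpy as jnp

# Should report "gpu" if the CUDA-enabled jaxlib was installed correctly
print(jax.default_backend())

# Run a small matrix multiplication; block_until_ready() forces execution now
x = jnp.ones((1000, 1000))
y = jnp.dot(x, x).block_until_ready()
print(y[0, 0])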
Running a job directly on a GPU-equipped headnode
The GPU-equipped headnode/frontend is astro02 (accessible at astro02.hpc.ku.dk). There are physically 3 Nvidia A30 GPUs. One of them is virtually split into 4 smaller, independent virtual GPUs (in Nvidia's terminology, MIG or Multi-Instance GPU), one is split into 2 larger MIG instances, and one remains 'unsplit'.
To specify which GPU to use, set the environment variable CUDA_VISIBLE_DEVICES. To see the list of 'compute instances' available, run
nvidia-smi -L
On astro02, you should see something like
GPU 0: NVIDIA A30 (UUID: GPU-654aa619-952d-3f17-01ec-0c050ac8df88)
  MIG 1g.6gb Device 0: (UUID: MIG-3868837f-57d0-5089-9887-19240a8809b4)
  MIG 1g.6gb Device 1: (UUID: MIG-d28bcf9f-db13-5ad0-9be2-62d0e25c92a9)
  MIG 1g.6gb Device 2: (UUID: MIG-e175ec33-0f38-5952-98d5-1c118bd9d398)
  MIG 1g.6gb Device 3: (UUID: MIG-53cc4525-2ae7-5c11-9680-302d1d4177ba)
GPU 1: NVIDIA A30 (UUID: GPU-cb8c2438-a361-3e30-4ff5-4481d43c9e83)
  MIG 2g.12gb Device 0: (UUID: MIG-0a768004-2ded-55f6-ac2b-4dd3f696a222)
  MIG 2g.12gb Device 1: (UUID: MIG-0296d938-ea26-5174-a884-cd3c686bf660)
GPU 2: NVIDIA A30 (UUID: GPU-9bcd54bd-5a72-2e7b-90c8-3e3719d09e5c)
  MIG 4g.24gb Device 0: (UUID: MIG-a8cb1bd5-6f68-54a1-8e88-ca2fa4ef80c0)
For example, if we want to use the third MIG 1g.6gb instance, with the UUID MIG-e175ec33-0f38-5952-98d5-1c118bd9d398, set the environment variable
export CUDA_VISIBLE_DEVICES=MIG-e175ec33-0f38-5952-98d5-1c118bd9d398
Then, running the same test code for torch and checking with nvidia-smi, we see that
+---------------------------------------------------------------------------------------+
| MIG devices:                                                                            |
+------------------+--------------------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |                   Memory-Usage |        Vol|      Shared           |
|      ID  ID  Dev |                     BAR1-Usage | SM     Unc| CE ENC DEC OFA JPG    |
|                  |                                |        ECC|                       |
|==================+================================+===========+=======================|
|  0    3   0   0  |              12MiB /  5952MiB  | 14      0 |  1   0    1    0    0 |
|                  |               0MiB /  8191MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    4   0   1  |              12MiB /  5952MiB  | 14      0 |  1   0    1    0    0 |
|                  |               0MiB /  8191MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    5   0   2  |             107MiB /  5952MiB  | 14      0 |  1   0    1    0    0 |
|                  |               2MiB /  8191MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  0    6   0   3  |              12MiB /  5952MiB  | 14      0 |  1   0    1    0    0 |
|                  |               0MiB /  8191MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  1    1   0   0  |              25MiB / 11968MiB  | 28      0 |  2   0    2    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  1    2   0   1  |              25MiB / 11968MiB  | 28      0 |  2   0    2    0    0 |
|                  |               0MiB / 16383MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
|  2    0   0   0  |               1MiB / 24062MiB  | 56      0 |  4   0    4    1    1 |
|                  |               1MiB / 32768MiB  |           |                       |
+------------------+--------------------------------+-----------+-----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                             GPU Memory |
|        ID   ID                                                              Usage      |
|=========================================================================================|
|    0    5    0     530009      C   ...nda3/envs/igwn-py310/bin/python3.10        88MiB |
+---------------------------------------------------------------------------------------+
Indeed we are using the desired MIG.
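If you prefer to select the device from within Python instead of exporting the variable in the shell, a minimal sketch (reusing the same example UUID) is to set it before the first CUDA call:
import os

# Must be set before the first CUDA initialisation (e.g. before any .cuda() call)
os.environ["CUDA_VISIBLE_DEVICES"] = "MIG-e175ec33-0f38-5952-98d5-1c118bd9d398"

import torch

print(torch.cuda.device_count())   # should report 1 visible device
x = torch.zeros(100).cuda()        # lands on the selected MIG instance
print(x.device)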
Submitting a job to the GPU partition with slurm
Simply specify the GPU partition, astro2_gpu, and how many 'generic resources' (GRES; in this case, GPUs) you want to use when submitting a job with slurm.
An example command is
srun -p astro2_gpu --gres=gpu:1 nvidia-smi
This should show the GPU (not a virtual/MIG one) that has been assigned to you.
As far as I know, there are 11 Nvidia A100 GPUs in this partition.
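If you want your job itself to report which GPU slurm assigned to it, a minimal sketch of a small script could look like the following (the file name check_gpu.py is just an example); you could launch it with, e.g., srun -p astro2_gpu --gres=gpu:1 python check_gpu.py
# check_gpu.py -- report which GPU(s) slurm has made visible to this job
import os
import torch

# slurm typically exposes the assigned GPU(s) via CUDA_VISIBLE_DEVICES
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES", "<not set>"))

# Names of the devices torch can actually see
for i in range(torch.cuda.device_count()):
    print(f"device {i}: {torch.cuda.get_device_name(i)}")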