Run GPU Job

GPU Resources

The table below summarises the CSG-managed GPU resources:

| Resources | Path / Where | Remark |
|---|---|---|
| CUDA from OS | `/usr/local/cuda/` | |
| CUDA Containers | `/share/apps/sif/x11vnc` | Software and Containers \| Apptainer |
| CUDA Module | `module avail cuda/12.5.1-gcc-12.2.0-tk2uq2c` | Software and Containers \| Environment Modules |
| Conda/Python Environments | `/share/apps/noarch/miniforge3/envs/` | https://cseunsw.atlassian.net/wiki/x/XgCjBQ |
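
For example, the CUDA module listed above can be loaded in the usual Environment Modules way (a minimal sketch, assuming the module name shown by `module avail` is still current):

```bash
# Load the CSG-provided CUDA toolchain, then confirm the compiler is visible
module load cuda/12.5.1-gcc-12.2.0-tk2uq2c
nvcc --version
```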

Submit GPU Job

The job scheduler allocates GPU resources on a semi-best-effort basis: if it cannot allocate the requested number of GPUs (usually due to minor hardware failures), it will still dispatch the job to run. This avoids wasting the user's time in the queue, but it is then up to the user or the job to handle the shortfall at the application level (for example, exit anyway, or switch to the CPU).
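
As a rough sketch of such application-level handling, a job script can count the GPUs it was actually granted via the `CUDA_VISIBLE_DEVICES` variable (described under Job Environment below) and decide how to proceed. The requested count and the exit policy here are illustrative assumptions, not a CSG convention:

```bash
#!/bin/bash
# Illustrative guard, not a CSG-prescribed pattern: compare the GPUs actually
# granted against what the job asked for (keep in sync with '-l ngpus=...').
REQUESTED_GPUS=2

# CUDA_VISIBLE_DEVICES holds a comma-separated list of GPU UUIDs (see Job Environment)
if [ -z "${CUDA_VISIBLE_DEVICES}" ]; then
    granted=0
else
    granted=$(awk -F',' '{print NF}' <<< "${CUDA_VISIBLE_DEVICES}")
fi

if [ "${granted}" -lt "${REQUESTED_GPUS}" ]; then
    echo "Only ${granted} of ${REQUESTED_GPUS} requested GPUs were granted" >&2
    exit 1   # alternatively: fall back to a CPU code path instead of exiting
fi

exec ./gpu_application   # hypothetical placeholder for the real workload
```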

GPU resources are tightly managed by the job scheduler. To use a GPU, users must request it explicitly in the job submission. A typical qsub command looks like:

```bash
qsub -b y -N gpu_job -pe smp 8 -l ngpus=2,gpu_model=A2,mem=16G,jobfs=10G GPU_JOB_CMD
```

The following GPU-related attributes are available:

| Attribute | Description |
|---|---|
| `ngpus` | Integer value, the number of GPUs requested |
| `gpu_model` | String value, selecting a particular GPU model |
| `gpu_code` | String value, selecting a particular GPU family via its codename |

View Available Hardware

```
# This gives the details of the specified compute node. Removing the '-h' option
# will print out the details of ALL compute nodes.
$ qhost -F ngpus,gpu_model,gpu_code -h wp-zeta-c20
HOSTNAME                    ARCH     NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
----------------------------------------------------------------------------------------------
global                      -           -    -    -    -     -       -       -       -       -
wp-zeta-c20.cse.unsw.edu.au lx-amd64   96    2   48   96  4.78 1007.4G    3.7G  977.0M    9.0M
    Host Resource(s):   hc:ngpus=2.000000
                        hf:gpu_code=GH100
                        hf:gpu_model=H100_NVL

# To view a GPU model summary:
$ qhost -F | grep gpu_model | sort -u
    hf:gpu_model=A2
    hf:gpu_model=H100_NVL
    hf:gpu_model=L4
    hf:gpu_model=L40S
    hf:gpu_model=T1000_8GB
```
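
For instance, the `gpu_code` attribute can target a GPU family rather than an exact model. The command below is a hypothetical variation of the earlier submission, using the GH100 codename reported by qhost above:

```bash
# Request one GPU from the GH100 family, whichever matching model is free
qsub -b y -N gpu_job -pe smp 8 -l ngpus=1,gpu_code=GH100,mem=16G,jobfs=10G GPU_JOB_CMD
```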

Job Environment

GPU resources are allocated and assigned on a per-job basis. The following environment variables, describing the allocated resources, are injected into the job. It is up to the user or the job to make use of this information as the application requires.

| Variable | Description | Note/Example |
|---|---|---|
| `CUDA_DEVICE_ORDER` | CUDA variable, fixed value, normally `PCI_BUS_ID` | |
| `CUDA_VISIBLE_DEVICES` | CUDA variable, a comma-separated list of GPU UUIDs | `CUDA_VISIBLE_DEVICES=GPU-eaf4802f-e78d-e9d8-6381-afe98b201dae,GPU-aa3ba611-cc7d-c6db-bb32-905d732b4427` |
| `APPTAINERENV_CUDA_VISIBLE_DEVICES` | Used by Apptainer containers, same value as `CUDA_VISIBLE_DEVICES` | https://apptainer.org/docs/user/main/gpu.html#multiple-gpus |
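
As an illustration, a containerised job does not need to copy these variables by hand: `APPTAINERENV_CUDA_VISIBLE_DEVICES` is already exported, so starting Apptainer with its `--nv` GPU flag is enough. The image path and script name below are hypothetical:

```bash
# The container inherits CUDA_VISIBLE_DEVICES via APPTAINERENV_CUDA_VISIBLE_DEVICES,
# so it only sees the GPUs allocated to this job.
# 'cuda.sif' and 'train.py' are hypothetical names.
apptainer exec --nv /share/apps/sif/cuda.sif python3 train.py

# The allocation can also be inspected directly from the job script:
echo "Allocated GPUs: ${CUDA_VISIBLE_DEVICES}"
```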