Run GPU Job

GPU Resources

The table below summarises the CSG-managed GPU resources:

Resource                  | Path/Where                                  | Remark
--------------------------|---------------------------------------------|--------------------------------------------
CUDA from OS              | /usr/local/cuda/                            |
CUDA Containers           | /share/apps/sif/x11vnc                      |
CUDA Module               | module avail cuda/12.5.1-gcc-12.2.0-tk2uq2c |
Conda/Python Environments | /share/apps/noarch/miniforge3/envs/         | https://cseunsw.atlassian.net/wiki/x/XgCjBQ
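
As a usage example, a job can combine these resources as below. The module name comes from the table above; MY_ENV is a placeholder, and the conda.sh path assumes the standard miniforge3 layout under /share/apps/noarch/miniforge3:

# Load the CSG-provided CUDA toolchain
module load cuda/12.5.1-gcc-12.2.0-tk2uq2c

# Activate a shared conda environment (MY_ENV is a placeholder; the available
# environments live under /share/apps/noarch/miniforge3/envs/)
source /share/apps/noarch/miniforge3/etc/profile.d/conda.sh
conda activate MY_ENV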

Submit GPU Job

The job scheduler allocates GPU resources on a semi-best-effort basis: if it cannot allocate the requested number of GPUs (usually due to minor hardware failures), it will still dispatch the job to run. This avoids wasting the user's queued time, but it is up to the user or the job to handle the shortfall at the application level (for example, exit immediately, or fall back to the CPU), as in the sketch below.
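
The following is a minimal sketch of such a check, assuming the scheduler has injected CUDA_VISIBLE_DEVICES (see Job Environment below); my_app and its --device flag are hypothetical placeholders:

#!/bin/bash
# Hypothetical fallback handling; NGPUS_REQUESTED should match the -l ngpus value.
NGPUS_REQUESTED=2

# CUDA_VISIBLE_DEVICES holds a comma-separated list of allocated GPU UUIDs;
# count its entries to find out how many GPUs were actually granted.
if [ -n "$CUDA_VISIBLE_DEVICES" ]; then
    ngpus_allocated=$(echo "$CUDA_VISIBLE_DEVICES" | tr ',' '\n' | wc -l)
else
    ngpus_allocated=0
fi

if [ "$ngpus_allocated" -lt "$NGPUS_REQUESTED" ]; then
    echo "Allocated $ngpus_allocated of $NGPUS_REQUESTED GPUs" >&2
    # Option 1: give up and let the user resubmit
    # exit 1
    # Option 2: continue on the CPU (my_app and --device are placeholders)
    exec my_app --device cpu
fi

exec my_app --device gpu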

GPU resources are tightly managed by the job scheduler. To use a GPU, users must request it explicitly in their job submission. A typical qsub command looks like:

qsub -b y -N gpu_job -pe smp 8 -l ngpus=2,gpu_model=A2,mem=16G,jobfs=10G GPU_JOB_CMD
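
For anything beyond a single command, the same request can be written as a job script instead of a binary submission (-b y). A minimal sketch, where GPU_JOB_CMD again stands for the actual application command:

#!/bin/bash
#$ -N gpu_job
#$ -pe smp 8
#$ -l ngpus=2,gpu_model=A2,mem=16G,jobfs=10G

GPU_JOB_CMD

Submit it with: qsub gpu_job.sh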

The following GPU-related attributes are available:

Attribute | Description
----------|-------------------------------------------------------------------
ngpus     | Integer value, the number of GPUs
gpu_model | String value, specifying a particular GPU model
gpu_code  | String value, specifying a particular GPU family via its codename
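
For example, using values reported by qhost in the next section, a job can target either a whole GPU family or a specific model:

# Any GPU from the GH100 family (e.g. H100_NVL)
qsub -b y -N gpu_job -pe smp 8 -l ngpus=1,gpu_code=GH100,mem=16G,jobfs=10G GPU_JOB_CMD

# A specific model
qsub -b y -N gpu_job -pe smp 8 -l ngpus=1,gpu_model=L40S,mem=16G,jobfs=10G GPU_JOB_CMD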

View Available Hardware

# This gives the details of the specified compute node. Removing '-h' will
# print the details of ALL compute nodes.
$ qhost -F ngpus,gpu_model,gpu_code -h wp-zeta-c20
HOSTNAME                    ARCH      NCPU NSOC NCOR NTHR LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
----------------------------------------------------------------------------------------------
global                      -         -    -    -    -    -     -       -       -       -
wp-zeta-c20.cse.unsw.edu.au lx-amd64  96   2    48   96   4.78  1007.4G 3.7G    977.0M  9.0M
    Host Resource(s):   hc:ngpus=2.000000
                        hf:gpu_code=GH100
                        hf:gpu_model=H100_NVL

# To view a GPU model summary:
$ qhost -F | grep gpu_model | sort -u
    hf:gpu_model=A2
    hf:gpu_model=H100_NVL
    hf:gpu_model=L4
    hf:gpu_model=L40S
    hf:gpu_model=T1000_8GB

Job Environment

GPU resources are allocated on a per-job basis. The following environment variables, describing the allocated resources, are injected into the job's environment. It is up to the user/job to use this information as the application requires.

Variable                          | Description                                                      | Note/Example
----------------------------------|------------------------------------------------------------------|--------------------------------------------
CUDA_DEVICE_ORDER                 | CUDA variable, fixed value, normally PCI_BUS_ID                  |
CUDA_VISIBLE_DEVICES              | CUDA variable, a comma-separated list of GPU UUIDs               | CUDA_VISIBLE_DEVICES=GPU-eaf4802f-e78d-e9d8-6381-afe98b201dae,GPU-aa3ba611-cc7d-c6db-bb32-905d732b4427
APPTAINERENV_CUDA_VISIBLE_DEVICES | Used by Apptainer containers; same value as CUDA_VISIBLE_DEVICES |
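
As a usage sketch, a job can simply print these variables and pass the allocation into a container. Apptainer copies any APPTAINERENV_-prefixed host variable into the container (dropping the prefix), so the allocation follows the job automatically; IMAGE.sif is a placeholder:

# Inside the job: show what was allocated
echo "CUDA_DEVICE_ORDER=$CUDA_DEVICE_ORDER"
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"

# --nv exposes the NVIDIA driver inside the container, and
# APPTAINERENV_CUDA_VISIBLE_DEVICES becomes CUDA_VISIBLE_DEVICES in there.
apptainer exec --nv /share/apps/sif/IMAGE.sif nvidia-smi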