Run GPU Job
GPU Resources
The table below summarises the CSG-managed resources for GPUs:
Resources | PATH/WHERE | Remark |
---|---|---|
CUDA from OS | | |
CUDA Containers | | |
CUDA Module | | |
Conda/python Environments | | |
Submit GPU Job
The job scheduler allocates GPU resources on a semi best-effort basis: if it cannot allocate the requested number of GPUs (usually due to minor hardware failures), it will still dispatch the job to run. This approach avoids wasting the user's queued time, but it is up to the user or the job to handle the shortfall at the application level (for example, by exiting anyway, or by switching to the CPU).
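Since a job may start with fewer GPUs than it requested, a defensive check at the top of the job script can decide how to proceed. The sketch below is only an illustration: `count_gpus` is a hypothetical helper, and it assumes the allocated devices appear in the standard `CUDA_VISIBLE_DEVICES` variable as a comma-separated list.

```shell
# Hypothetical guard for a job script: compare the number of GPUs the
# scheduler actually delivered against the number requested.
count_gpus() {
    # Count entries in a comma-separated device list; empty list -> 0.
    if [ -n "$1" ]; then
        printf '%s' "$1" | awk -F',' '{print NF}'
    else
        echo 0
    fi
}

REQUESTED=2
ALLOCATED=$(count_gpus "${CUDA_VISIBLE_DEVICES:-}")
if [ "$ALLOCATED" -lt "$REQUESTED" ]; then
    echo "only $ALLOCATED of $REQUESTED GPUs allocated; falling back to CPU"
else
    echo "running on all $REQUESTED requested GPUs"
fi
```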
GPU resources are tightly managed by the job scheduler. To use a GPU, users must specify the request in their job submission. A typical qsub
command looks like:
qsub -b y -N gpu_job -pe smp 8 -l ngpus=2,gpu_model=A2,mem=16G,jobfs=10G GPU_JOB_CMD
The following GPU related attributes are available:
Attribute | Description |
---|---|
ngpus | Integer value, the number of GPUs |
gpu_model | String value, specifying a particular GPU model |
gpu_code | String value, specifying a particular GPU family via its codename |
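For example, combining these attributes (hedged: `GPU_JOB_CMD`, the job names, and the resource amounts are placeholders, and these commands only make sense on the cluster's submission host):

```shell
# Any single GPU, with the model left to the scheduler:
qsub -b y -N any_gpu -pe smp 4 -l ngpus=1,mem=8G,jobfs=10G GPU_JOB_CMD

# A specific GPU model:
qsub -b y -N l40s_job -pe smp 8 -l ngpus=1,gpu_model=L40S,mem=16G GPU_JOB_CMD

# Any GPU of a given family, selected by its codename:
qsub -b y -N hopper_job -pe smp 8 -l ngpus=2,gpu_code=GH100,mem=16G GPU_JOB_CMD
```

The available `gpu_model` and `gpu_code` values can be discovered with `qhost`, as shown in the next section.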
View Available Hardware
# This gives the details of the specified compute node. Removing the '-h' option
# prints the details of ALL compute nodes.
$ qhost -F ngpus,gpu_model,gpu_code -h wp-zeta-c20
HOSTNAME ARCH NCPU NSOC NCOR NTHR LOAD MEMTOT MEMUSE SWAPTO SWAPUS
----------------------------------------------------------------------------------------------
global - - - - - - - - - -
wp-zeta-c20.cse.unsw.edu.au lx-amd64 96 2 48 96 4.78 1007.4G 3.7G 977.0M 9.0M
Host Resource(s): hc:ngpus=2.000000
hf:gpu_code=GH100
hf:gpu_model=H100_NVL
# To view a GPU model summary:
$ qhost -F | grep gpu_model | sort -u
hf:gpu_model=A2
hf:gpu_model=H100_NVL
hf:gpu_model=L4
hf:gpu_model=L40S
hf:gpu_model=T1000_8GB
Job Environment
GPU resources are allocated and assigned on a per-job request basis. The following environment variables, containing details of the allocated resources, are injected into the job's environment. It is up to the user or the job to use this information as the application requires.
Variable | Description | Note/Example |
---|---|---|
| CUDA variable, fixed value, normally | |
| CUDA variable, a list of comma separated UUIDs | |
| Used by Apptainer container, value same as | |
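As a sketch of how a job script might consume these variables (assuming the standard CUDA convention that `CUDA_VISIBLE_DEVICES` carries the comma-separated device list; whether this site injects it under that exact name is an assumption, and `list_devices` is a hypothetical helper):

```shell
# Hypothetical job script fragment: inspect the injected device list.
list_devices() {
    # Split a comma-separated device list into one entry per line.
    if [ -n "$1" ]; then
        printf '%s\n' "$1" | tr ',' '\n'
    fi
}

echo "Allocated GPU devices:"
list_devices "${CUDA_VISIBLE_DEVICES:-}"

# For a containerised application, Apptainer's standard '--nv' flag
# exposes the host NVIDIA driver and GPUs inside the container.
# 'my_image.sif' and 'my_app' are placeholders:
#   apptainer exec --nv my_image.sif my_app
```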