
The backend job scheduler is Grid Engine. It is different from the PBS scheduler used by NCI, but the two work in a similar way.

Resource Requests

The table below summarises the major resource attributes that are commonly used by most jobs. Other attributes that help fine-tune how a job is scheduled are described in the subsections below.

Resources | Attribute | Description | Default Value
Parallel Environments (-pe) | smp | Allocate X CPUs on the SAME compute node |
Parallel Environments (-pe) | mpi | Allocate X CPUs across multiple compute nodes; mainly used by jobs built on the Open MPI framework |
Resource request list (-l) | mem | The amount of memory a job will use |
Resource request list (-l) | jobfs | The amount of local disk space a job will use |
Resource request list (-l) | walltime | The run time limit for a job |
Resource request list (-l) | ngpus | The number of GPGPUs requested |
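The Default Value column is not filled in above because the defaults are part of the cluster configuration. On a standard Grid Engine installation they can be inspected with the qconf command, for example (a quick sketch; the exact attribute names may differ between clusters):

# list the parallel environments configured on this cluster (e.g. smp, mpi)
qconf -spl

# show the full definition of the smp parallel environment
qconf -sp smp

# show all resource attributes (complexes) together with their default values
qconf -sc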

GPGPU

TODO

Local Scratch

TODO

Rack Awareness

TODO

Submit a Batch Job

A batch job is submitted with the qsub command, following this pattern:

# submit a job that runs a script (bash, shell, Python, etc.)
qsub -N JOB_NAME -pe smp NUMBER_OF_CPU -l ATTR1=VAL1,ATTR2=VAL2 SCRIPT

# submit a job that runs a BINARY (anything that is not a script, such as sleep, dd, etc.); note the -b y flag
qsub -N JOB_NAME -pe smp NUMBER_OF_CPU -l ATTR1=VAL1,ATTR2=VAL2 -b y BINARY

Examples

# a large sleep job that needs 16 CPUs, 2 GPGPUs, 65G of memory and 10G of local disk space
qsub -b y -N generic_gpgpu -pe smp 16 -l ngpus=2,mem=65G,jobfs=10G sleep 1m

# a smaller sleep job that requires the specific A2 GPGPU model
qsub -b y -N t1000_gpgpu -pe smp 8 -l ngpus=2,gpgpu_model=A2,mem=16G,jobfs=10G sleep 1m

# a large job that runs on multiple H100 nodes inside the same physical rack/cabinet F (rack awareness)
qsub -b y -N h100_gpgpu -pe mpi 256 -l ngpus=2,gpgpu_model=H100,rack=f,mem=128G,jobfs=100G sleep 1m
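After submission, jobs can be monitored and managed with the standard Grid Engine client commands, for example (JOB_ID is the ID reported by qsub):

# list your pending and running jobs
qstat

# show the full details of a particular job, including its resource requests
qstat -j JOB_ID

# remove a job from the queue
qdel JOB_ID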

Submission Script

For larger and more complex analyses, a qsub submission script can be very useful. A submission script contains pre-populated qsub parameters and can easily be reused, distributed and version controlled. It looks like this:

#!/bin/bash
#
# Prints the actual path of the job scratch directory, then writes a large test file into it.
#$ -pe smp 8
#$ -j y
#$ -e logs/$JOB_ID_$JOB_NAME.out
#$ -o logs/$JOB_ID_$JOB_NAME.out
#$ -cwd
#$ -N dd_smp
#$ -l mem=1G,jobfs=110G,tmpfree=150G,walltime=00:30:00
#

echo "$HOST $tmp_requested $TMPDIR"

# write about 107 GB (200 x 512 MiB blocks) of zeros into the job scratch directory
dd if=/dev/zero of=$TMPDIR/dd.test bs=512M count=200

To submit:

z1234567@login01:~$ qsub sge_dd_smp.sh
Your job 31 ("dd_smp") has been submitted
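The same approach works for multi-node Open MPI jobs using the mpi parallel environment. The sketch below is illustrative only: the module name openmpi and the binary ./my_mpi_app are hypothetical placeholders, and $NSLOTS is the variable Grid Engine sets to the number of slots granted to the job.

#!/bin/bash
#
# Illustrative multi-node MPI submission script (module and binary names are placeholders).
#$ -pe mpi 64
#$ -j y
#$ -o logs/$JOB_ID_$JOB_NAME.out
#$ -cwd
#$ -N mpi_example
#$ -l mem=4G,walltime=01:00:00
#

module load openmpi                 # hypothetical module name; check what the cluster actually provides
mpirun -np $NSLOTS ./my_mpi_app     # $NSLOTS is set by Grid Engine to the allocated slot count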