Unlike your desktop or a group server, Balena is a shared resource available to all researchers in the University. As such we need to manage how jobs are run to ensure that everyone gets their fair share and that resources are used efficiently.
If multiple jobs ran on a single node at the same time, users would be competing for the same resources and jobs would take longer to run overall. By managing jobs through the scheduler, each job is allocated the resources it needs as they become available. This results in a higher overall throughput and more consistent performance.
There are a number of schedulers in use on HPC systems; on Balena we use Slurm. To interact with the scheduler you need to be familiar with a number of key Slurm commands (a short worked example follows the table):
Slurm command | Function |
---|---|
`sinfo` | View information about SLURM nodes and partitions |
`squeue` | List the status of jobs in the queue |
`squeue --user [userid]` | List jobs belonging to a user |
`squeue --job [jobid]` | List a job by job ID |
`sbatch [jobscript]` | Submit a jobscript to the scheduler |
`scancel [jobid]` | Cancel a job in the queue |
`sshare` | Show project and fairshare information |
`scontrol hold [jobid]` | Hold a job in the queue |
`scontrol release [jobid]` | Release a held job |
`scontrol show job [jobid]` | View information about a job |
`scontrol show node [nodename]` | View information about a node |
`scontrol show license` | List the licenses available on SLURM |
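As a minimal sketch of how these commands fit together, here is a simple jobscript. The job name, filename, resource requests, and partition below are placeholder values, not Balena-specific settings; check `sinfo` for the partitions actually available to you.

```bash
#!/bin/bash
# Example jobscript (placeholder values -- adjust for your own work).
#SBATCH --job-name=myjob        # a name for the job
#SBATCH --nodes=1               # number of nodes to request
#SBATCH --ntasks-per-node=16    # number of tasks per node
#SBATCH --time=00:10:00         # walltime limit (hh:mm:ss)
#SBATCH --partition=batch       # placeholder partition name; see sinfo

# Commands to run go here
echo "Job $SLURM_JOB_ID running on $(hostname)"
```

You would then submit and monitor it with, for example:

```bash
sbatch myjob.slm        # submit; prints the assigned jobid
squeue --user $USER     # check its position in the queue
scancel [jobid]         # cancel it if you change your mind
```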
### sinfo
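On its own, `sinfo` summarises the state of the cluster, listing each partition together with its availability, time limit, node counts, node states, and node list:

```bash
sinfo    # summarise partitions and node states across the cluster
```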
Each of these partitions has an informative name that explains what it contains. The batch-acc partition holds all the nodes that have accelerators or offloading devices. We can find out more about these nodes with:
```bash
sinfo -N --partition batch-acc --Format=nodelist,features,gres
```
### sshare
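`sshare` reports the share of the machine your projects are entitled to and how much of that share has been used. By default it shows the associations for the current user; as a sketch, you can also restrict the report explicitly with the `--users` option:

```bash
sshare                  # fairshare summary for your projects
sshare --users $USER    # restrict the report to your own usage
```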