Slurm: Balena's scheduler

Overview:

  • Teaching: 10 min
  • Exercises: 5 min

Questions

  • What is a scheduler?
  • How can I use slurm to manage and run my jobs?
  • What slurm commands can I use to explore Balena?
  • How do I access information about my project accounts?

Objectives

  • Know that the scheduler manages jobs on the service
  • Know how to interact with slurm to:
    • See what jobs are running
    • Check availability of different partitions
    • Check how much resource you have available in your projects

Scheduler

Unlike your desktop, or perhaps a group server, Balena is a shared resource accessible to all researchers in the University. As such, we need to manage how jobs are run to ensure that everyone gets their fair share and that resources are used efficiently.

If multiple jobs ran on a single node at the same time, users would be competing for the same resources and every job would take longer to complete. By managing jobs through the scheduler, each job is allocated the resources it needs as they become available. This results in higher overall throughput and more consistent performance.

Slurm: Simple Linux Utility for Resource Management

There are a number of schedulers in use on HPC systems; on Balena we use Slurm. In order to interact with the scheduler you need to be familiar with a number of key Slurm commands:

Slurm command                     Function
sinfo                             View information about SLURM nodes and partitions
squeue                            List status of jobs in the queue
squeue --user [userid]            List jobs belonging to a user
squeue --job [jobid]              List a job by its jobid
sbatch [jobscript]                Submit a jobscript to the scheduler
scancel [jobid]                   Cancel a job in the queue
sshare                            Show project and fairshare information
scontrol hold [jobid]             Hold a job in the queue
scontrol release [jobid]          Release a held job
scontrol show job [jobid]         View information about a job
scontrol show node [nodename]     View information about a node
scontrol show license             List the licenses available on SLURM
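
As a quick sketch of how these fit together, a typical interaction with the scheduler might look something like the following; the jobscript name and jobid are placeholders, not real files or jobs on Balena:

sbatch myjob.sh           # submit a jobscript to the queue
squeue --user $USER       # check the state of your jobs
scancel 123456            # cancel a job using its jobid
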
sinfo

Running sinfo on its own lists the partitions on Balena. Each of these partitions has an informative name that describes what it contains; batch-acc, for example, holds all of the nodes that have accelerators or offload devices. We can find out more about these with:

sinfo -Ne --partition batch-acc --Format=nodelist,features,gres
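
If you only want a quick overview rather than a node-by-node listing, sinfo's standard --summarize option (this is plain Slurm, nothing Balena-specific) condenses the output to one line per partition:

sinfo --summarize         # one line per partition: availability, node counts and states
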
sshare
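
sshare reports the project accounts (associations) you belong to and how much of each project's fair-share allocation has been used, which is how you check the resource available to your projects. A minimal example, assuming you only want your own associations:

sshare --users $USER      # fair-share information for your own project accounts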

Interrogating the queue and node information

  1. Find how many nodes are currently available (idle) on the cluster
  2. Find a list of running jobs in the batch partition
  3. What is the priority of the top job waiting to run (pending)?
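
If you get stuck, one possible approach using only the commands introduced above is sketched here; the exact output will depend on the state of the cluster when you run it:

# 1. Nodes that are currently idle
sinfo --states=idle

# 2. Running jobs in the batch partition
squeue --partition=batch --states=RUNNING

# 3. Pending jobs sorted by priority, highest first (%Q prints the priority)
squeue --states=PENDING --sort=-p --format="%.10i %.10Q %.20j"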

Key Points:

  • We use a scheduler to manage jobs on the HPC service, ensuring fair share and efficient use.
  • Balena uses the slurm scheduler
  • Key commands are:
    • sbatch to submit a job
    • sinfo to view information about the service
    • scancel to delete a job
    • squeue to view the queue
  • Further information about commands can be found at slurm.schedmd.com