Running jobs on Balena

Overview:

  • Teaching: 10 min
  • Exercises: 5 min

Questions

  • How do I run a job on Balena?
  • How does my job run?
  • Do I have to queue to run tests and develop jobs or codes?
  • What if I want to run longer jobs?

Objectives

  • Know how to run batch and interactive jobs
  • Understand how MPI jobs execute
  • Know how to use the interactive queue to develop code in real time
  • Know how to use the development queue to benchmark performance
  • Understand the limits on priority and size of jobs for free and premium queues

Anatomy of a jobscript

The main way of running a job is to submit a jobscript with the sbatch command. The input files for the course that you copied earlier include example jobscripts for you to modify. We will begin by inspecting the jobscript hello.slurm. First we need to move to the correct directory. If you followed the previous instructions, the training materials will be in ~/scratch/balena-intro. We can change to the scratch directory and display the jobscript:

cd ~/scratch/
cat balena-intro/hello/hello.slurm

The jobscript consists of several parts. The first line tells the operating system that the file is a script.

This is followed by a series of lines beginning #SBATCH, which are specific instructions to the Slurm scheduler. Some of these, such as --job-name, are self-explanatory. Others, such as --partition, you can hopefully interpret after the earlier episodes. Some need further clarification:

--account=AAA-AAAXX sets the project account. If you pay for a premium account, your account code will have this form. If you use the free account, replace the code with free. For this training we provide you with an account so that you do not have to wait as long in the queue.

--reservation=training selects the part of Balena that we have reserved for the training, so that there are specific nodes free to run your jobs. Even with a premium account, which gives you higher priority, your jobs will not necessarily run immediately. The reservation would not normally be in a jobscript, so you should remove it after the lesson. If you keep it, your job will look for a reservation that no longer exists, because the reservation is removed once this training is finished.
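As an illustration of how these directives fit together, here is a minimal sketch of a jobscript header. It is not the contents of hello.slurm: the partition name batch, the node count, and the time limit are assumptions chosen for the example, and the output file names are taken from the files this lesson checks later. The sketch writes the header to a temporary file and then lists only the scheduler directives, which is also a handy way to inspect any jobscript.

```shell
# Write a hypothetical jobscript header to a scratch file.
# All values are placeholders, not the real contents of hello.slurm.
sketch=$(mktemp)
cat > "$sketch" <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --partition=batch
#SBATCH --account=free
#SBATCH --nodes=1
#SBATCH --time=00:05:00
#SBATCH --output=hello.out
#SBATCH --error=hello.err
EOF

# Show only the lines that the Slurm scheduler reads as directives.
grep '^#SBATCH' "$sketch"

# Count them; the shell itself treats every #SBATCH line as a comment.
count=$(grep -c '^#SBATCH' "$sketch")
echo "directives found: $count"
rm -f "$sketch"
```

The same grep command can be pointed at balena-intro/hello/hello.slurm to see the directives actually used in the course example.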

The next section loads modules that enable us to run parallel jobs using the Intel libraries. Notice how we explicitly purge the current modules, then add just the modules that are required for the job to run.

The next section creates a small script. If you are not familiar with Linux scripts, don't worry too much about this. It is standard practice, when writing a program in a new tool or a new language, to write a variant of Hello world that prints hello to the screen. These lines of code essentially create a simple parallel version of the standard Hello world program.
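To give a feel for what a parallel Hello world does, here is a hedged sketch. It is not the script from the course materials: the hello function is hypothetical, and it assumes the launcher exports PMI_RANK and PMI_SIZE to each process, which Intel MPI's launcher normally does but other MPI libraries name differently. The loop only imitates a parallel launcher such as mpirun by starting the same code several times, each copy with its own rank.

```shell
# Hypothetical stand-in for the generated hello script.
# PMI_RANK and PMI_SIZE are assumed to be set by the MPI launcher;
# the defaults (0 and 1) apply when the code runs on its own.
hello() {
    echo "Hello world from process ${PMI_RANK:-0} of ${PMI_SIZE:-1}"
}

# Imitate launching 3 processes: the same code runs three times,
# and only the rank differs between copies. A real launcher starts
# the copies concurrently, so real output may appear in any order.
simulated=$(for rank in 0 1 2; do
    ( PMI_RANK=$rank; PMI_SIZE=3; export PMI_RANK PMI_SIZE; hello )
done)
echo "$simulated"
```

Each line of output comes from a separate copy of the same code, which is the key idea behind how MPI jobs execute.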

The final line of the jobscript executes the script in parallel. You can submit the jobscript with:

sbatch balena-intro/hello/hello.slurm

Check that the job has run correctly by using cat to display the output files hello.out and hello.err.

Modify parallel execution

Modify the final line of the jobscript to the form:

mpirun -n X hello.sh

where X is the number of parallel processes we wish to launch. Submit the job with different values of X and explore the output.

Testing and development

Thus far we have run the job through the queue. However, sometimes we want to check whether our jobscript or code is working correctly. To test this you have two options. Firstly, if you just want to check a jobscript, you can run on up to 4 nodes for up to 15 minutes in the development partition by setting the partition to batch-devel.
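One way to try the development partition without editing the jobscript is to give the options on the sbatch command line, where they take precedence over the #SBATCH lines in the file. The sketch below is an illustration only: run_or_print is a hypothetical helper that prints the command instead of running it when the named program is not installed, and the node count and time limit are example values within the limits given above.

```shell
# Hypothetical helper: run the command if its program exists,
# otherwise just print what would have been run.
run_or_print() {
    if command -v "$1" >/dev/null 2>&1; then
        "$@"
    else
        echo "would run: $*"
    fi
}

# Command-line options override the matching #SBATCH directives,
# so the jobscript itself stays unchanged.
run_or_print sbatch --partition=batch-devel --nodes=1 --time=00:05:00 \
    balena-intro/hello/hello.slurm
```

Whether your account and any reservation in the jobscript are valid for batch-devel is not covered here, so check the output of the submission for errors.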

Alternatively, we may want to run our code interactively, for which we have a number of options. By default, interactive jobs are launched on the itd nodes, a shared resource that may be used by multiple users at the same time:

sinteractive

This launches your job instantly; you can also use the short form sint. To use a different partition, launch the interactive job with:

sinteractive --partition=[partition] <--account=[account id]> <--reservation=[reservation]>

where you should specify the partition, account and reservation if needed. Other parameters such as time and nodes can also be specified.

Interactive hello

Launch an interactive job and run the hello script. Don't forget to load any modules needed by the script.

Job monitoring

Monitoring tools allow you to check the progress of your job, e.g. CPU and memory usage. We will explore how you can do this with the sample HPL job. Change to the job directory and submit the job:

$ cd $SCRATCH/balena-intro/hpl
$ sbatch sample-hpl.slurm

Using the jobid which is returned when you submit the job, get the nodelist where your job is running:

$ squeue --job <jobid>
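The job ID can be picked out of the confirmation message rather than copied by hand. This sketch assumes the usual sbatch message format, "Submitted batch job" followed by the ID; the ID 123456 is invented for illustration.

```shell
# Example confirmation line; on the cluster this would come from
#   submit_output=$(sbatch sample-hpl.slurm)
submit_output="Submitted batch job 123456"

# Keep only the last word of the line, which is the job ID.
jobid="${submit_output##* }"
echo "$jobid"

# With a real job ID, print just the node list, without the header row:
#   squeue --jobs "$jobid" --noheader --format="%N"
```

The %N format field asks squeue for the node list only, which is the value needed for the next step.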

Log on to the node:

$ ssh <nodename>

While you are on the compute node, you can use the following tools to check the CPU and memory usage.

top (CPU, Memory) is a standard command available on Linux systems. Press “q” to exit top.

$ top

htop (CPU, Memory) is a slightly more modern tool which allows you to see CPU usage across all the cores on the node. Press “q” to exit htop.

$ module load htop
$ htop

Job size and QoS

If you are running free jobs, each job is limited to 24 node hours (nodes * time), with maximums of 32 nodes and 6 hours. Thus if you run on a larger node count, you can only run for a shorter time. If you need to run for longer, on more nodes, or just want your jobs to have higher priority, then you can pay for a premium account.
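The trade-off between node count and run time can be worked out with a little arithmetic. The sketch below assumes the three free-queue limits combine exactly as stated (24 node hours, at most 32 nodes, at most 6 hours); max_minutes is a hypothetical helper, not a command provided on Balena.

```shell
# Hypothetical helper: longest free-queue run time, in minutes,
# for a given number of nodes.
max_minutes() {
    nodes=$1
    if [ "$nodes" -lt 1 ] || [ "$nodes" -gt 32 ]; then
        echo "node count must be between 1 and 32" >&2
        return 1
    fi
    # 24 node hours shared across the nodes, expressed in minutes.
    minutes=$(( 24 * 60 / nodes ))
    # Never more than the 6 hour cap, whatever the node count.
    if [ "$minutes" -gt 360 ]; then
        minutes=360
    fi
    echo "$minutes"
}

max_minutes 4    # prints 360 (the 6 hour cap)
max_minutes 8    # prints 180 (3 hours)
max_minutes 32   # prints 45
```

Up to 4 nodes the 6 hour cap is the binding limit; beyond that the 24 node hour budget shortens the allowed run time.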

You can only run one interactive job and one development job at a time.

Key Points:

  • You can submit jobscripts to the scheduler using sbatch
  • #SBATCH directives set parameters for the job
  • You can run interactive jobs with sinteractive
  • MPI jobs run the same code on many CPUs at the same time
  • If your jobs are bigger or you want shorter queue times, you can pay for a premium account