Use of a Workload Manager
GSWHC-B Getting Started with HPC Clusters \(\rightarrow\) USE2.1-B Use of a Workload Manager
Relevant for: Tester, Builder, and Developer
Description:
You will learn to use a workload manager like SLURM or TORQUE to allocate HPC resources (e.g. CPUs) and to submit a batch job (basic level)
You will learn to use shell scripts (basic level) (see also USE1.2-B Using Shell Scripts)
You will learn to use the command line interface (basic level) (see also USE1.1-B Use of the Command Line Interface)
This skill requires the following sub-skills
- USE1.1-B Use of the Command Line Interface (\(\leftarrow\) USE1-B Use of the Cluster Operating System)
- USE1.2-B Using Shell Scripts (\(\leftarrow\) USE1-B Use of the Cluster Operating System)
Level: basic
Workload managers
Batch jobs submitted to a job queue define the workloads in batch systems. A workload manager of a cluster system typically deals with:
- Job Control: providing a user interface for submitting jobs to job queues, monitoring their state during processing (e.g. to check their estimated starting time), and intervening in their execution (e.g. to abort them manually)
- Scheduling and Resource Management: selecting a waiting job for execution and allocating nodes to the job while meeting all its other demands for computing resources (memory, special processing elements like GPUs, etc.)
- Accounting: recording historical data about how many computing resources (e.g. computing time) have been consumed by a job
SLURM (Simple Linux Utility for Resource Management) and TORQUE (Terascale Open-source Resource and QUEue Manager) are widely used open source workload managers for large and small Linux clusters. Both workload managers are controlled via a CLI (Command Line Interface) and offer similar features.
In recent years there appears to be a general trend of scientific institutions replacing TORQUE (or OpenPBS (Open Portable Batch System), from which TORQUE was forked) with SLURM. One reason for this is that SLURM includes all required functionality out of the box, while TORQUE typically has to be combined with additional software, such as the Maui scheduler or the commercial Moab Cluster Suite (which is based on Maui), to make efficient use of the cluster resources.
SLURM
There are three key functions of SLURM described on the SLURM website:
“… First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work. …”
SLURM’s default scheduling is based on a FIFO queue, which is typically enhanced with the Multifactor Priority plugin to obtain a very versatile facility for ordering the queue of jobs waiting to be scheduled. In contrast to other workload managers, SLURM does not use several job queues; instead, cluster administrators can assign the nodes of a SLURM configuration to multiple partitions, which provides the same functionality.
A compute center will seek to configure SLURM in a way that resource utilization and throughput are maximized, waiting times and turnaround times are minimized, and all users are treated fairly.
The basic functionality of SLURM can be divided into three areas:
- Job submission and cancellation
- Monitoring job and system information
- Retrieving accounting information
Job submission and cancellation
There are three commands for handling job submissions:
- sbatch submits a batch job script to SLURM's job queue for (later) execution. The batch script may be given to sbatch by a file name on the command line or can be read from stdin (see the sketch after this list). Resources needed by the job may be specified via command line options and/or directly in the job script. A job script may contain several job steps to perform several parallel tasks within the same script. Job steps themselves may be run sequentially or in parallel. SLURM regards the script itself as the first job step.
- salloc allocates a set of nodes, typically for interactive use. Resources needed may be specified via command line options.
- srun usually runs a command on nodes previously allocated via sbatch or salloc. Each invocation of srun within a job script corresponds to a job step and launches parallel tasks across the allocated resources. A task is represented e.g. by a program, command, or script. If srun is not invoked within an allocation, it first creates a resource allocation (according to its command line options) in which to run the parallel job.
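A minimal sketch of the two submission variants (hello.sh is a hypothetical file name, and the in-line script is purely illustrative):

sbatch hello.sh           # submit a job script given by its file name
sbatch <<'EOF'            # alternatively, read the job script from stdin
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=00:05:00
srun hostname
EOF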
SLURM assigns a unique jobid (an integer number) to each job when it is submitted. The jobid is returned at submission time and can also be obtained from the squeue command.
The scancel command is used to abort a job or job step that is running or waiting for execution.
The scontrol command is mainly used by cluster administrators to view or modify the configuration of the SLURM system, but it also offers users the possibility to control their jobs (e.g. to hold and release a pending job).
The Table below lists basic user activities for job submission and cancellation and the corresponding SLURM commands.
| User activity | SLURM command |
|---|---|
| Submit a job script for (later) execution | sbatch job-script |
| Allocate a set of nodes for interactive use | salloc --nodes=N |
| Launch a parallel task (e.g. program, command, or script) within resources allocated via sbatch (i.e. within a job script) or salloc | srun task |
| Allocate a set of nodes and launch a parallel task directly | srun --nodes=N task |
| Abort a job that is running or waiting for execution | scancel jobid |
| Abort all jobs of a user | scancel --user=username, or generally scancel --user=$USER |
| Put a job on hold (i.e. pause it while it is waiting) or release a job from hold (these related commands are rarely used in standard operation) | scontrol hold jobid, scontrol release jobid |
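For illustration (123456 stands for a hypothetical jobid):

scancel 123456            # abort the job with jobid 123456
scancel --user=$USER      # abort all of your own jobs
scontrol hold 123456      # put a pending job on hold
scontrol release 123456   # release it again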
The major command line options that are used for sbatch and salloc are listed in the Table below. These options can also be specified for srun, if srun is not used in the context of nodes previously allocated via sbatch or salloc.
| Specification | Option | Comments |
|---|---|---|
| Number of nodes requested | --nodes=N | |
| Number of tasks to invoke on each node | --tasks-per-node=n | Can be used to specify the number of cores to use per node, e.g. to avoid hyper-threading. (If this option is omitted, all cores and hyperthreads are used. Hint: using hyperthreads is not always advantageous.) |
| Partition | --partition=partitionname | SLURM supports multiple partitions instead of several queues |
| Job time limit | --time=time-limit | time-limit may be given in minutes or in hh:mm:ss or d-hh:mm:ss format (d is the number of days) |
| Output file | --output=out | Location of stdout redirection |
For the sbatch command these options may also be specified directly in the job script using a pseudo comment directive starting with the prefix #SBATCH. The directives must precede any executable command in the batch script:
#!/bin/bash
#SBATCH --partition=std
#SBATCH --nodes=2
#SBATCH --tasks-per-node=16
#SBATCH --time=00:10:00
...
srun ./helloParallelWorld
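The same resource requests can alternatively be passed on the sbatch command line; options given there take precedence over the corresponding #SBATCH directives. A minimal sketch, assuming the script above is stored in a file named job.sh (a hypothetical name):

sbatch --partition=std --nodes=2 --tasks-per-node=16 --time=00:10:00 job.sh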
A complete list of parameters can be retrieved from the man pages for sbatch, salloc, or srun, e.g. via man sbatch.
Monitoring job and system information
There are four commands for monitoring job and system information:
- sinfo shows current information about nodes and partitions for a system managed by SLURM. Command line options can be used to filter, sort, and format the output in a variety of ways. By default it essentially shows for each partition whether it is available and how many nodes, and which nodes, of the partition are allocated or idle (or are possibly in another state like down or drain, i.e. not available for some time). This is useful for the user, e.g. to decide in which partition to run a job. The numbers of allocated and idle nodes indicate the actual utilization of the cluster.
- squeue shows current information about jobs in the SLURM scheduling queue. Command line options can be used to filter, sort, and format the output in a variety of ways. By default it lists all pending jobs, sorted descending by their priority, followed by all running jobs, sorted descending by their priority. The major job states are:
  - R for Running
  - PD for Pending
  - CD for Completed
  - F for Failed
  - CA for Cancelled
  The TIME column shows the execution time so far for running jobs (or 0:00 for pending jobs). The NODELIST (REASON) column shows either on which nodes a job is running or why the job is pending. A job is pending for one of two main reasons: either it is still waiting for resources to become available, shown as (Resources), or its priority is not yet sufficient for it to be executed, shown as (Priority), i.e. there are other jobs with a higher priority pending in the queue. The position of a pending job in the queue indicates how many jobs will be executed before and after it. The squeue command is the main way to monitor a job and can e.g. also be used to get information about the expected starting time of a job (see Table below).
- sstat is mainly used to display various status information of a running job, taken as a snapshot. The information relates to CPU, task, node, Resident Set Size (RSS), virtual memory (VM), etc.
- scontrol is mainly used by cluster administrators to view or modify the configuration of the SLURM system, but it also offers users the possibility to get some information about the cluster configuration (e.g. about partitions, nodes, and jobs).
The Table below lists basic user activities for job and system monitoring and the corresponding SLURM commands.
| User activity | SLURM command |
|---|---|
| View information about currently available nodes and partitions. The state of a partition may be UP, DOWN, or INACTIVE. If the state is INACTIVE, no new submissions are allowed to the partition. | sinfo [--partition=partitionname] |
| View a summary of currently available nodes and partitions. The NODES(A/I/O/T) column contains the number of nodes that are allocated, idle, or in some other state, and the total of the three numbers. | sinfo -s |
| Check the state of all jobs. | squeue |
| Check the state of all own jobs. | squeue --user=$USER |
| Check the state of a single job. | squeue -j jobid |
| Check the expected starting time of a pending job. | squeue --start -j jobid |
| Display status information of a running job (e.g. average CPU time, average virtual memory (VM) usage; see sstat --helpformat and man sstat for information on more options). | sstat --format=AveCPU,AveVMSize -j jobid |
| View SLURM configuration information for a partition (e.g. associated nodes). | scontrol show partition partitionname |
| View SLURM configuration information for a cluster node. | scontrol show node nodename |
| View detailed job information. | scontrol show job jobid |
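As a concrete illustration of the monitoring commands (123456 is a hypothetical jobid used only for this sketch):

sinfo -s                                    # summary of partitions and node states
squeue --user=$USER                         # all of your own jobs
squeue --start -j 123456                    # expected starting time of a pending job
sstat --format=AveCPU,AveVMSize -j 123456   # status snapshot of a running job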
Retrieving accounting information
There are two commands for retrieving accounting information:
- sacct shows accounting information for jobs and job steps in the SLURM job accounting log or SLURM database. For active jobs the accounting information is accessed via the job accounting log file; for completed jobs it is accessed via the log data saved in the SLURM database. Command line options can be used to filter, sort, and format the output in a variety of ways. Columns for jobid, jobname, partition, account, allocated CPUs, state, and exit code are shown by default for each of the user's jobs that became eligible after midnight of the current day.
- sacctmgr is mainly used by cluster administrators to view or modify the SLURM account information, but it also offers users the possibility to get some information about their account. The account information is maintained within the SLURM database. Command line options can be used to filter, sort, and format the output in a variety of ways.
The Table below lists basic user activities for retrieving accounting information and the corresponding SLURM commands.
User Activities for Retrieving Accounting Information (jobid and startdate are user supplied placeholders)

| User activity | SLURM command |
|---|---|
| View job account information for a specific job. | sacct -j jobid |
| View all job information from a specific start date (given as yyyy-mm-dd). | sacct -S startdate -u $USER |
| View the execution time of a (completed) job (formatted as days-hh:mm:ss, cumulated over job steps, and without any header). | sacct -n -X -P -o Elapsed -j jobid |
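For example (a sketch; 123456 is a hypothetical jobid and 2024-01-01 an arbitrary example date):

sacct -j 123456                        # accounting information for one job
sacct -S 2024-01-01 -u $USER           # all of your own jobs since the given start date
sacct -n -X -P -o Elapsed -j 123456    # elapsed time of the whole job, without header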
Examples
Submitting a batch job
Below an example script for a SLURM batch job – in the sense of a hello world program – is given. The job is suited to be run in the Hummel HPC cluster at Regional Computing Center / Regionales Rechenzentrum der Universität Hamburg (RRZ). For other cluster systems some appropriate adjustments will probably be necessary.
#!/bin/bash
# Do not forget to select a proper partition if the default
# one is no fit for the job! You can do that either in the sbatch
# command line or here with the other settings.
#SBATCH --job-name=hello
#SBATCH --nodes=2
#SBATCH --tasks-per-node=16
#SBATCH --time=00:10:00
# Never forget that! Strange happenings ensue otherwise.
#SBATCH --export=NONE
set -e # Good idea to stop operation on the first error.
source /sw/batch/init.sh
# Load environment modules for your application here.
# Actual work starting here. You might need to call
# srun or mpirun depending on your type of application
# for proper parallel work.
# Example for a simple command (that might itself handle
# parallelisation).
echo "Hello World! I am $(hostname -s) greeting you!"
echo "Also, my current TMPDIR: $TMPDIR"
# Let's pretend our started processes are working on a
# predetermined parameter set, looking up their specific
# parameters using the set number and the process number
# inside the batch job.
export PARAMETER_SET=42
# Simplest way to run an identical command on all allocated
# cores on all allocated nodes. Use environment variables to
# tell apart the instances.
srun bash -c 'echo "process $SLURM_PROCID \
(out of $SLURM_NPROCS total) on $(hostname -s) \
parameter set $PARAMETER_SET"'
The job script file above can be stored e.g. in $HOME/hello_world.sh ($HOME is mapped to the user's home directory). One peculiarity of the Hummel cluster is that the user's home directory is write protected for batch jobs, to avoid unintentionally storing batch job results there.
A corresponding working (sub-)directory can be created before the hello_world.sh job is submitted:
[exampleusername@node001 14:48:33]~$ mkdir $WORK/hello_workdir
[exampleusername@node001 14:48:33]~$ cd $WORK/hello_workdir
The working directory $WORK of a user is a location that is accessible from all nodes of a user's job.
The job is submitted to SLURM's batch queue using the default value for the partition (scontrol show partition, see also above, can be used to show that information):
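For example (a sketch; the prompt follows the pattern used elsewhere on this page, and the second line is sbatch's standard confirmation message):

[exampleusername@node001 14:48:33]~$ sbatch $HOME/hello_world.sh
Submitted batch job 123456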
The output of sbatch contains the jobid, 123456 in this example.
During execution the output of the job is written to a file, named slurm-123456.out in this example:
[exampleusername@node001 14:48:33]~$ cat slurm-123456.out
module: loaded site/slurm
module: loaded site/tmpdir
module: loaded site/hummel
module: loaded env/system-gcc
Hello World! I am node223 greeting you!
Also, my current TMPDIR: /scratch/exampleusername.123456
process 8 (out of 32 total) on node223 parameter set 42
process 15 (out of 32 total) on node223 parameter set 42
process 4 (out of 32 total) on node223 parameter set 42
process 5 (out of 32 total) on node223 parameter set 42
process 9 (out of 32 total) on node223 parameter set 42
process 7 (out of 32 total) on node223 parameter set 42
process 3 (out of 32 total) on node223 parameter set 42
process 6 (out of 32 total) on node223 parameter set 42
process 11 (out of 32 total) on node223 parameter set 42
process 2 (out of 32 total) on node223 parameter set 42
process 13 (out of 32 total) on node223 parameter set 42
process 12 (out of 32 total) on node223 parameter set 42
process 1 (out of 32 total) on node223 parameter set 42
process 10 (out of 32 total) on node223 parameter set 42
process 0 (out of 32 total) on node223 parameter set 42
process 14 (out of 32 total) on node223 parameter set 42
process 28 (out of 32 total) on node224 parameter set 42
process 23 (out of 32 total) on node224 parameter set 42
process 26 (out of 32 total) on node224 parameter set 42
process 27 (out of 32 total) on node224 parameter set 42
process 30 (out of 32 total) on node224 parameter set 42
process 19 (out of 32 total) on node224 parameter set 42
process 18 (out of 32 total) on node224 parameter set 42
process 22 (out of 32 total) on node224 parameter set 42
process 25 (out of 32 total) on node224 parameter set 42
process 17 (out of 32 total) on node224 parameter set 42
process 29 (out of 32 total) on node224 parameter set 42
process 21 (out of 32 total) on node224 parameter set 42
process 24 (out of 32 total) on node224 parameter set 42
process 16 (out of 32 total) on node224 parameter set 42
process 31 (out of 32 total) on node224 parameter set 42
process 20 (out of 32 total) on node224 parameter set 42
If there had been errors (i.e. any output to the stderr stream), a corresponding file named slurm-123456.err would have been created.
Interactive usage under control of the batch system
Interactive sessions under control of the batch system can be created via salloc. salloc differs from sbatch in that resources are initially only reserved (i.e. allocated) without executing a job script. Also, the session runs on the node on which salloc was invoked (i.e. not on a compute node, in contrast to submission with sbatch). This is often useful during the interactive development of a parallel program.
A single node is reserved for interactive usage as follows:
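A minimal sketch (further options such as --partition or --time can be added as needed):

[exampleusername@node001 14:48:33]~$ salloc --nodes=1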
When the resources are granted by SLURM, salloc starts a new shell on the (login or head) node where salloc was executed. This interactive session is terminated by exiting the shell or by reaching the time limit.
An OpenMP program using \(N\) threads, for example, can be started on the allocated node as follows:
[exampleusername@node001 14:48:33]~$ export OMP_NUM_THREADS=N
[exampleusername@node001 14:48:33]~$ srun my-openmp-binary
To start an interactive parallel MPI program, \(N\) nodes can be allocated as follows:
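For example (a sketch following the pattern above, with N replaced by the desired number of nodes):

[exampleusername@node001 14:48:33]~$ salloc --nodes=N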
The MPI program, using \(n=32\) processes for example, can be started on the allocated nodes as follows:
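A sketch (my-mpi-binary is a hypothetical program name; on some clusters mpirun is used instead of srun):

[exampleusername@node001 14:48:33]~$ srun --ntasks=32 my-mpi-binary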
Another way to use the allocated nodes is to establish connections to them via ssh.
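For instance (a sketch; 123456 is a hypothetical jobid, and node223 stands for a node name taken from the output of the first command):

[exampleusername@node001 14:48:33]~$ squeue -j 123456 -o "%N"   # show the allocated node list
[exampleusername@node001 14:48:33]~$ ssh node223                # log into one of the allocated nodes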
TORQUE
Since PBS and TORQUE offer similar features to SLURM, it is appropriate to show the corresponding commands and options in tables. A good overview of this topic is also given, e.g., by the Division of Information Technology at the University of Maryland.
| User Activity | PBS/TORQUE | SLURM |
|---|---|---|
| Submitting a job | qsub jobscript | sbatch jobscript |
| Using nodes interactively | qsub -I [options] | salloc [options] |
| Deleting or canceling a job | qdel jobid | scancel jobid |
| Viewing job status | qstat jobid | squeue -j jobid |
| Viewing the status of all of one's own jobs | qstat -u $USER | squeue -u $USER |
| Viewing all jobs in the queue | qstat [-a] | squeue |
| Checking the expected starting time of a job | | squeue --start -j jobid |
Additional job parameters are listed below. These options are typically used in job scripts using a pseudo comment directive starting with a special prefix. The directives must precede any executable command in the batch script:
| Job Parameter | PBS/TORQUE | SLURM |
|---|---|---|
| Directive | #PBS | #SBATCH |
| Job name | -N name | --job-name=name |
| Queue/Partition | -q queuename | --partition=partitionname |
| Nodes | -l nodes=n | --nodes=n |
| Processes per node | -l ppn=n | --tasks-per-node=n |
| CPUs per task | -l ompthreads=n | --cpus-per-task=n |
| Job time limit | -l walltime=seconds or -l walltime=hh:mm:ss | --time=minutes or --time=hh:mm:ss |
| stdout redirection to a file | -o filename | --output=filename |
| stderr redirection to a file | -e filename | --error=filename |
| stdout and stderr redirection to one file | -j oe -o filename | --output=filename and not specifying --error=filename |
| Email address | -M address | --mail-user=address |
| Notify on state change of the job | -m b, -m e, -m a, -m abe | --mail-type=BEGIN, --mail-type=END, --mail-type=FAIL, --mail-type=ALL |
| Do not permit the job to be requeued after a node failure | -r n | --no-requeue |
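As a sketch, a PBS/TORQUE header roughly equivalent to the SLURM hello world script above might look as follows (std is an assumed queue name; in practice the node and processes-per-node requests are usually combined into a single resource specification, as shown):

#!/bin/bash
#PBS -N hello
#PBS -q std
#PBS -l nodes=2:ppn=16
#PBS -l walltime=00:10:00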
During job runtime several environment variables are automatically set:
| Content of Environment Variable | PBS/TORQUE | SLURM |
|---|---|---|
| JobId | $PBS_JOBID | $SLURM_JOBID |
| Working directory from which the job was submitted | $PBS_O_WORKDIR | $SLURM_SUBMIT_DIR |
| List of allocated nodes | $PBS_NODEFILE (a file name) | $SLURM_JOB_NODELIST (the node list itself) |
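A minimal sketch of using these variables inside a job script (the echo line is purely illustrative):

# SLURM version:
cd "$SLURM_SUBMIT_DIR"
echo "Job $SLURM_JOBID runs on: $SLURM_JOB_NODELIST"
# PBS/TORQUE equivalent:
# cd "$PBS_O_WORKDIR"
# echo "Job $PBS_JOBID runs on: $(sort -u $PBS_NODEFILE | tr '\n' ' ')"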