Use of a Workload Manager
GSWHC-B Getting Started with HPC Clusters \(\rightarrow\) USE2.1-B Use of a Workload Manager
Relevant for: Tester, Builder, and Developer
Description:
You will learn to use a workload manager like SLURM or TORQUE to allocate HPC resources (e.g. CPUs) and to submit a batch job (basic level)
You will learn to use shell scripts (basic level) (see also USE1.2-B Using Shell Scripts)
You will learn to use the command line interface (basic level) (see also USE1.1-B Use of the Command Line Interface)
This skill requires the following sub-skills
- USE1.1-B Use of the Command Line Interface (\(\leftarrow\) USE1-B Use of the Cluster Operating System)
- USE1.2-B Using Shell Scripts (\(\leftarrow\) USE1-B Use of the Cluster Operating System)
Level: basic
Workload managers
Batch jobs submitted to a job queue define the workloads in batch systems. A workload manager of a cluster system typically deals with:
- Job Control: providing a user interface for submitting jobs to job queues, monitoring their state during processing (e.g. to check their estimated starting time), and intervening in their execution (e.g. to abort them manually)
- Scheduling and Resource Management: selecting a waiting job for execution and allocating nodes to the job while meeting all its other demands for computing resources (memory, special processing elements like GPUs, etc.)
- Accounting: recording historical data about how many computing resources (e.g. computing time) have been consumed by a job
SLURM (Simple Linux Utility for Resource Management) and TORQUE (Terascale Open-source Resource and QUEue Manager) are widely used open source workload managers for large and small Linux clusters. Both workload managers are controlled via a CLI (Command Line Interface) and offer similar features.
In recent years there appears to be a general trend of scientific institutions replacing TORQUE (or OpenPBS (Open Portable Batch System), from which TORQUE was forked) with SLURM. One reason for this is that SLURM includes all required functionality out of the box, while TORQUE typically has to be combined with additional software, such as the Maui scheduler or the commercial Moab Cluster Suite (which is based on Maui), to make efficient use of the cluster resources.
SLURM
There are three key functions of SLURM described on the SLURM website:
“… First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work. …”
SLURM’s default scheduling is based on a FIFO queue, which is typically enhanced with the Multifactor Priority plugin to obtain a very versatile facility for ordering the queue of jobs waiting to be scheduled. In contrast to other workload managers, SLURM does not use several job queues; instead, cluster administrators can assign the nodes of a SLURM configuration to multiple partitions, which provides the same functionality.
A compute center will seek to configure SLURM in a way that resource utilization and throughput are maximized, waiting times and turnaround times are minimized, and all users are treated fairly.
The basic functionality of SLURM can be divided into three areas:
- Job submission and cancellation
- Monitoring job and system information
- Retrieving accounting information
Job submission and cancellation
There are three commands for handling job submissions:
- sbatch submits a batch job script to SLURM's job queue for (later) execution. The batch script may be given to sbatch by a file name on the command line or can be read from stdin (see the sketch after this list). Resources needed by the job may be specified via command line options and/or directly in the job script. A job script may contain several job steps to perform several parallel tasks within the same script. Job steps themselves may be run sequentially or in parallel. SLURM regards the script itself as the first job step.
- salloc allocates a set of nodes, typically for interactive use. Resources needed may be specified via command line options.
- srun usually runs a command on nodes previously allocated via sbatch or salloc. Each invocation of srun within a job script corresponds to a job step and launches parallel tasks across the allocated resources. A task is represented e.g. by a program, command, or script. If srun is not invoked within an allocation, it first creates a resource allocation (according to its command line options) in which to run the parallel job.
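A minimal sketch of the two submission variants (hello.sh is a hypothetical file name, and the in-line script is purely illustrative):

sbatch hello.sh           # submit a job script given by its file name
sbatch <<'EOF'            # alternatively, read the job script from stdin
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --time=00:05:00
srun hostname
EOF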
SLURM assigns a unique jobid (an integer number) to each job when it is submitted. The jobid is returned at submission time and can also be obtained from the squeue command.
The scancel command is used to abort a job or job step that is running or waiting for execution.
The scontrol command is mainly used by cluster administrators to view or modify the configuration of the SLURM system, but it also offers users the possibility to control their jobs (e.g. to hold and release a pending job).
The Table below lists basic user activities for job submission and cancellation and the corresponding SLURM commands.
| User activity | SLURM command |
|---|---|
| Submit a job script for (later) execution | sbatch job-script |
| Allocate a set of nodes for interactive use | salloc --nodes=N |
| Launch a parallel task (e.g. program, command, or script) within resources allocated via sbatch (i.e. within a job script) or salloc | srun task |
| Allocate a set of nodes and launch a parallel task directly | srun --nodes=N task |
| Abort a job that is running or waiting for execution | scancel jobid |
| Abort all jobs of a user | scancel --user=username, or generally scancel --user=$USER |
| Put a job on hold (i.e. pause it while it is waiting) or release a job from hold (these related commands are rarely used in standard operation) | scontrol hold jobid, scontrol release jobid |
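For illustration (123456 stands for a hypothetical jobid):

scancel 123456            # abort the job with jobid 123456
scancel --user=$USER      # abort all of your own jobs
scontrol hold 123456      # put a pending job on hold
scontrol release 123456   # release it again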
The major command line options that are used for sbatch and salloc are listed in the Table below. These options can also be specified for srun, if srun is not used in the context of nodes previously allocated via sbatch or salloc.
| Specification | Option | Comments |
|---|---|---|
| Number of nodes requested | --nodes=N | |
| Number of tasks to invoke on each node | --tasks-per-node=n | Can be used to specify the number of cores to use per node, e.g. to avoid hyper-threading. (If this option is omitted, all cores and hyperthreads are used. Hint: using hyperthreads is not always advantageous.) |
| Partition | --partition=partitionname | SLURM supports multiple partitions instead of several queues |
| Job time limit | --time=time-limit | time-limit may be given in minutes or in hh:mm:ss or d-hh:mm:ss format (d is the number of days) |
| Output file | --output=out | Location of stdout redirection |
For the sbatch command these options may also be specified directly in the job script using a pseudo comment directive starting with the prefix #SBATCH. The directives must precede any executable command in the batch script:
#!/bin/bash
#SBATCH --partition=std
#SBATCH --nodes=2
#SBATCH --tasks-per-node=16
#SBATCH --time=00:10:00
...
srun ./helloParallelWorld
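The same resource requests can alternatively be passed on the sbatch command line; options given there take precedence over the corresponding #SBATCH directives. A minimal sketch, assuming the script above is stored in a file named job.sh (a hypothetical name):

sbatch --partition=std --nodes=2 --tasks-per-node=16 --time=00:10:00 job.sh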
A complete list of parameters can be retrieved from the man pages for sbatch, salloc, or srun, e.g. via man sbatch.
Monitoring job and system information
There are four commands for monitoring job and system information:
- sinfo shows current information about nodes and partitions for a system managed by SLURM. Command line options can be used to filter, sort, and format the output in a variety of ways. By default it essentially shows for each partition whether it is available and how many nodes, and which nodes, of the partition are allocated or idle (or are possibly in another state like down or drain, i.e. not available for some time). This is useful for the user, e.g. to decide in which partition to run a job. The numbers of allocated and idle nodes indicate the actual utilization of the cluster.
- squeue shows current information about jobs in the SLURM scheduling queue. Command line options can be used to filter, sort, and format the output in a variety of ways. By default it lists all pending jobs, sorted descending by their priority, followed by all running jobs, sorted descending by their priority. The major job states are:
  - R for Running
  - PD for Pending
  - CD for Completed
  - F for Failed
  - CA for Cancelled
  The TIME column shows the execution time so far for running jobs (or 0:00 for pending jobs). The NODELIST (REASON) column shows either on which nodes a job is running or why the job is pending. A job is pending for one of two main reasons: either it is still waiting for resources to become available, shown as (Resources), or its priority is not yet sufficient for it to be executed, shown as (Priority), i.e. there are other jobs with a higher priority pending in the queue. The position of a pending job in the queue indicates how many jobs will be executed before and after it. The squeue command is the main way to monitor a job and can e.g. also be used to get information about the expected starting time of a job (see Table below).
- sstat is mainly used to display various status information of a running job, taken as a snapshot. The information relates to CPU, task, node, Resident Set Size (RSS), virtual memory (VM), etc.
- scontrol is mainly used by cluster administrators to view or modify the configuration of the SLURM system, but it also offers users the possibility to get some information about the cluster configuration (e.g. about partitions, nodes, and jobs).
The Table below lists basic user activities for job and system monitoring and the corresponding SLURM commands.
| User activity | SLURM command |
|---|---|
| View information about currently available nodes and partitions. The state of a partition may be UP, DOWN, or INACTIVE. If the state is INACTIVE, no new submissions are allowed to the partition. | sinfo [--partition=partitionname] |
| View a summary of currently available nodes and partitions. The NODES(A/I/O/T) column contains the number of nodes that are allocated, idle, or in some other state, and the total of the three numbers. | sinfo -s |
| Check the state of all jobs. | squeue |
| Check the state of all own jobs. | squeue --user=$USER |
| Check the state of a single job. | squeue -j jobid |
| Check the expected starting time of a pending job. | squeue --start -j jobid |
| Display status information of a running job (e.g. average CPU time, average virtual memory (VM) usage; see sstat --helpformat and man sstat for information on more options). | sstat --format=AveCPU,AveVMSize -j jobid |
| View SLURM configuration information for a partition (e.g. associated nodes). | scontrol show partition partitionname |
| View SLURM configuration information for a cluster node. | scontrol show node nodename |
| View detailed job information. | scontrol show job jobid |
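As a concrete illustration of the monitoring commands (123456 is a hypothetical jobid used only for this sketch):

sinfo -s                                    # summary of partitions and node states
squeue --user=$USER                         # all of your own jobs
squeue --start -j 123456                    # expected starting time of a pending job
sstat --format=AveCPU,AveVMSize -j 123456   # status snapshot of a running job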
Retrieving accounting information
There are two commands for retrieving accounting information:
- sacct shows accounting information for jobs and job steps in the SLURM job accounting log or SLURM database. For active jobs the accounting information is accessed via the job accounting log file; for completed jobs it is accessed via the log data saved in the SLURM database. Command line options can be used to filter, sort, and format the output in a variety of ways. Columns for jobid, jobname, partition, account, allocated CPUs, state, and exit code are shown by default for each of the user's jobs that became eligible after midnight of the current day.
- sacctmgr is mainly used by cluster administrators to view or modify the SLURM account information, but it also offers users the possibility to get some information about their account. The account information is maintained within the SLURM database. Command line options can be used to filter, sort, and format the output in a variety of ways.
The Table below lists basic user activities for retrieving accounting information and the corresponding SLURM commands.
User Activities for Retrieving Accounting Information (jobid and startdate are user supplied placeholders)

| User activity | SLURM command |
|---|---|
| View job account information for a specific job. | sacct -j jobid |
| View all job information from a specific start date (given as yyyy-mm-dd). | sacct -S startdate -u $USER |
| View the execution time of a (completed) job (formatted as days-hh:mm:ss, cumulated over job steps, and without any header). | sacct -n -X -P -o Elapsed -j jobid |
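For example (a sketch; 123456 is a hypothetical jobid and 2024-01-01 an arbitrary example date):

sacct -j 123456                        # accounting information for one job
sacct -S 2024-01-01 -u $USER           # all of your own jobs since the given start date
sacct -n -X -P -o Elapsed -j 123456    # elapsed time of the whole job, without header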
Examples
Submitting a batch job
Below an example script for a SLURM batch job – in the sense of a hello world program – is given. The job is suited to be run in the Hummel HPC cluster at Regional Computing Center / Regionales Rechenzentrum der Universität Hamburg (RRZ). For other cluster systems some appropriate adjustments will probably be necessary.
#!/bin/bash
# Do not forget to select a proper partition if the default
# one is no fit for the job! You can do that either in the sbatch
# command line or here with the other settings.
#SBATCH --job-name=hello
#SBATCH --nodes=2
#SBATCH --tasks-per-node=16
#SBATCH --time=00:10:00
# Never forget that! Strange happenings ensue otherwise.
#SBATCH --export=NONE
set -e # Good idea to stop operation on the first error.
source /sw/batch/init.sh
# Load environment modules for your application here.
# Actual work starting here. You might need to call
# srun or mpirun depending on your type of application
# for proper parallel work.
# Example for a simple command (that might itself handle
# parallelisation).
echo "Hello World! I am $(hostname -s) greeting you!"
echo "Also, my current TMPDIR: $TMPDIR"
# Let's pretend our started processes are working on a
# predetermined parameter set, looking up their specific
# parameters using the set number and the process number
# inside the batch job.
export PARAMETER_SET=42
# Simplest way to run an identical command on all allocated
# cores on all allocated nodes. Use environment variables to
# tell apart the instances.
srun bash -c 'echo "process $SLURM_PROCID \
(out of $SLURM_NPROCS total) on $(hostname -s) \
parameter set $PARAMETER_SET"'
The job script file above can be stored e.g. in $HOME/hello_world.sh ($HOME is mapped to the user's home directory). One peculiarity of the Hummel cluster is that the user's home directory is write protected for batch jobs, to avoid unintentionally storing batch job results there.
A corresponding working (sub-)directory can be created before the hello_world.sh job is submitted:
[exampleusername@node001 14:48:33]~$ mkdir $WORK/hello_workdir
[exampleusername@node001 14:48:33]~$ cd $WORK/hello_workdir
The working directory $WORK of a user is a location that is accessible from all nodes of a user's job.
The job is submitted to SLURM's batch queue using the default value for the partition (scontrol show partition, see also above, can be used to show that information):
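For example (a sketch; the prompt follows the pattern used elsewhere on this page, and the second line is sbatch's standard confirmation message):

[exampleusername@node001 14:48:33]~$ sbatch $HOME/hello_world.sh
Submitted batch job 123456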
The output of sbatch contains the jobid, 123456 in this example.
During execution the output of the job is written to a file, named slurm-123456.out in this example:
[exampleusername@node001 14:48:33]~$ cat slurm-123456.out
module: loaded site/slurm
module: loaded site/tmpdir
module: loaded site/hummel
module: loaded env/system-gcc
Hello World! I am node223 greeting you!
Also, my current TMPDIR: /scratch/exampleusername.123456
process 8 (out of 32 total) on node223 parameter set 42
process 15 (out of 32 total) on node223 parameter set 42
process 4 (out of 32 total) on node223 parameter set 42
process 5 (out of 32 total) on node223 parameter set 42
process 9 (out of 32 total) on node223 parameter set 42
process 7 (out of 32 total) on node223 parameter set 42
process 3 (out of 32 total) on node223 parameter set 42
process 6 (out of 32 total) on node223 parameter set 42
process 11 (out of 32 total) on node223 parameter set 42
process 2 (out of 32 total) on node223 parameter set 42
process 13 (out of 32 total) on node223 parameter set 42
process 12 (out of 32 total) on node223 parameter set 42
process 1 (out of 32 total) on node223 parameter set 42
process 10 (out of 32 total) on node223 parameter set 42
process 0 (out of 32 total) on node223 parameter set 42
process 14 (out of 32 total) on node223 parameter set 42
process 28 (out of 32 total) on node224 parameter set 42
process 23 (out of 32 total) on node224 parameter set 42
process 26 (out of 32 total) on node224 parameter set 42
process 27 (out of 32 total) on node224 parameter set 42
process 30 (out of 32 total) on node224 parameter set 42
process 19 (out of 32 total) on node224 parameter set 42
process 18 (out of 32 total) on node224 parameter set 42
process 22 (out of 32 total) on node224 parameter set 42
process 25 (out of 32 total) on node224 parameter set 42
process 17 (out of 32 total) on node224 parameter set 42
process 29 (out of 32 total) on node224 parameter set 42
process 21 (out of 32 total) on node224 parameter set 42
process 24 (out of 32 total) on node224 parameter set 42
process 16 (out of 32 total) on node224 parameter set 42
process 31 (out of 32 total) on node224 parameter set 42
process 20 (out of 32 total) on node224 parameter set 42
If there had been errors (i.e. any output to the stderr stream), a corresponding file named slurm-123456.err would have been created.
Interactive usage under control of the batch system
Interactive sessions under control of the batch system can be created via salloc. salloc differs from sbatch in that resources are initially only reserved (i.e. allocated) without executing a job script. Also, the session runs on the node on which salloc was invoked (i.e. not on a compute node, in contrast to submission with sbatch). This is often useful during the interactive development of a parallel program.
A single node is reserved for interactive usage as follows:
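A minimal sketch (further options such as --partition or --time can be added as needed):

[exampleusername@node001 14:48:33]~$ salloc --nodes=1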
When the resources are granted by SLURM, salloc starts a new shell on the (login or head) node where salloc was executed. This interactive session is terminated by exiting the shell or by reaching the time limit.
An OpenMP program using \(N\) threads, for example, can be started on the allocated node as follows:
[exampleusername@node001 14:48:33]~$ export OMP_NUM_THREADS=N
[exampleusername@node001 14:48:33]~$ srun my-openmp-binary
To start an interactive parallel MPI program, \(N\) nodes can be allocated as follows:
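For example (a sketch following the pattern above, with N replaced by the desired number of nodes):

[exampleusername@node001 14:48:33]~$ salloc --nodes=N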
The MPI program, using \(n=32\) processes for example, can be started on the allocated nodes as follows:
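A sketch (my-mpi-binary is a hypothetical program name; on some clusters mpirun is used instead of srun):

[exampleusername@node001 14:48:33]~$ srun --ntasks=32 my-mpi-binary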
Another way to use the allocated nodes is to establish connections to them via ssh.
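For instance (a sketch; 123456 is a hypothetical jobid, and node223 stands for a node name taken from the output of the first command):

[exampleusername@node001 14:48:33]~$ squeue -j 123456 -o "%N"   # show the allocated node list
[exampleusername@node001 14:48:33]~$ ssh node223                # log into one of the allocated nodes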
TORQUE
Since PBS and TORQUE offer similar features to SLURM, it is appropriate to show the corresponding commands and options in tables. A good overview of this topic is also given, e.g., by the Division of Information Technology at the University of Maryland.
| User Activity | PBS/TORQUE | SLURM |
|---|---|---|
| Submitting a job | qsub jobscript | sbatch jobscript |
| Using nodes interactively | qsub -I [options] | salloc [options] |
| Deleting or canceling a job | qdel jobid | scancel jobid |
| Viewing job status | qstat jobid | squeue -j jobid |
| Viewing the status of all of one's own jobs | qstat -u $USER | squeue -u $USER |
| Viewing all jobs in the queue | qstat [-a] | squeue |
| Checking the expected starting time of a job | | squeue --start -j jobid |
Additional job parameters are listed below. These options are typically used in job scripts using a pseudo comment directive starting with a special prefix. The directives must precede any executable command in the batch script:
| Job Parameter | PBS/TORQUE | SLURM |
|---|---|---|
| Directive | #PBS | #SBATCH |
| Job name | -N name | --job-name=name |
| Queue/Partition | -q queuename | --partition=partitionname |
| Nodes | -l nodes=n | --nodes=n |
| Processes per node | -l ppn=n | --tasks-per-node=n |
| CPUs per task | -l ompthreads=n | --cpus-per-task=n |
| Job time limit | -l walltime=seconds or -l walltime=hh:mm:ss | --time=minutes or --time=hh:mm:ss |
| stdout redirection to a file | -o filename | --output=filename |
| stderr redirection to a file | -e filename | --error=filename |
| stdout and stderr redirection to one file | -j oe -o filename | --output=filename and not specifying --error=filename |
| Email address | -M address | --mail-user=address |
| Notify on state change of the job | -m b, -m e, -m a, -m abe | --mail-type=BEGIN, --mail-type=END, --mail-type=FAIL, --mail-type=ALL |
| Do not permit the job to be requeued after a node failure | -r n | --no-requeue |
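As a sketch, a PBS/TORQUE header roughly equivalent to the SLURM hello world script above might look as follows (std is an assumed queue name; in practice the node and processes-per-node requests are usually combined into a single resource specification, as shown):

#!/bin/bash
#PBS -N hello
#PBS -q std
#PBS -l nodes=2:ppn=16
#PBS -l walltime=00:10:00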
During job runtime several environment variables are automatically set:
| Content of Environment Variable | PBS/TORQUE | SLURM |
|---|---|---|
| JobId | $PBS_JOBID | $SLURM_JOBID |
| Working directory from which the job was submitted | $PBS_O_WORKDIR | $SLURM_SUBMIT_DIR |
| List of allocated nodes | $PBS_NODEFILE (a file name) | $SLURM_JOB_NODELIST (the node list itself) |
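A minimal sketch of using these variables inside a job script (the echo line is purely illustrative):

# SLURM version:
cd "$SLURM_SUBMIT_DIR"
echo "Job $SLURM_JOBID runs on: $SLURM_JOB_NODELIST"
# PBS/TORQUE equivalent:
# cd "$PBS_O_WORKDIR"
# echo "Job $PBS_JOBID runs on: $(sort -u $PBS_NODEFILE | tr '\n' ' ')"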