Job Scheduling
GSWHC-B Getting Started with HPC Clusters \(\rightarrow\) K4-B Job Scheduling
Relevant for: Tester, Builder, and Developer
Description:
You will learn how workload managers control the unattended background execution of programs, or jobs, with the help of job queues (basic level)
You will learn about typical scheduling principles (e.g. first come first served, shortest job first) used to achieve objectives like minimizing the average elapsed program runtime and maximizing the utilization of the available HPC resources (basic level)
This skill requires no sub-skills
Level: basic
Motivation
Users of compute centers typically compete for the expensive HPC resources of cluster systems.
HPC resources can be distinguished as
shared resources (e.g. a parallel file system that is often shared across all cluster nodes and therefore shared between all users),
non-shared resources (e.g. cluster nodes dedicated to the parallel program of an individual user).
The configuration of the cluster system matters as well: a cluster node can also be a resource that is shared between several users.
A major aspect of job scheduling is to manage these resources in a way that users are treated fairly. Accounting for users or user groups can additionally support this.
Two questions that are important to users come up constantly:
When will my job (or my parallel program, respectively) start?
Why is my job still not running at the estimated starting time?
The basic knowledge about job scheduling provided below gives insight into batch systems and enables users to answer such questions.
Batch Systems vs. Time Sharing Systems
Time Sharing Systems
A typical computer system like a PC provides the possibility for interactive communication between the user and the system. The user gives instructions via a keyboard and/or a mouse to the operating system or directly to a program, and the response is immediately displayed on the monitor. Such interactive systems are characterized by the users’ desire for a short response time.
The user can easily experiment and see immediate results, e.g. during interactive program development with the typical edit \(\rightarrow\) build \(\rightarrow\) test cycle (build comprises the steps compile and link). In a typical cluster system such activities are performed simultaneously by a number of users on nodes explicitly provided for that purpose, e.g. login or head nodes. The Linux operating system (nowadays used in almost all cluster systems) provides a time sharing environment, so that all users logged in to a specific node (e.g. the same login or head node) have the impression of using a dedicated computer. Nevertheless, it is important to keep in mind that login and head nodes are shared resources, and all users are advised to use them carefully. It is, for example, perfectly acceptable to perform a CPU-intensive but short build step, but it is not acceptable to start a user-specific parallel program on several cores of a login or head node – not even for testing purposes.
A cluster node is characterized by running its own instance of the operating system on one or more of its CPUs. Because of the nowadays common multi-core CPU architectures the terms core and processor – in its classical meaning of a physical unit capable of executing a program – are used synonymously here.
Simply put, on a single processor machine the scheduling mechanism of the operating system ensures that each of \(n\) CPU intensive processes has its own processor running at \(1/n\) times the speed of the real processor. For a multi-core cluster node (neglecting, for the sake of simplicity, hyper-threading and varying CPU clock rates with features like turbo boost) the same applies analogously: for a sufficiently large \(n\) on a machine with \(c\) cores (i.e. \(n > c\)) the pseudo processor speed available to a process would be \(c/n\) of a real core (if \(n \leq c\), each process can run at the speed of a real core). For an in-depth introduction to operating system topics like process management, memory management, and storage management refer e.g. to Operating System Concepts (ninth edition) by Silberschatz, Galvin, and Gagne. Scheduling implementations for time sharing systems are not discussed in more detail here. The topic of this learning unit is scheduling in batch systems.
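This idealized model can be expressed as a small sketch (the function name is hypothetical, and the model deliberately ignores hyper-threading and turbo boost, as above):

```python
def effective_speed(n_processes: int, n_cores: int) -> float:
    """Fraction of a real core available to each of n CPU-bound
    processes on a node with the given number of cores
    (idealized time sharing model)."""
    if n_processes <= n_cores:
        return 1.0                    # every process gets a full core
    return n_cores / n_processes      # pseudo processor speed c/n

# 8 CPU-bound processes on a 4-core node run at half speed each:
print(effective_speed(8, 4))   # 0.5
print(effective_speed(3, 4))   # 1.0
```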
Batch Systems
A batch job mainly consists of a program or a set of programs being processed by the computer system. One major feature of a batch system is the lack of interaction between the user and the job. The programs to run are typically given by a sequence of operating system commands in a batch file. The user submits the batch file to the batch system for execution. In a cluster system there are (hopefully) more batch jobs submitted than can be executed immediately. Submitted batch jobs are queued for later execution by the batch system.
As a rule of thumb, running jobs on a cluster becomes meaningful when a problem cannot be solved on a desktop PC within a few days. Batch systems are used for executing large parallel jobs that need no interaction.
Job Scheduling
Scheduling is the process of selecting and allocating resources to a job waiting for execution. Queues are used to handle this. As a consequence, waiting times will typically arise before a job is executed.
Linux itself is not a batch system. Workload managers like SLURM or TORQUE are used to provide this functionality in a Linux cluster system.
Scheduling is fundamental, since almost all computer resources are scheduled before use. Job scheduling deals with the problem of deciding which jobs are allocated in which order to cluster nodes. Normally, job scheduling is heavily influenced by resource allocation considerations as specified at job submission time e.g. by the number of nodes (or processors, respectively) required for a job, estimated job runtime limit, etc.
There are many different scheduling algorithms that can be used to achieve different objectives. A typical goal is to minimize the average waiting time for a given set of jobs.
Before scheduling queues are presented in more detail, a set of performance criteria is introduced:
Resource Utilization: For expensive resources like CPUs utilization should be high.
Throughput: A measure of work being done: the number of jobs completed per time unit. For this consideration the jobs can be additionally grouped into different classes, in particular with regard to their runtimes: for long jobs this rate may be 20 jobs per day, for short jobs throughput might be 40 jobs per hour.
Waiting Time: Delay after job submission before the job starts executing.
Execution Time: The time between job execution start and job completion.
Turnaround Time: Waiting time plus execution time. In other words it is the delay between job submission and job completion. Since scheduling does not really affect the execution time of a job, attention is frequently focused on waiting time rather than turnaround time.
Goals of job scheduling are
maximization of resource utilization,
maximization of throughput,
minimization of waiting time,
minimization of turnaround time.
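The criteria above can be made concrete with a toy calculation; all numbers below are made up purely for illustration:

```python
# Each job: (submission_time, start_time, completion_time) in minutes.
jobs = [(0, 0, 30), (5, 30, 40), (10, 40, 70)]

for submit, start, end in jobs:
    waiting = start - submit        # Waiting Time
    execution = end - start         # Execution Time
    turnaround = end - submit       # Turnaround Time
    assert turnaround == waiting + execution
    print(f"waiting={waiting:2d}  execution={execution:2d}  turnaround={turnaround:2d}")
```

Since scheduling changes only the start times, it can improve the waiting and turnaround columns but not the execution column, which is why attention focuses on waiting time.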
Scheduling Algorithms
In order to achieve these goals the following scheduling algorithms may be employed.
First-Come-First-Served (FCFS)
This very simple algorithm is implemented with a First-In-First-Out (FIFO) queue: as its name indicates, jobs are executed in the order of submission. The performance of FCFS is quite poor, and it is not directly suitable for job scheduling. The following supermarket analogy illustrates this: a customer with a well-filled shopping cart is the first to arrive at the checkout. The next three customers arriving have just one banana each in their carts. Obviously, the average waiting time of these four customers will be quite high.
However, FCFS is the basis for more sophisticated algorithms.
Shortest-Job-First (SJF)
This algorithm associates an estimated runtime with each job and selects the job with the smallest runtime available (i.e. the shortest job) in the job queue for next execution. To stay with the supermarket example, the waiting times of the customers with just a banana in their carts are greatly reduced if the customer with the well-filled shopping cart gives them precedence.
SJF is provably optimal with respect to minimizing the average waiting time for a given set of jobs: if a short job is moved before a long job, its waiting time is decreased to a greater extent than the waiting time of the long job is increased.
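The supermarket numbers can be plugged into a small simulation (one 12-minute customer followed by three 1-minute customers; the figures are illustrative):

```python
def avg_waiting(runtimes):
    """Average waiting time when jobs run back to back in the
    given order (all jobs submitted at time 0)."""
    start, waits = 0, []
    for t in runtimes:
        waits.append(start)   # each job waits for all jobs before it
        start += t
    return sum(waits) / len(waits)

fcfs = avg_waiting([12, 1, 1, 1])   # long job submitted first
sjf  = avg_waiting([1, 1, 1, 12])   # shortest jobs scheduled first
print(fcfs)   # 9.75 minutes
print(sjf)    # 1.5 minutes
```

Reordering the same four jobs cuts the average waiting time from 9.75 to 1.5 minutes, while the total execution time is unchanged.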
The scheduler has no way of estimating the runtime of a job by itself. To mitigate this problem, in batch environments the user typically estimates the maximum runtime of the job and specifies this value as a job time limit when the job is submitted. A job that reaches its time limit while still executing is aborted by the batch system. Setting the job time limit accurately is in the user's own interest, and it was already exploited in early batch systems to increase throughput.
A problem – named starvation – may occur if short jobs, which are constantly being submitted, are given precedence all the time, so that the execution of long jobs never starts. Starvation describes the situation of a job that is ready to run but is blocked indefinitely.
In the spirit of SJF, it was common practice in early batch systems to use several job queue classes to achieve high throughput. There might, for example, be FIFO queues for short jobs, medium jobs, and long jobs, with maximum runtimes of 5 minutes, 30 minutes, and any desired longer time span. To draw on the supermarket analogy once more, the queue for short jobs could be represented by an express checkout with no more than 3 items allowed per customer.
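Such queue classes amount to a simple routing rule on the user-specified time limit; a minimal sketch (queue names and thresholds are hypothetical, mirroring the 5- and 30-minute example above):

```python
def route_to_queue(time_limit_minutes: int) -> str:
    """Route a submitted job to a queue class by its time limit.
    Thresholds follow the example in the text: 5 and 30 minutes."""
    if time_limit_minutes <= 5:
        return "short"
    if time_limit_minutes <= 30:
        return "medium"
    return "long"

print(route_to_queue(3))     # short
print(route_to_queue(120))   # long
```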
Priority
A priority is associated with each job, and the resources are allocated to the job with the highest priority. SJF can be regarded as a special case of priority scheduling.
In practice, priorities are defined over value ranges, whereby internal and external priorities can be distinguished.
- Internal priorities arise e.g. from
  - job size, based e.g. on
    - the number of nodes on which to run the program
    - time limits
    - memory limits
  - aging, by which the priority of each job waiting in the system is steadily increased as time goes on
  - other required resources like licenses
- External priorities, as external criteria to the operating system, arise e.g. from
- (scientific) institution sponsoring the (scientific) work
- amount of funds being paid for computer use
- type of funds being paid for computer use
- other, mostly political, factors
The priority of a job affects its position in the job queue: the higher the priority of a newly submitted job, the nearer to the top of the queue it will be inserted. Further job submissions, combined with job priorities possibly changing over time, lead to a very dynamic scheduling system. Information requested by the user about the status of the system, for example the starting time of a job, must therefore be considered a rough estimate.
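Priority scheduling with aging can be sketched as follows (the class, its aging bonus, and all numbers are hypothetical; real workload managers use far richer priority formulas):

```python
class PriorityJobQueue:
    """Waiting jobs gain priority over time (aging), so a
    low-priority job cannot starve indefinitely."""

    def __init__(self, aging_per_tick: int = 1):
        self.aging = aging_per_tick
        self.base = {}         # job name -> base priority
        self.submitted = {}    # job name -> submission tick
        self.clock = 0

    def submit(self, name: str, priority: int) -> None:
        self.base[name] = priority
        self.submitted[name] = self.clock

    def tick(self) -> None:
        self.clock += 1

    def pop_next(self) -> str:
        # effective priority = base priority + aging bonus
        name = max(self.base,
                   key=lambda j: self.base[j]
                       + self.aging * (self.clock - self.submitted[j]))
        del self.base[name], self.submitted[name]
        return name

q = PriorityJobQueue(aging_per_tick=2)
q.submit("long_job", priority=1)
for _ in range(5):
    q.tick()                       # long_job waits for 5 ticks
q.submit("short_job", priority=5)
print(q.pop_next())                # long_job (1 + 2*5 = 11 beats 5)
```

Without aging, the freshly submitted higher-priority job would always win; with it, the waiting job is eventually promoted past newcomers.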
Fair-Share
Priorities, as described above, offer very versatile options for ordering waiting jobs in a FIFO-based job queue.
Besides criteria like job size and aging, a technique named fair-share can be used to take the history of previously submitted jobs of a user (or of users belonging to the same group) into account: when calculating the priority of a newly submitted job, the scheduler considers how much of the computing resources (e.g. computing time) initially granted to the user or user group has already been consumed. The fewer resources were consumed in the past, the higher the job priority is set (and vice versa). Additionally, the stored job history can be weighted in order to decrease the impact of the distant past; jobs older than some threshold could e.g. be regarded as “forgotten”. The overall goal is to treat all users fairly (i.e. to let each obtain a certain share of the resources).
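A minimal sketch of such a fair-share factor (the half-life decay formula, function names, and numbers are all hypothetical; production schedulers use more elaborate models):

```python
def decayed_usage(usage_history, half_life_days: float = 7.0) -> float:
    """Total past usage, weighting each record down by its age in
    days so that the distant past is gradually 'forgotten'."""
    return sum(u * 0.5 ** (age / half_life_days)
               for age, u in usage_history)

def fair_share_factor(granted: float, usage_history) -> float:
    """The smaller the (recently) consumed fraction of the granted
    share, the larger the factor scaling the job priority."""
    used = decayed_usage(usage_history)
    return max(0.0, 1.0 - used / granted)

# (age_in_days, consumed_core_hours) -- made-up numbers:
light_user = [(0, 10)]
heavy_user = [(0, 80)]
print(fair_share_factor(100, light_user))   # high factor (~0.9)
print(fair_share_factor(100, heavy_user))   # low factor  (~0.2)
```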
Backfilling
A user of a batch system may observe that there seem to be enough nodes available for their waiting job, yet it still does not start. The reason might be that a bigger job with a higher priority at the top of the queue, which demands more nodes than are currently available, is also waiting, and the scheduler has already reserved some idle nodes for that bigger job.
In order to mitigate this problem, a technique named backfilling can help to increase resource utilization and to decrease average waiting times by allowing a small job with a low priority to run ahead of a bigger job with a higher priority.
Simply put, in a situation where the scheduler has already reserved some idle nodes for the next big job at the top of the queue, it calculates a backfilling time window based on the job time limits of still running jobs to estimate the starting time of the big job. If, for example, the big job will not start within the next 15 minutes, small jobs with a job time limit of at most 15 minutes and for which the number of reserved idle nodes is sufficient can be started immediately in the backfill window without influencing the planned start time of the big job.
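The backfilling decision described above can be sketched as a simple predicate (parameter names and numbers are hypothetical):

```python
def can_backfill(job_limit_min: int, job_nodes: int,
                 idle_nodes: int, big_job_start_in_min: int) -> bool:
    """A small job may run on nodes reserved for the big job at the
    head of the queue iff it fits on the idle nodes and its time
    limit guarantees it finishes before the big job's estimated start."""
    return (job_nodes <= idle_nodes
            and job_limit_min <= big_job_start_in_min)

# Big job starts in 15 minutes at the earliest; 4 reserved nodes idle:
print(can_backfill(10, 2, idle_nodes=4, big_job_start_in_min=15))  # True
print(can_backfill(20, 2, idle_nodes=4, big_job_start_in_min=15))  # False
```

Note how the guarantee rests entirely on the job time limit: a backfilled job that would overrun the window is simply never started, so the planned start time of the big job is preserved.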