GSWHC-B Getting Started with HPC Clusters
This skill requires the following sub-skills:
- K1.1-B System Architectures
- K1.2-B Hardware Architectures
- K1.3-B I/O Architectures
- K2-B Performance Modeling
- K3.3-B Parallelization Overheads
- K3.4-B Domain Decomposition
- K4-B Job Scheduling
- USE1-B Use of the Cluster Operating System
- USE2.1-B Use of a Workload Manager
- PE3-B Benchmarking
Level: basic
Introduction – What is HPC?
A tautological definition of HPC is: “You are doing HPC when you are using HPC hardware.” HPC hardware is needed whenever your personal computer (or workstation) becomes too small and/or too slow to complete your computing tasks. Powerful hardware is the common denominator. However, there is no free lunch when it comes to speeding up computations: because a single core delivers roughly the same compute power on a personal computer as on an HPC system, no significant speedup can be expected without employing some form of parallel computing.
HPC stands for high-performance computing. The goal of traditional HPC is to run computer simulations in natural sciences and engineering as fast as possible. Typically, computer simulations need a lot of aggregated computing power to accomplish a single task. To achieve this, parallelization at the task level is a central objective (in addition to achieving high performance at the sequential level). Performance is measured in double-precision floating-point operations per second (abbreviated as FLOPS or Flop/s).
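As a rough, worked illustration of this metric (the node configuration below is hypothetical, not taken from any particular system), the theoretical peak performance of a compute node is the product of the number of cores, the clock frequency, and the number of floating-point operations each core can complete per cycle:

P_peak = (cores) × (clock frequency) × (FLOPs per cycle per core)

For example, 48 cores × 2.5 GHz × 16 FLOPs/cycle = 1920 GFlop/s ≈ 1.9 TFlop/s. Real applications typically achieve only a fraction of this theoretical peak, which is why measured performance (see PE3-B Benchmarking) matters more than nominal numbers.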
The highest performance is achieved with supercomputers that combine a large number of powerful compute nodes with a powerful communication network. The communication network is essential for scaling a single instance of an HPC application up to many nodes. This kind of computing is sometimes called capability computing. In contrast, capacity computing refers to the number of instances that can be run at the same time. In practice, HPC projects have both capability and capacity aspects.
Most newcomers to HPC systems are attracted by the capacity of these systems. Usually, their overall computing needs are small compared with those of traditional HPC. As a consequence, they do not care much about the actual performance of their (serial or parallel) programs. This attitude makes sense if the overall machine time is small (and the effort to speed up a program would cost more than simply running it), although one might hesitate to call this kind of machine usage high-performance computing. Nonetheless, HPC systems deliver performance to these users in terms of a different metric: shorter time-to-solution (higher throughput) rather than Flop/s. In fact, many applications perform high-performance searching (e.g. in gene sequencing), leading to a third performance metric.
HPC systems are sometimes called Linux clusters. They have similar software environments (at least on systems used for academic research) that put some demands on users:
- the operating system is GNU/Linux
- interactive access is limited
- graphical user interfaces are unusual
- the command line has to be used
- a batch system has to be used
- batch jobs are prepared and managed from the command line
- batch jobs have to be formulated as shell scripts (a sketch of such a script follows after this list)
- job inputs must be prepared beforehand
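A minimal sketch of such a batch job script, assuming the Slurm workload manager (other batch systems use different directives) and a hypothetical MPI program ./my_app:

```bash
#!/bin/bash
#SBATCH --job-name=my_first_job    # name shown in the queue
#SBATCH --nodes=1                  # number of compute nodes
#SBATCH --ntasks=4                 # number of parallel (MPI) tasks
#SBATCH --time=00:10:00            # wall-clock time limit (hh:mm:ss)
#SBATCH --output=my_first_job.out  # file that captures stdout/stderr

# Load the required software environment (module names are site-specific).
module load openmpi

# Run the (hypothetical) parallel program on the allocated resources.
srun ./my_app input.dat
```

The script is submitted with sbatch, its state in the queue can be inspected with squeue, and it can be removed with scancel. Note that input.dat must exist before the job starts, in line with the point above about preparing job inputs beforehand.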
In addition:
- parallelization is needed in order to significantly speed up computations
- the basics of parallel computing must be understood
- parallel performance needs to be checked: is the runtime (almost) n times shorter when n times as many compute cores are used? (see the formula after this list)
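Why the speedup is usually somewhat less than n can be made precise with Amdahl's law, a standard model of parallel performance (the numbers below are illustrative). If a fraction p of a program's runtime can be parallelized, the speedup on n cores is bounded by

S(n) = 1 / ((1 - p) + p/n)

For example, with p = 0.95 and n = 16, S(16) = 1 / (0.05 + 0.95/16) ≈ 9.1, well below the ideal speedup of 16. Parallelization overheads (see K3.3-B) reduce the achieved speedup further.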
The goal of this collection of texts is to provide newcomers to HPC with short introductions to these topics.