Performance Frontiers
The three local compute centers, DKRZ, RRZ, and TUHH RZ, operate typical Linux-based HPC cluster systems. Parallel programs are required to exploit the bundled computing power of the cluster nodes; this way the runtime of a parallel program can be reduced in comparison to the corresponding sequential program version running on a single processor. Cluster nodes can essentially be considered as PCs (Personal Computers) with powerful CPUs (Central Processing Units), interconnected by a particularly powerful network with high bandwidth and low latency (based on InfiniBand technology in the present case). Because multi-core CPU architectures are widespread nowadays, the terms core and processor (the latter in its classical meaning of a physical unit capable of executing a program) are used synonymously on the HHCC website.
A key measure for the performance of such HPC systems is the number of floating point operations per second (FLOPS), which is on the order of several PetaFLOPS for the top HPC systems of 2017. One PetaFLOPS (PFLOPS) corresponds to 10^15 FLOPS or 10^3 TeraFLOPS (TFLOPS), respectively. In comparison, a stand-alone computer like a powerful PC achieves a performance of about 1 TeraFLOPS. The renowned TOP500 list of HPC systems provides a good overview of the performance observed in the past and the increase in computing power expected for the future.
Moore's Law
In simple terms, Moore's law from 1965, revised in 1975, states that the complexity of integrated circuits, and thus the computing power of CPUs, doubles approximately every two years. In the past this held true. For some time, however, the performance gain has no longer been achieved through improvements of sequential processor technology (CPU clock rates, for instance, have not increased notably for several years) but rather by using many cores to process a task in parallel. Therefore parallel computing and HPC systems will become increasingly relevant in the future.
Speedup, Efficiency, and Scalability
Simply put, speedup is the ratio of the sequential to the parallel runtime of a program. Given a sequential runtime T1 on a single processor and a parallel runtime Tn on n processors, the speedup is given by
Sn = T1 / Tn.
In the ideal case, the speedup increases linearly with the number of processors, i.e. Sn = n.
In practice, however, a linear speedup is not achievable due to overheads for synchronization (e.g. waiting for partial results) and communication (e.g. distributing partial tasks and collecting partial results).
Efficiency is defined accordingly by
En = Sn / n.
If the efficiency remains high when the number of processors is increased, the parallel program is said to have good scalability.
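To make these definitions concrete, here is a minimal Python sketch computing speedup and efficiency from given runtimes (the runtimes below are made-up illustrative values, not measurements from the clusters mentioned above):

    def speedup(t_seq, t_par):
        """Speedup Sn = T1 / Tn."""
        return t_seq / t_par

    def efficiency(t_seq, t_par, n):
        """Efficiency En = Sn / n."""
        return speedup(t_seq, t_par) / n

    # Illustrative runtimes: 120 s sequentially, 20 s on 8 processors.
    t1, t8 = 120.0, 20.0
    print(speedup(t1, t8))        # 6.0
    print(efficiency(t1, t8, 8))  # 0.75, i.e. 75%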
For a problem that can be parallelized trivially, e.g. rendering independent computer animation frames, a nearly linear speedup can be achieved even for a large number of processors. At the other extreme, there are algorithms of a so-called sequential nature, e.g. alpha-beta game-tree search, that are notoriously difficult to parallelize. Typical problems in the field of scientific computing lie somewhere in between these extremes, and the challenge is to achieve both good speedup and good efficiency as the number of processors grows. Another important aspect is to compare against the best known sequential algorithm in order to obtain fair speedup results.
Amdahl's Law
In simple terms, Amdahl's law from 1967 states that there is an upper limit to the speedup achievable by a parallel program, determined by its sequential, i.e. non-parallelizable, part, e.g. I/O operations or, more generally, synchronization and communication overheads. If s denotes the sequential fraction of the runtime, the speedup on n processors is bounded by Sn = 1 / (s + (1 - s) / n), which approaches 1/s as n grows.
Here is a simple example to illustrate this:
A program has a sequential runtime of 20 hours on a single core and contains a non-parallelizable part of 10%, i.e. 2 hours. Even if the parallelizable part of 18 hours is parallelized extremely well, the total runtime of the program will be at least 2 hours. That is, the maximum speedup is bounded by 20 hours / 2 hours = 10, regardless of how many cores are available for the parallelization. If 32 cores were used and the parallelizable part of the runtime were ideally reduced to 18 hours / 32 = 0.5625 hours, the total runtime would be at least 2.5625 hours, i.e. the speedup would be
S32 = 20 hours / 2.5625 hours ≈ 7.80
and the efficiency would only be
E32 = S32 / 32 = 7.80 / 32 ≈ 24.4%.
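The numbers of this example can be reproduced with a small Python sketch of Amdahl's law, where s is the sequential fraction of the runtime:

    def amdahl_speedup(s, n):
        # Amdahl's law: Sn = 1 / (s + (1 - s) / n)
        return 1.0 / (s + (1.0 - s) / n)

    s = 0.10  # 2 of 20 hours are non-parallelizable
    for n in (1, 32, 1024):
        sn = amdahl_speedup(s, n)
        print(f"n = {n:5d}: Sn = {sn:5.2f}, En = {sn / n:8.2%}")

For n = 32 this prints Sn = 7.80 and En = 24.39%; even for 1024 cores the speedup reaches only about 9.91, approaching but never exceeding the limit 1/s = 10.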