Parallelization Overheads
Relevant for: Tester, Builder, and Developer
Description:
You will learn about overheads for communication and synchronization that are introduced by parallelization (basic level)
You will learn about other sources of parallel inefficiency: load imbalances and hardware effects (basic level)
This skill requires no sub-skills
Level: basic
Parallelization overhead and other sources of parallel inefficiency
Parallelizing a program always introduces extra work in addition to the work done by the sequential version of the program. The main sources of parallelization overhead are data communication (between processes) and synchronization (of processes and threads). Other sources are additional operations introduced at the algorithmic level (for example, in global reduction operations) or at a lower software level (for example, by address calculations).
Data communication is necessary in programs that are parallelized for distributed-memory computers (if no data communication is necessary, the program is called trivially or embarrassingly parallel). The communication effort depends on the communication pattern. Examples of communication patterns are the exchange of data in halo regions (typical for simulation programs that are based on discretized partial differential equations and are parallelized by domain decomposition) and global reduction operations (for example, summing up numbers from all processes, or obtaining a minimum or maximum value). In these examples the additional operations at the algorithmic level (in global reduction operations) and at the software level (extra address calculations for accessing data in halo regions) take only little time compared to the communication itself, because the latter involves the network.
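As a concrete illustration, here is a minimal sketch of a global reduction with MPI (assuming an MPI installation; compile with mpicc and run with mpirun; the variable names are illustrative):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each process contributes one local value. */
    double local = (double)(rank + 1);
    double global_sum;

    /* Global reduction: sums the local values of all processes.
       The communication cost grows with the number of processes
       (typically logarithmically for tree-based implementations). */
    MPI_Allreduce(&local, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global_sum);

    MPI_Finalize();
    return 0;
}
```

Even this single call hides network traffic among all processes, which dominates the cost of the few extra additions it performs at the algorithmic level.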
For programs running with shared-memory parallelization, synchronization plays an important role. Synchronization means that threads have to wait for other threads to complete because they need data that those threads are processing. Overhead is also caused by assigning work to threads (e.g. for executing loops in chunks) and by reduction operations.
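A minimal OpenMP sketch of these overheads (assuming a compiler with OpenMP support, e.g. gcc -fopenmp; the array size is arbitrary):

```c
#include <omp.h>
#include <stdio.h>

#define N 10000000

static double a[N];

int main(void)
{
    double sum = 0.0;

    for (int i = 0; i < N; i++)
        a[i] = 1.0;

    /* Work is assigned to threads in chunks (scheduling overhead);
       each thread accumulates a private partial sum, and the partial
       sums are combined at the end (reduction overhead). All threads
       wait at the implicit barrier after the loop before execution
       continues (synchronization overhead). */
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %.0f\n", sum);
    return 0;
}
```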
Other sources of parallel inefficiency are parts of a program that were not parallelized and still run serially (serial parts), and load that is unevenly distributed among processes or threads (load imbalance). In both cases some hardware is idle for some time, while the goal is to keep all hardware busy all the time.
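The impact of serial parts is commonly quantified by Amdahl's law (a standard result, added here for illustration): if a fraction \(s\) of the sequential runtime cannot be parallelized and \(p\) processing units are used, the achievable speedup is

\[
S(p) = \frac{1}{s + (1-s)/p} \le \frac{1}{s}.
\]

For example, with \(s = 0.05\) the speedup can never exceed 20, no matter how many processors are used.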
There are two hardware effects that can reduce the efficiency of shared-memory parallel programs: NUMA (non-uniform memory access) and false sharing. NUMA can lead to a noticeable performance degradation if data locality is bad (i.e. if too much of the data a thread needs resides outside its NUMA domain). False sharing occurs if threads write to data that lies in the same cache line: every write invalidates the copies of that line in the caches of the other cores, so the line bounces between caches. Effectively, this can lead to serial execution or even take longer than an explicitly serial execution of the affected piece of code.
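A minimal sketch that demonstrates false sharing with OpenMP (assuming a cache-line size of 64 bytes, which is common but not universal; volatile is used only to keep the compiler from collapsing the increment loops):

```c
#include <omp.h>
#include <stdio.h>

#define ITER 100000000L
#define MAX_THREADS 64

/* Adjacent counters: several of them fall into one cache line,
   so concurrent increments cause false sharing. */
volatile long counts_bad[MAX_THREADS];

/* Padded counters: each occupies its own cache line
   (assuming a 64-byte line size). */
struct padded { volatile long value; char pad[64 - sizeof(long)]; };
struct padded counts_good[MAX_THREADS];

int main(void)
{
    double t = omp_get_wtime();
    #pragma omp parallel
    {
        int id = omp_get_thread_num();   /* assumes at most MAX_THREADS threads */
        for (long i = 0; i < ITER; i++)
            counts_bad[id]++;            /* cache line ping-pongs between cores */
    }
    printf("unpadded: %.3f s\n", omp_get_wtime() - t);

    t = omp_get_wtime();
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        for (long i = 0; i < ITER; i++)
            counts_good[id].value++;     /* each thread owns its own line */
    }
    printf("padded:   %.3f s\n", omp_get_wtime() - t);
    return 0;
}
```

On typical multicore hardware the unpadded version is markedly slower, even though both versions perform exactly the same number of increments.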