Domain Decomposition
Relevant for: Tester, Builder, and Developer
Description:
You will learn about the typical strategy for decomposing work into parallel tasks that is used for parallelizing simulation programs in science and engineering (basic level)
You will learn about the surface to volume ratio which is important to understand the performance impact of a decomposition (basic level)
This skill requires no sub-skills
Level: basic
Domain Decomposition
Domain decomposition is a technique for parallelizing programs that perform simulations in engineering or the natural sciences. Such a technique is needed on distributed memory systems: there, the computational work and, in general, the data have to be decomposed to enable parallel computations that employ several compute processes (which in turn makes it possible to run on multiple compute nodes).
Many computer simulations are defined on meshes that arise from discretizations of partial differential equations. Another class are particle simulations (e.g. of many interacting molecules or stars), where the particles can move in a box. The box or the mesh defines a geometric region. In a domain decomposition this region is decomposed: a box is split into smaller boxes and a mesh into smaller parts. The resulting sub-domains are assigned to processes, and each process updates only the data that belongs to its sub-domain.
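To make this concrete, the following minimal sketch in C (not from the original text; the helper name block_range and all sizes are chosen for illustration) shows a block decomposition of a one-dimensional mesh: each process is assigned a contiguous range of mesh sites, with the remainder spread as evenly as possible.

    #include <stdio.h>

    /* Compute the half-open range [first, last) of mesh sites owned by
     * process `rank` when `n` sites are block-decomposed among `nprocs`
     * processes.  The first `n % nprocs` processes receive one extra
     * site so that the load is as even as possible. */
    static void block_range(int n, int nprocs, int rank,
                            int *first, int *last)
    {
        int base = n / nprocs;        /* minimum sites per process    */
        int rem  = n % nprocs;        /* leftover sites to distribute */
        *first = rank * base + (rank < rem ? rank : rem);
        *last  = *first + base + (rank < rem ? 1 : 0);
    }

    int main(void)
    {
        int first, last;
        /* Example: 10 sites decomposed among 4 processes. */
        for (int rank = 0; rank < 4; ++rank) {
            block_range(10, 4, rank, &first, &last);
            printf("rank %d owns sites [%d, %d)\n", rank, first, last);
        }
        return 0;
    }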
Updating a variable that is defined on a mesh site, or that belongs to a particle, generally requires data from neighbouring sites or particles. Typically, this neighbour region is small: it is given by the extent of a stencil or by the finite interaction range of the molecules (exceptions are gravitational and electrical forces, which have infinite range).
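For example, a 5-point stencil update on a two-dimensional mesh reads the four direct neighbours of each site. The following C sketch (array sizes and names are illustrative) shows one such Jacobi-style sweep; at the edges of a sub-domain the required neighbours live on other processes, which is exactly what makes halo regions necessary.

    #define NX 8
    #define NY 8

    /* One Jacobi-style sweep with a 5-point stencil: the new value at
     * (i, j) is the average of the four direct neighbours.  Only
     * interior sites are updated here. */
    static void sweep(const double u[NX][NY], double unew[NX][NY])
    {
        for (int i = 1; i < NX - 1; ++i)
            for (int j = 1; j < NY - 1; ++j)
                unew[i][j] = 0.25 * (u[i-1][j] + u[i+1][j]
                                   + u[i][j-1] + u[i][j+1]);
    }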
The essential point is that some neighbour regions extend beyond the sub-domain of their process, i.e. parts of a neighbour region are stored on remote processes. Those parts must be made available to the local process before updating can begin. Data from remote processes is stored locally in so-called halo regions. Halo regions hold copies of remote data that are needed to perform local operations. The corresponding communication operation is called halo exchange or exchange of boundary values ("exchange" because data typically has to move in both directions).
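A common way to implement a halo exchange on a one-dimensional decomposition is with MPI point-to-point communication. The sketch below assumes one layer of halo cells on each side; the function name halo_exchange and the array layout are assumptions for the example. MPI_PROC_NULL turns the transfers at the ends of the process row into no-ops, so no special-casing is needed there.

    #include <mpi.h>

    #define NLOC 100  /* interior sites owned by this process (example) */

    /* Exchange one layer of halo values with the left and right
     * neighbour in a 1-D decomposition.  The interior sites live in
     * u[1..NLOC]; u[0] and u[NLOC+1] are the halo cells filled with
     * copies of the neighbours' boundary values. */
    static void halo_exchange(double u[NLOC + 2], MPI_Comm comm)
    {
        int rank, nprocs;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &nprocs);

        int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
        int right = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

        /* Send my first interior value left, receive my right halo ... */
        MPI_Sendrecv(&u[1],        1, MPI_DOUBLE, left,  0,
                     &u[NLOC + 1], 1, MPI_DOUBLE, right, 0,
                     comm, MPI_STATUS_IGNORE);
        /* ... and send my last interior value right, receive my left halo. */
        MPI_Sendrecv(&u[NLOC],     1, MPI_DOUBLE, right, 1,
                     &u[0],        1, MPI_DOUBLE, left,  1,
                     comm, MPI_STATUS_IGNORE);
    }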
The fact that halo exchange is needed in parallel computer simulations has a performance impact that everybody who runs such simulations should know about. Halo exchange is one kind of parallel overhead. The amount of overhead is determined by the size and shape of the halo region, which is approximately the size of the surface of a sub-domain. This surface has to be related to the volume of the sub-domain (which is approximately proportional to the amount of work that needs to be performed): the relative overhead is approximately proportional to the surface to volume ratio of a sub-domain.
To understand this relative overhead quantitatively it is instructive to consider sub-domains that are $d$-dimensional cubes with linear extension $L$. The size of their surface is $2dL^{d-1}$ and the size of their volume is $L^d$. The surface to volume ratio, i.e. the relative overhead,

$$\frac{2dL^{d-1}}{L^d} = \frac{2d}{L},$$

is inversely proportional to $L$. In other words, this kind of overhead grows at the same rate at which sub-domains shrink. This effect limits (strong) scaling: if more processes are used, sub-domains become smaller and the overhead grows. At some point it becomes unacceptably large.
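As an illustrative calculation (numbers chosen for the example): consider a three-dimensional mesh of $512^3$ sites decomposed into cubic sub-domains on $P$ processes, so that $L = 512/P^{1/3}$ and the relative overhead is proportional to $6/L$. For $P = 8$ one gets $L = 256$ and $6/L \approx 0.023$; for $P = 512$, $L = 64$ and $6/L \approx 0.094$; for $P = 32768$, $L = 16$ and $6/L = 0.375$. Scaling from 8 to 32768 processes shrinks the sub-domain edge by a factor of 16 and grows the relative overhead by the same factor of 16.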
Besides the size of the volume, its shape also plays a role. In order to see this one can look at rectangular sub-domains. Let the dimensions of the rectangle be $a$ and $b$. Then the size of the volume of the rectangle is $ab$ and the size of its surface is $2(a+b)$. The overhead is proportional to

$$\frac{2(a+b)}{ab} = 2\left(\frac{1}{a} + \frac{1}{b}\right).$$

For a square with the same volume $ab$ the overhead is proportional to $4/\sqrt{ab}$. Writing the aspect ratio as $r = a/b \geq 1$, the rectangle's overhead exceeds the square's by the factor $\frac{1}{2}\left(\sqrt{r} + 1/\sqrt{r}\right)$; for $r = 1$ this factor is $1$, and for large values of $r$ it is approximately $\sqrt{r}/2$. Hence, the overhead of a rectangular shape can be up to a factor of $\sqrt{r}/2$ larger in comparison to a quadratic shape. The ideal shape is a sphere, because it has the smallest surface to volume ratio. Long, narrow sub-domains are disadvantageous.
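For a concrete comparison (numbers chosen for the example): a square sub-domain of $100 \times 100$ sites has surface $400$ and volume $10^4$, i.e. a ratio of $0.04$. A rectangle of $1000 \times 10$ sites has the same volume but surface $2020$ and a ratio of $\approx 0.2$, about five times larger, in agreement with $\frac{1}{2}\left(\sqrt{r} + 1/\sqrt{r}\right) = 5.05$ for $r = 100$.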
Optimizing surface to volume ratios becomes a nested process if the communication hardware is hierarchical. Today this is very often the case: compute nodes can have NUMA domains and the communication networks can be hierarchical. Good surface to volume ratios are then desirable at all hardware levels: at the process (core) level, at the NUMA domain level, at the node level, at the level of nodes connected to the same switch, and so on.
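MPI offers helpers that support choosing balanced decompositions. The following sketch (an illustration assuming a 3-D Cartesian decomposition, not a complete solution to hierarchy-aware placement) uses MPI_Dims_create to factor the process count into a process grid with dimensions as close to each other as possible, i.e. sub-domains that are as cubic as possible, and MPI_Cart_create with reorder enabled, which permits the MPI library to map neighbouring ranks onto nearby hardware.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int nprocs, rank;
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Let MPI factor nprocs into a balanced 3-D process grid
         * (good surface to volume ratio for the sub-domains). */
        int dims[3] = {0, 0, 0};
        MPI_Dims_create(nprocs, 3, dims);

        /* Create a Cartesian communicator; reorder = 1 allows the
         * library to place neighbouring ranks close together in the
         * hardware hierarchy. */
        int periods[3] = {0, 0, 0};
        MPI_Comm cart;
        MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &cart);

        if (rank == 0)
            printf("process grid: %d x %d x %d\n",
                   dims[0], dims[1], dims[2]);

        MPI_Comm_free(&cart);
        MPI_Finalize();
        return 0;
    }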