Hardware Architectures
GSWHC-B Getting Started with HPC Clusters → K1.2-B Hardware Architectures
Relevant for: Tester, Builder, and Developer
Description:
- You will learn about parallel computer architectures, in particular: the distinction between shared and distributed memory systems (basic level)
This skill requires no sub-skills
Level: basic
Parallel Computer Architectures
HPC computer architectures are parallel computer architectures. A parallel computer is built out of
- compute units,
- main memory,
- and a high speed network.
Compute units
Compute units can be:
Traditional CPUs. Although CPU stands for Central Processing Unit, there is no central, i.e. single, processing unit any more. Today, all CPUs have multiple compute cores, which all have the same functionality.
GPUs (Graphical Processing Units) or GPGPUs (General Purpose Graphical Processing Units). Originally, GPUs were used for image processing and for displaying images on screens. Then people started to utilize the compute power of GPUs for other purposes, and special GPUs were built for computing purposes only; some cannot even be connected to a display. In particular, GPGPUs can do double precision floating point arithmetic in hardware and can be equipped with ECC (Error-Correcting Code) memory.
Special devices like FPGAs (Field-Programmable Gate Arrays) or vector computing units. FPGAs are devices with configurable hardware. Configurations are specified in hardware description languages; in this respect, configuring FPGAs is closer to designing hardware than to conventional programming. FPGAs are interesting if one uses them to implement hardware features that are not available in CPUs or GPUs. A prominent example is low precision arithmetic that needs only a few bits. Vector units are successors of vector computers (the first generation of supercomputers). They are designed to provide higher memory bandwidth than CPUs.
Main-memory architecture
At an abstract level the high speed network connects compute units and main memory. This view leads to three main parallel computer architectures:
- shared memory
- distributed memory
- NUMA (Non-Uniform Memory Access)
Shared memory
In a shared memory system all compute units can directly access the whole main memory. In practice this means that a shared memory system is a single computer. Today this is the standard computer architecture: desktops, laptops, and even mobile phones have more than one compute core, and all cores share the same memory. Typically, those systems have a single CPU socket. This implies that memory access is symmetric, i.e. the time it takes to access any memory address is the same for all compute units. Such systems are called SMPs (Symmetric Multiprocessor systems). In the past, SMPs were built with multiple (single-core) CPU sockets, and some years ago two-socket (multi-core) compute nodes were still SMPs. Today NUMA (Non-Uniform Memory Access, see below) is the norm. The consequence of NUMA is that, to achieve the best performance, these computers must be used in such a way that most memory accesses go to the local NUMA domain (data locality).
The advantage of a shared memory system is that programming parallel applications generally requires less effort than programming for distributed memory systems. The disadvantage is that scaling is limited to the size of a single shared memory node.
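As an illustration (a minimal sketch, not taken from the original material), the following C program uses OpenMP to let several threads work on the same array: all threads run within one process and share one address space, which is exactly the property a shared memory system provides.

```c
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double a[N];   /* one array, visible to all threads */
    double sum = 0.0;

    /* OpenMP distributes the loop iterations among the threads;
       the reduction clause combines the per-thread partial sums. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = (double)i;
        sum += a[i];
    }

    printf("sum = %f (max threads: %d)\n", sum, omp_get_max_threads());
    return 0;
}
```

With GCC such a program is typically compiled with `gcc -fopenmp`; the number of threads can be controlled, for example, via the environment variable OMP_NUM_THREADS.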
Distributed memory
The prime example of a distributed memory system is a PC cluster: individual computers connected by a network. In a distributed memory system each compute node can directly access only its own memory. However, two nodes can exchange data via the network. In principle there is no difference between PC and HPC clusters, except that the nodes and the network of an HPC cluster are more powerful. Clusters are built out of individual computer boxes. In the past, high-end distributed memory systems were more highly integrated, i.e. several compute nodes were placed on a single backplane.
In general, the effort for programming parallel applications for distributed memory systems is higher than for shared memory systems. Distributed memory systems can be huge and make it possible to use hundreds of thousands of compute cores to run a single application. Sometimes this kind of processing is called massively parallel processing (MPP). However, a classic MPP system is more specialized than a cluster: the operating system is kept simpler (by employing micro kernels), and batch scripts typically do not run on the compute nodes themselves but on a host.
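To make the contrast with shared memory concrete, here is a minimal MPI sketch in C (an illustrative example, not part of the original text). Each process owns its own data; results are combined explicitly by sending messages over the network, because no process can read another process's memory directly.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = (double)rank;   /* data that lives only in this process's memory */
    double total = 0.0;

    /* Data is combined explicitly via messages; rank 0 receives the result. */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum over %d processes: %f\n", size, total);

    MPI_Finalize();
    return 0;
}
```

Such a program is started with a launcher like `mpirun -n 4 ./a.out`, and the processes may be distributed across several compute nodes of a cluster.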
NUMA
A NUMA (Non-Uniform Memory Access) system combines properties of shared and distributed memory systems. At the hardware level a NUMA system resembles a distributed memory system: it is built out of nodes that have their own memory. However, there is additional hardware that allows each node to directly access (read from and write to) the memory of other nodes. In principle there is a global address space that spans the memory of all nodes. Such systems can be programmed like shared memory systems, like distributed memory systems, or as a combination of both.
These days, big NUMA machines are not being built any more. However, NUMA has become a feature of compute nodes that have more than one CPU socket, and even a single socket with many cores can have NUMA properties. NUMA adds a complication to running parallel applications at the node level: as mentioned above, it is important to exploit data locality.
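One common way to exploit data locality on a NUMA node relies on the first-touch page placement that Linux uses by default: a memory page is placed in the NUMA domain of the core that first writes to it. The sketch below (an illustrative example with an assumed array size and loop schedule, not part of the original text) therefore initializes the array with the same parallel loop schedule as the compute loop, so that later accesses stay mostly in the local NUMA domain.

```c
#include <stdio.h>
#include <stdlib.h>

#define N 50000000L   /* assumed size for illustration, about 400 MB of doubles */

int main(void) {
    double *a = malloc(N * sizeof(double));
    if (a == NULL) return 1;

    /* First touch: each thread initializes the part of the array it will
       work on later, so the pages end up in that thread's NUMA domain. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 0.0;

    /* Same schedule as the initialization loop: accesses are mostly local. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 2.0 * (double)i;

    printf("a[N-1] = %f\n", a[N - 1]);
    free(a);
    return 0;
}
```

Tools like `numactl --hardware` show the NUMA domains of a node, and pinning threads to cores (e.g. with OMP_PROC_BIND=close) helps keep each thread on the cores whose local memory it initialized.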