Goals
A central goal is to raise users' awareness and knowledge of performance engineering, i.e., to help identify and quantify potential efficiency improvements in scientific parallel codes and in how those codes are used. Although the three Hamburg compute centers involved (DKRZ, RRZ, and TUHH RZ) are located close to each other, collaboration among their support staff has been limited in the past. As part of the project, this collaboration will be strengthened, in particular to coordinate performance engineering across Hamburg's institutions. As a result, users can reduce the runtimes of their jobs and the three compute centers can be operated more efficiently, which benefits both sides.
We will develop a model to approximate the costs of running jobs and embed it into the workload manager SLURM, which is used at DKRZ, RRZ, and TUHH RZ. In this way, users can be given feedback on the impact of running unoptimized workloads.
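To make the idea of a cost model more concrete, the following is a minimal sketch and not the model the project will actually develop. It assumes that a job's cost scales with its node-hours and that the avoidable share is proportional to the job's measured inefficiency. The rates, the efficiency input, and all function names are hypothetical, and the sketch does not use any SLURM interface.

```python
from dataclasses import dataclass


@dataclass
class CostRates:
    """Hypothetical per-node-hour rates; not taken from any compute center."""
    energy_eur_per_node_hour: float = 0.10
    hardware_eur_per_node_hour: float = 0.40


def job_cost(nodes: int, runtime_hours: float, rates: CostRates) -> float:
    """Approximate the total cost of a job from its node-hours."""
    node_hours = nodes * runtime_hours
    per_node_hour = rates.energy_eur_per_node_hour + rates.hardware_eur_per_node_hour
    return node_hours * per_node_hour


def avoidable_cost(nodes: int, runtime_hours: float,
                   efficiency: float, rates: CostRates) -> float:
    """Estimate the share of the cost attributable to inefficiency.

    `efficiency` is an assumed input in [0, 1], e.g. a measured
    parallel efficiency or resource utilization of the job.
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return job_cost(nodes, runtime_hours, rates) * (1.0 - efficiency)


def feedback_message(job_id: str, nodes: int, runtime_hours: float,
                     efficiency: float, rates: CostRates) -> str:
    """Format a short feedback line that could be shown to the user."""
    total = job_cost(nodes, runtime_hours, rates)
    wasted = avoidable_cost(nodes, runtime_hours, efficiency, rates)
    return (f"Job {job_id}: estimated cost {total:.2f} EUR, "
            f"of which {wasted:.2f} EUR may be avoidable.")


if __name__ == "__main__":
    print(feedback_message("12345", nodes=10, runtime_hours=2.0,
                           efficiency=0.25, rates=CostRates()))
```

In a real deployment, the inputs would presumably come from the workload manager's accounting data, and the rates would have to be supplied by each center.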
To reach these goals, we have established the Hamburg HPC Competence Center (HHCC). As part of the further development of HHCC, we plan to provide an analysis tool that identifies two classes of performance issues: 1) well-known configuration mistakes that lead to suboptimal performance and 2) likely performance issues. For well-known configuration mistakes, the tool will automatically send emails informing the user how to resolve the issue. For likely performance issues, it is harder to distinguish between applications whose optimization potential can be easily exploited and applications that are hard to parallelize because they are inherently sequential, such as some efficient tree search algorithms.
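One common way to illustrate why inherently sequential applications offer limited optimization potential is Amdahl's law, which bounds the achievable speedup by the fraction of work that cannot be parallelized. The text above does not name this law; the sketch below is offered only as an illustration, and the function name and example values are hypothetical.

```python
def amdahl_speedup(serial_fraction: float, workers: int) -> float:
    """Upper bound on speedup according to Amdahl's law.

    serial_fraction: share of the runtime that cannot be parallelized (0..1).
    workers: number of parallel workers (processes, cores, or nodes).
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)


if __name__ == "__main__":
    # Compare a mostly parallel code with a mostly sequential one.
    for fraction in (0.05, 0.90):
        for workers in (4, 64):
            print(f"serial fraction {fraction:.2f}, {workers:3d} workers: "
                  f"speedup <= {amdahl_speedup(fraction, workers):.2f}")
```

Under this model, a job with a large serial fraction gains little from additional resources, so low parallel efficiency alone does not show that an application can be easily optimized.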