Finding, analysing and solving MPI communication bottlenecks in Earth System models

2018 
Abstract
It is a matter of consensus that the ability to use current and future high-performance computing systems efficiently is crucial for science; however, the performance achieved today by most parallel scientific applications remains far below what is desired. Although inter-process communication has been the subject of many studies, their recommendations are rarely taken into account in computational model development, at least in the case of Earth Science. This work presents a methodology that helps scientists working with computational models that rely on inter-process communication to deal with the difficulties they face when trying to understand their applications' behaviour. By following the series of steps presented here, both users and developers will learn how to identify performance issues by characterizing application scalability, locating the parts that perform poorly and understanding the role that inter-process communication plays. The Nucleus for European Modelling of the Ocean (NEMO), the state-of-the-art European global ocean circulation model, is used as a success story. It is a community code widely used in Europe, to the extent that more than a hundred million core hours are spent every year on experiments involving NEMO. The analysis exercise shows how to answer where, why and by what the model's scalability is degraded, and how this information can help developers find solutions that mitigate those issues. This document also demonstrates how performance analysis carried out with small experiments, using limited resources, can lead to optimizations that benefit larger experiments running on thousands of cores, making the exascale challenge easier to address.
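Characterizing application scalability, the first step the abstract describes, is commonly done through strong-scaling speedup and parallel efficiency. As a minimal formulation (these are the standard textbook definitions, not equations taken from the paper):

    S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p}

where T(p) is the runtime on p processes. Regions of the run where E(p) falls sharply as p grows are the candidates for the communication analysis that follows.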
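As an illustration of how the communication cost of a code region can be isolated in practice, the sketch below times a halo-exchange-style MPI call on each rank and reduces the maximum across ranks, since the slowest rank bounds the time step. This is a generic instrumentation pattern under assumed conditions, not code from the paper or from NEMO; the message size N and the ring-neighbour partners are hypothetical stand-ins for a model's real halo partners.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int N = 1 << 20;               /* hypothetical message size */
        double *sendbuf = malloc(N * sizeof(double));
        double *recvbuf = malloc(N * sizeof(double));
        for (int i = 0; i < N; i++) sendbuf[i] = (double)rank;

        /* ring neighbours stand in for a model's halo partners */
        int right = (rank + 1) % size;
        int left  = (rank - 1 + size) % size;

        MPI_Barrier(MPI_COMM_WORLD);         /* align ranks so timings are comparable */
        double t0 = MPI_Wtime();
        MPI_Sendrecv(sendbuf, N, MPI_DOUBLE, right, 0,
                     recvbuf, N, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        double t_comm = MPI_Wtime() - t0;

        /* the slowest rank determines the cost of the exchange */
        double t_max;
        MPI_Reduce(&t_comm, &t_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("max communication time: %f s on %d ranks\n", t_max, size);

        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }

Repeating such measurements while increasing the process count is enough to see whether communication time grows relative to computation, which is the kind of evidence the methodology uses to decide where optimization effort should go.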