SYMBIOSYS: A Methodology for Performance Analysis of Composable HPC Data Services

2021 
Microservices are a powerful new way of building, customizing, and deploying distributed services owing to their flexibility and maintainability. Several large-scale distributed platforms have emerged to serve the growing needs of data-centric workloads and services in commercial computing. Concurrently, high-performance computing (HPC) systems and software are rapidly evolving to meet the demands of diversified applications and heterogeneity. The interplay of hardware factors, software configuration parameters, and the flexibility offered with a microservice architecture makes it nontrivial to estimate the optimal service instantiation for a given application workload. Further, this problem is exacerbated when considering that these services operate in a dynamic and heterogeneous HPC environment. An optimally integrated service can be vastly more performant than a haphazardly integrated one. Existing performance tools for HPC either fail to understand the request-response model of communication inherent to microservices or they operate within a narrow scope, limiting the insight that can be gleaned from employing them in isolation.

We propose a methodology for integrated performance analysis of HPC microservices frameworks and applications called SYMBIOSYS. We describe its design and implementation within the context of the Mochi framework. This integration is achieved by combining distributed callpath profiling and tracing with a performance data exchange strategy that collects fine-grained, low-level metrics from the RPC communication library and network layers. The result is a portable, low-overhead performance analysis setup that provides a holistic profile of the dependencies among microservices and how they interact with the Mochi RPC software stack. Using HEPnOS, a production-quality Mochi data service, we demonstrate the low-overhead operation of SYMBIOSYS at scale and use it to identify the root causes of poorly performing service configurations.