Automatic Placement of Tasks to NUMA Nodes in Iterative Applications

2020 
Manycore architectures with non-uniform memory access (NUMA) are commonly used for high-performance computing. On these systems, the placement of data and computation on the NUMA nodes has a very significant impact on performance, especially for memory-bound applications. This placement can usually be defined by the programmer, but it is generally desirable to automate it to simplify the programmer’s job, improve portability, and make the code more future-proof. Task-based runtime systems already assume a fair degree of responsibility for task placement, so it is only natural to involve them in mapping work and data to NUMA nodes. In this work, we propose a solution where the runtime system first performs a profiling run of the application and measures various performance characteristics. The data collected by the profiling run is then used by a stand-alone analyzer to create a plan for placing tasks on NUMA nodes so that the tasks are close to the data that they most rely on. This plan is then used by the runtime system to execute the application more efficiently. We focus on iterative applications, where the same patterns of tasks are repeated. We identify these patterns and use them to create a plan that works for any number of iterations, not just the number that was used when the application was observed. In our experiments, which were performed on modern manycore systems (Intel Skylake and AMD Zen) with 4 and 8 NUMA nodes, the proposed automated placement either matches or comes close to ($<10\%$) a hand-tuned placement.
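
The profile-then-plan idea described above can be illustrated with a minimal sketch. This is not the analyzer from the paper: the function names (`build_plan`, `node_for_task`), the profile format (per-task byte counts per NUMA node), and the assumption that every iteration consists of the same fixed number of tasks in the same order are all hypothetical simplifications chosen for illustration.

```python
from collections import defaultdict


def build_plan(profile, tasks_per_iteration):
    """Build a slot -> NUMA node plan from a profiling run.

    profile: list of (task_index, {numa_node: bytes_accessed}) records.
    tasks_per_iteration: assumed fixed number of tasks per iteration
    (a hypothetical simplification of pattern detection).
    """
    # Sum the bytes accessed on each node for every position ("slot")
    # within an iteration, across all profiled iterations.
    totals = defaultdict(lambda: defaultdict(int))
    for task_index, bytes_by_node in profile:
        slot = task_index % tasks_per_iteration
        for node, nbytes in bytes_by_node.items():
            totals[slot][node] += nbytes

    # Place each slot on the node holding most of its data;
    # ties are broken in favor of the lowest node id.
    plan = {}
    for slot, by_node in totals.items():
        plan[slot] = min(by_node, key=lambda n: (-by_node[n], n))
    return plan


def node_for_task(plan, task_index, tasks_per_iteration):
    """Look up the planned node for any task, including tasks in
    iterations beyond those seen during profiling."""
    return plan[task_index % tasks_per_iteration]


# Hypothetical profile: two iterations of two tasks on a 2-node system.
profile = [
    (0, {0: 800, 1: 200}),
    (1, {0: 100, 1: 900}),
    (2, {0: 700, 1: 300}),
    (3, {0: 50, 1: 950}),
]
plan = build_plan(profile, tasks_per_iteration=2)
```

Because the plan is keyed by a task's position within an iteration rather than by its absolute index, the same plan can be reused for runs with more iterations than were profiled, which is the property the abstract highlights.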