Formal modeling and performance evaluation of a run-time rank remapping technique in Broadcast, Allgather and Allreduce MPI collective operations

2017 
MPI collective operations are implemented using a variety of algorithms which define different communication patterns between the ranks involved in the operation. The performance of these algorithms in multi-core clusters highly depends on the mapping of the ranks to the system processors due to the uneven capabilities of shared memory and network channels. The hierarchical design of these algorithms contributes to use optimally the communication channels. Nevertheless, common hierarchical algorithms have shown themselves, for some collectives as allgather, inefficient and even impracticable. This paper analyzed the reasons for that and works out an alternate approach through performance modeling. Such approach, departing from the a priori knowledge of a regular mapping as round-robin or sequential, and keeping the original algorithm unmodified, switches at run time the rank to process mapping into another regular mapping that reduces network traffic. The methodology is evaluated with three collectives and their underlying algorithms, showing speedups of up to 5x in the Binomial Tree or 3x in Ring algorithms compared to unfavorable mappings.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    19
    References
    2
    Citations
    NaN
    KQI
    []