Efficient Parallel Sort on AVX-512-Based Multi-Core and Many-Core Architectures

2019 
Sorting kernels are a fundamental part of numerous applications. The performance of sorting implementations is usually limited by a variety of factors such as computing power, memory bandwidth, and branch mispredictions. In this paper we propose an efficient hybrid sorting method which takes advantage of wide vector registers and the high bandwidth memory of modern AVX-512-based multi-core and many-core processors. Our approach employs a combination of vectorized bitonic sorting and load-balanced multi-threaded merging. Thread-level and data-level parallelism are used to exploit both compute power and memory bandwidth. Our single-threaded implementation is ~30x faster than qsort in the C standard library and ~10x faster than C++'s std::sort. Compared with the Intel Performance Primitives (IPP) library which is one of the most efficient CPU-based radix sort implementation, we obtain a speedup of 1.3 to 2.6. Furthermore, we achieve a peak performance of sorting 1.14 billion floats per second on a Xeon Phi 7210 processor. Moreover, we show the extensibility of our vectorized kernels to processing units with a varying of vector lanes.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    20
    References
    3
    Citations
    NaN
    KQI
    []