Efficient Parallel Sort on AVX-512-Based Multi-Core and Many-Core Architectures

Zekun Yin,Tianyu Zhang,André Müller,Hui Liu,Yanjie Wei,Bertil Schmidt,Weiguo Liu

Efficient Parallel Sort on AVX-512-Based Multi-Core and Many-Core Architectures

2019

Sorting kernels are a fundamental part of numerous applications. The performance of sorting implementations is usually limited by a variety of factors such as computing power, memory bandwidth, and branch mispredictions. In this paper we propose an efficient hybrid sorting method which takes advantage of wide vector registers and the high bandwidth memory of modern AVX-512-based multi-core and many-core processors. Our approach employs a combination of vectorized bitonic sorting and load-balanced multi-threaded merging. Thread-level and data-level parallelism are used to exploit both compute power and memory bandwidth. Our single-threaded implementation is ~30x faster than qsort in the C standard library and ~10x faster than C++'s std::sort. Compared with the Intel Performance Primitives (IPP) library which is one of the most efficient CPU-based radix sort implementation, we obtain a speedup of 1.3 to 2.6. Furthermore, we achieve a peak performance of sorting 1.14 billion floats per second on a Xeon Phi 7210 processor. Moreover, we show the extensibility of our vectorized kernels to processing units with a varying of vector lanes.

Keywords:

Correction
Source
Cite
Save
Machine Reading By IdeaReader

References

Citations