Suffix Array Construction on Multi-GPU Systems

2019 
Suffix arrays are fundamental data structures underlying a wide range of applications, including bioinformatics, data compression, and information retrieval. Consequently, various algorithms for (parallel) suffix array construction on both CPUs and GPUs have been proposed over the years. Although they provide significant speedup over their CPU-based counterparts, existing GPU implementations share a common disadvantage: input text sizes are limited by the scarce memory of a single GPU. In this paper, we overcome these memory limitations by exploiting multi-GPU nodes featuring fast NVLink interconnects. To achieve high performance for this communication-intensive task, we design a parallel inter-GPU (re-)merging scheme. To handle segments spanning multiple GPUs, we propose an efficient strategy for the merging phase facilitated by a fast partitioning search. On 8 GPUs, our implementation achieves speedups between 133 and 354 over the sequential CPU-based libdivsufsort, and between 30 and 68 over its multi-threaded shared-memory version using 80 threads on 40 CPU cores, for large datasets ranging from 697M to 3159M characters in size. For medium-sized datasets between 104M and 236M characters, our approach yields maximum (minimum) speedups of 11.7 (4.5) and 6.45 (4.5) over the existing single-GPU implementations CUDPP and NVBIO, respectively. We construct the suffix array of a full human genome on a single DGX-1 server in 3.44 seconds, faster than the 4.8 seconds previously reported for 1600 cores on 100 nodes of a CPU-based HPC cluster. Our implementation is publicly available at https://gitlab.rlp.net/pararch/multi-gpu-suffix-array/.
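The abstract names two ingredients: suffix array construction and a merge phase driven by a fast partitioning search. The paper's implementation is a parallel CUDA code; as a rough sequential illustration only, the sketch below shows (a) classic prefix-doubling suffix array construction and (b) a merge-path partitioning search that locates, via binary search, the split point at which two sorted sequences contribute to a given output position. Function names and the sequential formulation are ours, not the authors'; this is not their inter-GPU scheme.

```python
def suffix_array(s):
    """Prefix-doubling suffix array construction, O(n log^2 n).

    Ranks suffixes by their first k characters, doubling k each round;
    a suffix's sort key is the pair (rank of first half, rank of second half).
    """
    n = len(s)
    sa = list(range(n))
    rank = [ord(c) for c in s]
    k = 1
    while k < n:
        key = lambda i: (rank[i], rank[i + k] if i + k < n else -1)
        sa.sort(key=key)
        new_rank = [0] * n
        for j in range(1, n):
            new_rank[sa[j]] = new_rank[sa[j - 1]] + (key(sa[j]) > key(sa[j - 1]))
        rank = new_rank
        if rank[sa[-1]] == n - 1:  # all ranks distinct: done early
            break
        k *= 2
    return sa


def merge_partition(a, b, d):
    """Partitioning search for merging: find (i, j) with i + j == d such that
    merging a[:i] with b[:j] yields the first d elements of merge(a, b).

    Each parallel worker can call this independently for its own diagonal d
    and then merge its slice without coordinating with other workers.
    """
    lo, hi = max(0, d - len(b)), min(d, len(a))
    while lo < hi:
        i = (lo + hi) // 2
        j = d - i
        if a[i] < b[j - 1]:   # too few elements taken from a
            lo = i + 1
        else:
            hi = i
    return lo, d - lo
```

For example, `suffix_array("banana")` returns `[5, 3, 1, 0, 4, 2]`, and `merge_partition([1, 3, 5, 7], [2, 4, 6, 8], 4)` returns `(2, 2)`: the first four merged elements are `1, 2, 3, 4`, i.e. two from each input. In a multi-GPU setting, such partition points let each device merge a disjoint output range in parallel.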