Optimizing Tensor Contractions in CCSD(T) for Efficient Execution on GPUs

2018 
Tensor contractions are higher dimensional analogs of matrix multiplications, used in many computational contexts such as high order models in quantum chemistry, deep learning, finite element methods etc. In contrast to the wide availability of high-performance libraries for matrix multiplication on GPUs, the same is not true for tensor contractions. In this paper, we address the optimization of a set of symmetrized tensor contractions that form the computational bottleneck in the CCSD(T) coupled-cluster method in computational chemistry suites like NWChem. Some of the challenges in optimizing tensor contractions that arise in practice from the variety of dimensionalities and shapes for tensors include effective mapping of the high-dimensional iteration space to threads, choice of data buffering in shared-memory and registers, and tile sizes for multi-level tiling. Furthermore, in the case of symmetrized tensor contractions in CCSD(T), it is also a challenge to fuse contractions to reduce data movement cost by exploiting reuse of intermediate tensors. In this paper, we develop an efficient GPU implementation of the tensor contractions in CCSD(T) using shared-memory buffering, register tiling, loop fusion and register transpose. Experimental results demonstrate significant improvement over the current state-of-the-art.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    34
    References
    9
    Citations
    NaN
    KQI
    []