Performance considerations for scalable parallel tensor decomposition

2017 
Abstract

Tensor decomposition, the higher-order analogue of singular value decomposition, has emerged as a useful tool for finding relationships in large, sparse, multidimensional data. As this technique matures and is applied to increasingly larger data sets, the need for high-performance implementations becomes critical. A better understanding of the performance characteristics of tensor decomposition on large, sparse tensors can help drive the development of such implementations. In this work, we perform an objective empirical evaluation of three state-of-the-art parallel tools that implement the CANDECOMP/PARAFAC tensor decomposition algorithm with alternating least squares fitting (CP-ALS): SPLATT, DFacTo, and ENSIGN. We conduct performance studies across a variety of data sets and evaluate the tools with respect to total memory required, processor stall cycles, execution time, data distribution, and communication patterns. Furthermore, we investigate the performance of the implementations on tensors with up to 6 dimensions and when executing high-rank decompositions. We find that tensor data structure layout and distribution choices can result in differences as large as 14.6x in memory usage and 39.17x in execution time. We provide an outline of a distributed heterogeneous CP-ALS implementation that addresses the performance issues we observe.
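To make the algorithm under study concrete, the following is a minimal dense-tensor sketch of CP-ALS in NumPy. It is an illustration of the general method only, not the implementation used by SPLATT, DFacTo, or ENSIGN (all of which operate on sparse tensors in parallel); function names and the fixed iteration count are choices made here for brevity.

```python
import numpy as np

def unfold(T, mode):
    # Mode-n unfolding: bring the given mode to the front and
    # flatten the remaining modes in C order.
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def cp_als(T, rank, iters=300, seed=0):
    """Rank-`rank` CP decomposition of a dense tensor via ALS.

    Returns one factor matrix per mode; the tensor is approximated
    by sum_r of the outer product of the r-th factor columns.
    """
    rng = np.random.default_rng(seed)
    ndim = T.ndim
    factors = [rng.standard_normal((s, rank)) for s in T.shape]
    for _ in range(iters):
        for n in range(ndim):
            others = [factors[m] for m in range(ndim) if m != n]
            # Khatri-Rao product of the other factors, chained so that
            # the last mode's index varies fastest (matches `unfold`).
            kr = others[0]
            for A in others[1:]:
                kr = (kr[:, None, :] * A[None, :, :]).reshape(-1, rank)
            # Gram matrix V = Hadamard product of the other factors' Grams,
            # so the update solves the least-squares normal equations.
            V = np.ones((rank, rank))
            for A in others:
                V *= A.T @ A
            factors[n] = unfold(T, n) @ kr @ np.linalg.pinv(V)
    return factors
```

The inner step is a matricized-tensor-times-Khatri-Rao product (MTTKRP) followed by a small rank-by-rank solve; the MTTKRP is the kernel whose sparse data layout and distribution the evaluated tools optimize.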