Enabling One-size-fits-all Compilation Optimization across Machine Learning Computers for Inference

2021 
Machine Learning Computers (MLCs) with tensor functional units (e.g., NVIDIA's Tensor Core, Google's TPU, and Habana's Tensor Processor Core) have emerged rapidly in recent years. The broad diversity of MLCs makes it hard to deploy machine learning workloads with optimized performance. Although deep learning compilers (e.g., TVM) are effective at producing optimized code for different hardware back-ends, deploying to a new MLC still requires tediously implementing platform-specific compilation optimizations, which in turn demands a thorough understanding of system and architectural details. To address this problem, we propose a holistic approach that achieves one-size-fits-all compilation optimization across different MLCs for inference. The key observation is that diverse MLCs share several key architectural characteristics for tensor processing, which can be generalized to conduct cross-platform compilation optimizations. Concretely, we propose the Tensor Abstract Machine (TAM), which captures these common architectural characteristics, as the abstraction of a broad range of MLCs. To leverage the architectural characteristics of the TAM, we propose the Tensor Scheduling Language (TSL), consisting of a tensor computation description and tensor scheduling primitives, for implementing operators with portable optimizations. Experimental results demonstrate that the code generated from the same optimization schedule achieves 1.05x to 2.05x better performance than hand-tuned libraries and deep learning compilers across different platforms.
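To make the idea concrete, below is a minimal sketch of how an operator might be described and scheduled in a TSL-like language, written as TVM-style Python. The `tsl` module and every name in it (`placeholder`, `compute`, `tile`, `bind_tensor_unit`, and so on) are assumptions for illustration only; the paper's actual primitives may differ.

# A minimal sketch of a TSL-style schedule, assuming a hypothetical `tsl`
# Python module; all names here are illustrative, not the paper's real API.
import tsl

M, N, K = 1024, 1024, 1024

# Tensor computation description: a matrix multiplication expressed as a
# reduction over the shared dimension k.
A = tsl.placeholder((M, K), name="A")
B = tsl.placeholder((K, N), name="B")
k = tsl.reduce_axis(K, name="k")
C = tsl.compute((M, N),
                lambda i, j: tsl.sum(A[i, k] * B[k, j], axis=k),
                name="C")

# Tensor scheduling primitives, written once against the TAM's common
# characteristics (on-chip tensor buffers plus a tensor functional unit)
# rather than against any single MLC.
s = tsl.Schedule(C)
io, jo, ii, ji = s.tile(C, factors=(64, 64))  # block the output for on-chip memory
s.cache_read(A, scope="tensor_buffer")        # stage operand tiles on chip
s.cache_read(B, scope="tensor_buffer")
s.bind_tensor_unit(C, ii, ji)                 # map the inner tile onto the tensor unit

# Lowering through the TAM abstraction would let platform-specific back-ends
# reuse this same schedule across different MLCs.
module = s.build(target="tam")

The design intent this sketch tries to convey is that the schedule references only TAM-level concepts (tiles, tensor buffers, a tensor unit), so a single schedule can be lowered to multiple MLC back-ends without per-platform rewriting.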