GPQ: Greedy Partial Quantization of Convolutional Neural Networks Inspired by Submodular Optimization

2020 
Recent work has revealed that the effect of neural network quantization on inference accuracy differs from layer to layer. Therefore, partial quantization and mixed-precision quantization have been studied for neural network accelerators with multi-precision designs. However, these quantization methods generally require network training that entails a high computational cost, or they exhibit a significant loss of inference accuracy. In this paper, we propose a greedy search algorithm for partial quantization that can derive optimal combinations of quantized layers; notably, the proposed method has a low computational complexity of O(N²), where N denotes the number of layers. The proposed Greedy Partial Quantization (GPQ) achieved 4.2× model size compression with only a 0.03% accuracy loss on ResNet50, and 2.5× compression with a 0.015% accuracy gain on Xception. The computational cost of GPQ is only 2.5 GPU-hours for 8-bit quantization of EfficientNet-B0 on ImageNet classification.
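The abstract only names the greedy O(N²) layer-selection search; the sketch below illustrates how such a greedy partial-quantization loop could be organized. It is not the authors' published implementation: the names `layers`, `evaluate_accuracy` (a callback returning validation accuracy for a given set of quantized layers), and `accuracy_budget` (the maximum tolerated accuracy drop) are hypothetical placeholders.

```python
def greedy_partial_quantization(layers, evaluate_accuracy, accuracy_budget):
    """Greedily grow the set of quantized layers, at each step picking the
    layer whose quantization hurts validation accuracy the least.
    With N layers this needs at most N + (N-1) + ... + 1 = O(N^2) evaluations.
    Hypothetical sketch, not the authors' code."""
    quantized = set()
    remaining = set(layers)
    baseline = evaluate_accuracy(frozenset())  # full-precision reference accuracy

    while remaining:
        # Try quantizing each remaining layer on top of the current selection.
        best_layer, best_acc = None, float("-inf")
        for layer in remaining:
            acc = evaluate_accuracy(frozenset(quantized | {layer}))
            if acc > best_acc:
                best_layer, best_acc = layer, acc

        # Stop once even the best candidate exceeds the allowed accuracy drop.
        if baseline - best_acc > accuracy_budget:
            break

        quantized.add(best_layer)
        remaining.remove(best_layer)

    return quantized
```

The outer loop runs at most N times and each iteration evaluates at most N candidate layers, which gives the quadratic evaluation count mentioned in the abstract; no retraining step is assumed between evaluations.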