Parallelizing Adam Optimizer with Blockwise Model-Update Filtering

2020 
Recently, Adam has become a popular stochastic optimization method in the deep learning field. To parallelize Adam in a distributed system, the synchronous stochastic gradient (SSG) technique is widely used, but it is inefficient due to heavy communication cost. In this paper, we attempt to parallelize Adam with blockwise model-update filtering (BMUF) instead. BMUF synchronizes model updates periodically and introduces a block momentum to improve performance. We propose a novel way to modify the estimated moment buffers of Adam and present a simple yet effective trick for hyper-parameter setting under the BMUF framework. Experimental results on a large-scale English optical character recognition (OCR) task and a large-vocabulary continuous speech recognition (LVCSR) task show that BMUF-Adam achieves an almost linear speedup without recognition accuracy degradation and outperforms the SSG-based method in terms of speedup, scalability, and recognition accuracy.
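For context, the sketch below illustrates the generic block-momentum synchronization step of BMUF as originally described by Chen and Huo (2016): each worker trains locally on a data block, the averaged model update is filtered with a block momentum, and the global model is advanced by the filtered update. The function name `bmuf_sync` and the parameter names `block_momentum` and `block_lr` are illustrative choices; this sketch does not show the paper's specific modification of Adam's moment buffers, which is not detailed in the abstract.

```python
import numpy as np

def bmuf_sync(global_params, worker_params, prev_delta,
              block_momentum=0.9, block_lr=1.0):
    """One BMUF synchronization step (classical block momentum).

    global_params : parameters W_{t-1} broadcast at the start of the block
    worker_params : list of per-worker parameter vectors after local training
    prev_delta    : accumulated block-level update Delta_{t-1}
    """
    # Aggregated model update from this data block:
    # G_t = (average of worker models) - previous global model
    g_t = np.mean(worker_params, axis=0) - global_params

    # Blockwise model-update filtering with block momentum eta
    # and block learning rate zeta:
    # Delta_t = eta * Delta_{t-1} + zeta * G_t
    delta = block_momentum * prev_delta + block_lr * g_t

    # New global model: W_t = W_{t-1} + Delta_t
    new_global = global_params + delta
    return new_global, delta

# Toy usage: 4 workers, each returning slightly different local models.
w_global = np.zeros(3)
delta = np.zeros(3)
for block in range(2):
    locals_ = [w_global + np.random.randn(3) * 0.01 for _ in range(4)]
    w_global, delta = bmuf_sync(w_global, locals_, delta)
```

Because synchronization happens only once per data block rather than once per mini-batch (as in SSG), communication cost is amortized over many local updates, which is the source of the speedup discussed in the abstract.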