A Diffusion Theory For Minima Selection: Stochastic Gradient Descent Escapes Sharp Minima Exponentially Fast

2020 
Stochastic Gradient Descent (SGD) and its variants are the mainstream methods for training deep networks in practice. SGD is known to find flat minima, which lie in large regions of parameter space where neighboring weight vectors all achieve similarly small error. However, a quantitative theory of how stochastic gradient noise produces this behavior is still lacking. In this paper, we focus on a fundamental question in deep learning: "How does deep learning select flat minima among so many minima?" To answer this question quantitatively, we develop a density diffusion theory (DDT) that reveals how minima selection depends on minima sharpness, gradient noise, and hyperparameters. One of the most interesting findings is that stochastic gradient noise in SGD accelerates escape from sharp minima exponentially in the top eigenvalues of the minima Hessians, whereas white noise accelerates escape only polynomially in the determinants of the minima Hessians. We also find that large-batch training requires exponentially many iterations, in terms of batch size, to escape sharp minima. We present direct empirical evidence supporting these theoretical results.
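To make the qualitative contrast between the two noise models concrete, the following Python toy is a minimal sketch of our own (not code from the paper): it simulates escape from a 1D quadratic basin of fixed depth, using fixed-scale isotropic noise as a stand-in for white-noise (Langevin-type) dynamics and curvature-scaled noise as a crude stand-in for minibatch gradient noise whose covariance tracks the local Hessian near a minimum. All constants are arbitrary illustrative assumptions.

```python
import numpy as np

# Toy 1D escape-time experiment (illustrative sketch, not the paper's code).
# The basin is 0.5 * a * theta^2, truncated at fixed depth `delta`, so its
# half-width is sqrt(2 * delta / a): larger curvature `a` means a sharper,
# narrower basin. Two noise models are compared:
#   isotropic       : fixed noise scale sigma (white-noise / Langevin-like)
#   curvature-scaled: noise std sigma * sqrt(a), crudely mimicking minibatch
#                     gradient noise whose covariance tracks local curvature.
# All constants (delta, eta, sigma) are arbitrary illustrative choices.

rng = np.random.default_rng(0)

def escape_iters(a, delta=0.5, eta=0.01, sigma=0.5, hessian_scaled=False,
                 max_iters=1_000_000):
    """Number of iterations until the iterate leaves the basin."""
    width = np.sqrt(2.0 * delta / a)      # basin half-width at depth delta
    noise_std = sigma * np.sqrt(a) if hessian_scaled else sigma
    theta = 0.0
    for t in range(1, max_iters + 1):
        grad = a * theta                  # gradient of the quadratic basin
        theta += -eta * grad + np.sqrt(eta) * noise_std * rng.standard_normal()
        if abs(theta) > width:
            return t
    return max_iters

for a in (1.0, 4.0, 16.0):                # increasing sharpness, same depth
    t_iso = np.mean([escape_iters(a) for _ in range(10)])
    t_hes = np.mean([escape_iters(a, hessian_scaled=True) for _ in range(10)])
    print(f"curvature a={a:5.1f}   isotropic noise: {t_iso:10.1f} iters"
          f"   curvature-scaled noise: {t_hes:10.1f} iters")
```

In this toy setup, sharper basins shorten escape times only mildly under isotropic noise (the barrier-to-temperature ratio stays fixed and only the relaxation timescale shrinks), while curvature-scaled noise raises the effective temperature with the curvature and so empties sharp basins dramatically faster, echoing the exponential-versus-polynomial contrast stated above.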