Authors: Zeyuan Allen-Zhu (Microsoft Research), David Simchi-Levi (MIT), Xinshang Wang (MIT)

Introduction:

Classically, the time complexity of a first-order method is estimated by its number of gradient computations. In this paper, the authors study a more refined complexity by taking into account the ``lingering'' of gradients: once a gradient is computed at \$x_k\$, the additional time to compute gradients at \$x_{k+1},x_{k+2},\dots\$ may be reduced. The authors show how this improves the running time of gradient descent and SVRG.

Abstract:

Classically, the time complexity of a first-order method is estimated by its number of gradient computations. In this paper, we study a more refined complexity by taking into account the ``lingering'' of gradients: once a gradient is computed at \$x_k\$, the additional time to compute gradients at \$x_{k+1},x_{k+2},\dots\$ may be reduced. We show how this improves the running time of gradient descent and SVRG. For instance, if the ``additional time'' scales linearly with respect to the traveled distance, then the ``convergence rate'' of gradient descent can be improved from \$1/T\$ to \$\exp(-T^{1/3})\$. On the empirical side, we solve a hypothetical revenue management problem on the Yahoo! Front Page Today Module application with 4.6 million users to \$10^{-6}\$ error (or \$10^{-12}\$ dual error) using 6 passes over the dataset.
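
To make the idea of lingering gradients concrete, the following is a minimal sketch and not the authors' algorithm. It assumes a toy objective chosen for illustration, \$f(x) = \frac{1}{n}\sum_i \max(0, a_i \cdot x - b_i)\$, whose component gradients are piecewise constant: the gradient of component \$i\$ cannot change until \$x\$ has traveled at least \$|a_i \cdot x - b_i| / \|a_i\|\$ from the point where it was last evaluated. The function names `naive_gd` and `lazy_gd`, the name "radius" for that distance, and the toy objective are all assumptions of this sketch.

```python
import math


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def naive_gd(A, b, x0, step, num_iters):
    """Plain gradient descent on f(x) = (1/n) * sum_i max(0, a_i.x - b_i)."""
    n, x = len(b), list(x0)
    for _ in range(num_iters):
        grad = [0.0] * len(x)
        for a_i, b_i in zip(A, b):
            if dot(a_i, x) - b_i > 0:  # component gradient is a_i, otherwise 0
                grad = [g + a for g, a in zip(grad, a_i)]
        x = [xj - step * g / n for xj, g in zip(x, grad)]
    return x, n * num_iters  # every component is re-evaluated at every step


def lazy_gd(A, b, x0, step, num_iters):
    """Same iteration, but component i is re-evaluated only once x has
    moved at least radius[i] away from where it was last evaluated."""
    n, x = len(b), list(x0)
    norms = [math.sqrt(dot(a_i, a_i)) for a_i in A]  # assumes no zero rows
    anchor = [list(x) for _ in range(n)]  # last evaluation point per component
    radius = [0.0] * n  # 0 forces an evaluation on the first step
    active = [False] * n  # cached state: gradient is a_i (True) or 0 (False)
    evaluations = 0
    for _ in range(num_iters):
        grad = [0.0] * len(x)
        for i in range(n):
            if math.dist(x, anchor[i]) >= radius[i]:
                margin = dot(A[i], x) - b[i]
                active[i] = margin > 0
                anchor[i] = list(x)
                # Distance from x to the kink a_i.y = b_i; inside this ball
                # the component gradient is guaranteed not to change.
                radius[i] = abs(margin) / norms[i]
                evaluations += 1
            if active[i]:
                grad = [g + a for g, a in zip(grad, A[i])]
        x = [xj - step * g / n for xj, g in zip(x, grad)]
    return x, evaluations
```

Both functions return the final iterate together with the number of component gradient evaluations, so the two counts can be compared on the same data. In this toy setting the distance check costs about as much as the component gradient itself, so the sketch only illustrates the counting argument; an actual speedup would need a cheaper way to find the components whose cached gradients have expired.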