Sparsified SGD With Memory

Authors:
Sebastian Stich (EPFL)
Jean-Baptiste Cordonnier (EPFL)
Martin Jaggi (EPFL)

Introduction:

Huge scale machine learning problems are nowadays tackled by distributed optimization algorithms, i.e. algorithms that leverage the compute power of many devices for training. The communication overhead is a key bottleneck that hinders perfect scalability. Various recent works proposed to use quantization or sparsification techniques to reduce the amount of data that needs to be communicated, for instance by only sending the most significant entries of the stochastic gradient (top-k sparsification).

Abstract:

Huge scale machine learning problems are nowadays tackled by distributed optimization algorithms, i.e. algorithms that leverage the compute power of many devices for training. The communication overhead is a key bottleneck that hinders perfect scalability. Various recent works proposed to use quantization or sparsification techniques to reduce the amount of data that needs to be communicated, for instance by only sending the most significant entries of the stochastic gradient (top-k sparsification). Whilst such schemes showed very promising performance in practice, they have eluded theoretical analysis so far.
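To make the top-k sparsification with memory idea concrete, here is a minimal sketch (not the authors' reference implementation; function names such as top_k and sparsified_sgd_step are illustrative) of an SGD step that communicates only the k largest-magnitude entries of the update and keeps the discarded remainder in a local memory for later steps:

import numpy as np

def top_k(v, k):
    # Keep the k largest-magnitude entries of v, zero out the rest.
    out = np.zeros_like(v)
    if k >= v.size:
        return v.copy()
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def sparsified_sgd_step(x, grad, memory, lr, k):
    # Add back previously unsent gradient mass before sparsifying.
    acc = memory + lr * grad
    # Only the top-k entries of the accumulated update are applied/communicated.
    update = top_k(acc, k)
    # The dropped remainder is stored locally (the "memory") instead of being lost.
    memory = acc - update
    x = x - update
    return x, memory

The memory (error-feedback) term is what distinguishes this from plain top-k sparsification: entries that are too small to be sent in one iteration are not discarded but accumulated and eventually transmitted in a later step.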
