Abstract

Distributed data-parallel optimization accelerates the training of neural networks, but requires constant synchronization of gradients between the workers, which can become a bottleneck. One way to reduce communication overhead is to use Local SGD, where each worker asynchronously takes multiple local gradient steps, after which the model weights are averaged. In this work, we discuss the choice of learning rate for Local SGD, showing that it faces an intricate trade-off. Unlike in the synchronous case, its gradient estimate is biased, with the bias dependent on the learning rate itself. Thus using learning rate scaling techniques designed for faster convergence in the synchronous case with Local SGD results in a performance degradation as previously observed. To analyze the manifestation of this bias, we study convergence behaviour of Local SGD and synchronous data-parallel SGD when using their optimal learning rates. Our experiments show that the optimal learning rate for Local SGD differs substantially from that of SGD, and when using it the performance of Local SGD matches that of SGD. However, this performance comes at the cost of added training iterations, rendering Local SGD faster than SGD only when communication is much more time-consuming than computation. This suggests that Local SGD may be of limited practical utility.

On the Choice of Learning Rate for Local SGD

Lukas Balles · Prabhu Teja S · Cedric Archambeau

Video

Paper PDF

Abstract