Load balancing is a fundamental task in most web services. For example, when a user clicks on a website, this requires locating a particular server, or cache, and then fetching the content. If the distribution of content is poor, e.g. in a video streaming service many popular videos are stored on the same server, this can lead to increased latency as users all try to fetch the content from the same server and, in the worst case, the server can crash. In another example, in order to retrieve an item from a database, it needs to be efficiently and conveniently stored to minimize both space and time requirements. In both cases, the challenge lies in efficiently storing and retrieving items in a distributed setting.
One long-standing issue preventing the complete adoption of adaptive optimization methods for training neural networks is the generalization gap between adaptive optimization methods and stochastic gradient descent with momentum (SGDM). Using adaptive methods, such as Adaptive Moment Estimation (Adam) which I will introduce later, with default values or with slight tuning usually yields good performance. This may sometimes be good enough, and at the very least serves as a useful baseline. However, the performance of adaptive methods on validation or test sets especially with regards to image data is often somewhat worse than can be achieved with SGDM. This is clearly reflected in the top papers pushing the state-of-the-art image classification error for imagenet, where SGDM is almost universally used. An important objective is to close the generalization gap between adaptive methods and SGDM, thus reducing the amount of effort spent tuning, capturing the typical quick initial learning of adaptive methods and at the same time squeezing the most performance out of the architecture in question.