One long-standing issue preventing the full adoption of adaptive optimization methods for training neural networks is the generalization gap between adaptive optimization methods and stochastic gradient descent with momentum (SGDM). Using adaptive methods, such as Adaptive Moment Estimation (Adam), which I will introduce later, with default values or only slight tuning usually yields good performance. This may sometimes be good enough, and at the very least it serves as a useful baseline. However, the performance of adaptive methods on validation or test sets, especially on image data, is often somewhat worse than what can be achieved with SGDM. This is clearly reflected in the top papers pushing the state-of-the-art image classification error on ImageNet, where SGDM is almost universally used. An important objective is therefore to close the generalization gap between adaptive methods and SGDM, reducing the effort spent tuning, retaining the rapid initial learning typical of adaptive methods, and at the same time squeezing the most performance out of the architecture in question.
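To make the practical contrast concrete, here is a minimal sketch of the two setups in PyTorch; the toy model and the SGDM hyperparameter values are assumptions chosen purely for illustration, not values taken from any particular paper.

```python
import torch
import torch.nn as nn

# Toy stand-in for an arbitrary network; illustrative only.
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))

# Adam with its library defaults (lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
# often a reasonable baseline with little or no tuning.
adam = torch.optim.Adam(model.parameters())

# SGDM: the learning rate (and typically its schedule) has to be tuned
# per task, but on image benchmarks the tuned result often generalizes
# somewhat better than Adam's.
sgdm = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
```

The asymmetry in the two constructor calls is the point: Adam is commonly run close to its defaults, while extracting SGDM's better final accuracy usually costs a hyperparameter search.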