Optimizer¶
Training a neural network is in essence an optimization problem. Through forward computation and back-propagation, an Optimizer uses the back-propagated gradients to optimize the parameters of the network.
1.SGD/SGDOptimizer¶
SGD is a subclass of Optimizer implementing Stochastic Gradient Descent, a variant of Gradient Descent. When a large number of samples need to be trained, SGD is usually chosen to make the loss function converge more quickly.
API Reference: api_fluid_optimizer_SGDOptimizer
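A minimal sketch of how SGD would typically be wired into a Fluid program; the toy regression network below (a single fc layer on a 13-dimensional input) and the learning rate are purely illustrative.

```python
import paddle.fluid as fluid

# Illustrative toy regression network; shapes and layer choice are arbitrary.
x = fluid.data(name='x', shape=[None, 13], dtype='float32')
y = fluid.data(name='y', shape=[None, 1], dtype='float32')
pred = fluid.layers.fc(input=x, size=1)
loss = fluid.layers.mean(fluid.layers.square_error_cost(input=pred, label=y))

# SGD only needs a learning rate; minimize() appends the backward pass and
# the parameter-update ops to the default program.
sgd = fluid.optimizer.SGD(learning_rate=0.01)
sgd.minimize(loss)
```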
2.Momentum/MomentumOptimizer¶
The Momentum optimizer adds momentum on top of SGD, reducing the noise encountered during stochastic gradient descent. You can set use_nesterov to False or True, corresponding respectively to the traditional Momentum algorithm (Section 4.1 in the paper) and the Nesterov accelerated gradient algorithm (Section 4.2 in the paper).
API Reference: api_fluid_optimizer_MomentumOptimizer
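A minimal construction sketch (the hyperparameter values are illustrative); the use_nesterov flag switches between the two variants described above.

```python
import paddle.fluid as fluid

# Classic heavy-ball momentum; set use_nesterov=True for the Nesterov variant.
momentum = fluid.optimizer.Momentum(learning_rate=0.01,
                                    momentum=0.9,
                                    use_nesterov=False)
# momentum.minimize(loss) is then called on the network's mean loss,
# as in the SGD example above.
```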
3. Adagrad/AdagradOptimizer¶
The Adagrad optimizer adaptively assigns different learning rates to different parameters, addressing the problem that the number of samples contributing to each parameter can be very uneven.
API Reference: api_fluid_optimizer_AdagradOptimizer
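A minimal construction sketch (hyperparameter values are illustrative), showing how the per-parameter adaptation comes about.

```python
import paddle.fluid as fluid

# Adagrad accumulates the squared gradients of each parameter and divides the
# learning rate by the square root of that accumulator, so frequently updated
# parameters receive smaller steps. epsilon guards against division by zero.
adagrad = fluid.optimizer.Adagrad(learning_rate=0.2, epsilon=1e-6)
# adagrad.minimize(loss) as in the SGD example above.
```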
4.RMSPropOptimizer¶
The RMSProp optimizer adaptively adjusts the learning rate. It mainly addresses the sharp decay of the learning rate in the middle and late stages of training that occurs when Adagrad is used.
API Reference: api_fluid_optimizer_RMSPropOptimizer
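A minimal construction sketch (hyperparameter values are illustrative) of how RMSProp avoids Adagrad's late-stage learning-rate collapse.

```python
import paddle.fluid as fluid

# rho is the decay rate of the moving average of squared gradients; because
# the average is exponentially decayed rather than summed, the effective
# learning rate does not collapse late in training as plain Adagrad's does.
rmsprop = fluid.optimizer.RMSProp(learning_rate=0.01, rho=0.95, epsilon=1e-6)
```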
5.Adam/AdamOptimizer¶
The Adam optimizer adaptively adjusts the learning rate and suits most non-convex optimization problems, large data sets, and high-dimensional scenarios. Adam is the most commonly used optimization algorithm.
API Reference: api_fluid_optimizer_AdamOptimizer
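A minimal construction sketch; the values below are the commonly used defaults and are given only for illustration.

```python
import paddle.fluid as fluid

# beta1/beta2 control the exponential decay of the first- and second-moment
# estimates of the gradient, which drive the per-parameter learning rates.
adam = fluid.optimizer.Adam(learning_rate=0.001,
                            beta1=0.9,
                            beta2=0.999,
                            epsilon=1e-8)
```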
6.Adamax/AdamaxOptimizer¶
Adamax is a variant of the Adam algorithm that imposes a simpler bound on the learning rate, in particular its upper bound.
API Reference: api_fluid_optimizer_AdamaxOptimizer
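A minimal construction sketch (hyperparameter values are illustrative).

```python
import paddle.fluid as fluid

# Adamax replaces Adam's second-moment estimate with an infinity-norm
# estimate, which yields a simpler bound on the per-parameter step size.
adamax = fluid.optimizer.Adamax(learning_rate=0.002, beta1=0.9, beta2=0.999)
```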
7.DecayedAdagrad/DecayedAdagradOptimizer¶
The DecayedAdagrad optimizer can be regarded as the Adagrad algorithm combined with a decay rate, which addresses the sharp decay of the learning rate in the middle and late stages of training.
API Reference: api_fluid_optimizer_DecayedAdagrad
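A minimal construction sketch (hyperparameter values are illustrative) of how the decay rate changes Adagrad's behavior.

```python
import paddle.fluid as fluid

# decay exponentially discounts the accumulated squared gradients, so the
# denominator stops growing without bound and the learning rate no longer
# shrinks toward zero as it can with plain Adagrad.
decayed_adagrad = fluid.optimizer.DecayedAdagrad(learning_rate=0.2,
                                                 decay=0.95,
                                                 epsilon=1e-6)
```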
8. Ftrl/FtrlOptimizer¶
The FtrlOptimizer combines the high accuracy of the FOBOS algorithm with the sparsity of the RDA algorithm, and is an online learning algorithm that works remarkably well in practice.
API Reference: api_fluid_optimizer_FtrlOptimizer
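A minimal construction sketch (the regularization strengths are illustrative).

```python
import paddle.fluid as fluid

# l1 controls the sparsity of the learned weights, l2 the usual shrinkage;
# lr_power sets how the per-coordinate learning rate decays with the number
# of updates.
ftrl = fluid.optimizer.Ftrl(learning_rate=0.1, l1=0.01, l2=0.01, lr_power=-0.5)
```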
9.ModelAverage¶
The ModelAverage optimizer accumulates historical parameters over a sliding window during training. The averaged parameters are used at inference time to improve the overall accuracy of inference.
API Reference: api_fluid_optimizer_ModelAverage
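A minimal usage sketch, assuming the window settings below (purely illustrative) and an evaluation Executor named `exe`.

```python
import paddle.fluid as fluid

# Built alongside the regular optimizer during training; the window arguments
# bound how much parameter history is averaged.
model_average = fluid.optimizer.ModelAverage(average_window_rate=0.15,
                                             min_average_window=10000,
                                             max_average_window=20000)

# At inference time, apply() temporarily swaps in the averaged parameters and
# restores the originals on exit (exe is the Executor used for evaluation):
# with model_average.apply(exe):
#     ...  # run inference with the averaged parameters
```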
10.Rprop/RpropOptimizer¶
The Rprop optimizer starts from the observation that the gradient magnitudes of different weight parameters can differ greatly, which makes it hard to choose a single global step size. It therefore accelerates optimization by maintaining a per-parameter step size that is adjusted dynamically using only the sign of each parameter's gradient.
API Reference: api_fluid_optimizer_Rprop
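The sign-based rule can be sketched independently of the Paddle API; the NumPy fragment below is only an algorithmic illustration of a simplified Rprop step (without weight backtracking), with typical textbook values for the growth/shrink factors and step-size bounds.

```python
import numpy as np

def rprop_step(w, grad, prev_grad, step, eta_minus=0.5, eta_plus=1.2,
               step_min=1e-6, step_max=50.0):
    """One simplified Rprop update: only the sign of the gradient is used."""
    sign_change = np.sign(grad) * np.sign(prev_grad)
    # Grow the per-parameter step where the gradient kept its sign,
    # shrink it where the sign flipped (i.e. a minimum was overshot).
    step = np.where(sign_change > 0, step * eta_plus, step)
    step = np.where(sign_change < 0, step * eta_minus, step)
    step = np.clip(step, step_min, step_max)
    w = w - np.sign(grad) * step
    return w, step
```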
11.ASGD/ASGDOptimizer¶
The ASGD optimizer is a space-for-time variant of SGD that performs stochastic optimization with trajectory averaging. On top of SGD, ASGD additionally maintains the average of the historical parameters, so that the variance of the noise in the descent direction keeps decreasing and the algorithm eventually converges to the optimum at a linear rate.
API Reference: api_fluid_optimizer_ASGD
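A NumPy sketch of the averaging idea, given only as an algorithmic illustration rather than the Paddle API: an ordinary SGD step is taken, and a running mean of the parameter trajectory is kept on the side.

```python
import numpy as np

def asgd_step(w, w_avg, grad, lr, t):
    """One averaged-SGD update at iteration t (0-based)."""
    w = w - lr * grad                      # plain SGD step
    w_avg = w_avg + (w - w_avg) / (t + 1)  # running mean of all iterates
    return w, w_avg

# At the end of training, w_avg (not w) is taken as the final parameters.
```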