An Overview of Optimization Algorithms

In Machine Learning, finding the best numerical solution to a given problem is important. This blog will lead you through the machine learning optimization process, including how loss functions fit into the equation and several popular techniques. 

Optimization is concerned with finding the set of inputs that minimizes (or maximizes) the target objective function. Knowing how the value of a function changes as its input is varied is useful because it tells us in which direction we can move to improve on previous points. The change in the value of the function is measured by the derivative in one dimension and the gradient in multiple dimensions.


The derivative f ′ ( x ) of a function f of a single variable x is the rate at which the value of f changes at x. The value of the derivative equals the slope of the tangent line. 

Derivatives are useful in optimization because they provide information about how to change a given point in order to improve the objective function. 


Image Source: Derivative 


A gradient is a derivative of a function that has multiple dimensions. It captures the local slope of the function, allowing us to predict the effect of taking a small step from a point in any direction. The gradient points in the direction of the steepest ascent of the tangent hyperplane. A hyperplane in an n-dimensional space is the set of points that satisfies 

w 1 x 1 + · · · + w n x n = b


Image source: Gradient 

First-order methods 

First-order methods rely on gradient information to help direct the search for an optimum (minimum) value.

Gradient Descent 

Gradient Descent is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function. (a differentiable function is a function whose derivative exists at each point in its domain). The idea is to take repeated steps in the opposite direction of the gradient (or approximate gradient) of the function at the current point because this is the direction of steepest descent.

Gradient Descent

Image source: Gradient Descent 

Jagged search paths result if we choose a step size that leads to the maximal decrease in function. In fact, the next direction will always be orthogonal to the current direction.

Conjugate Gradient 

Gradient descent can perform poorly in narrow valleys. The conjugate gradient method overcomes this issue by borrowing inspiration from methods for optimizing quadratic functions. The conjugate gradient method can also be used to solve unconstrained optimization problems such as energy minimization.

Conjugate Gradient

Image source: Conjugate Gradient 


Gradient descent will take a long time to traverse a nearly flat surface allowing momentum to accumulate is one way to speed progress. We can modify gradient descent to incorporate momentum. 

Momentum can be interpreted as a ball rolling down a nearly horizontal incline. The ball naturally gathers momentum as gravity causes it to accelerate, just as the gradient causes momentum to accumulate in this descent method.


Image source: Momentum 

Nesterov Momentum

One issue of momentum is that the steps do not slow down enough at the bottom of a valley and tend to overshoot the valley floor. Nesterov momentum modifies the momentum algorithm to use the gradient at the projected future position.

Nesterov Momentum

Image source: Nesterov Momentum 


The Adaptive Gradient Technique (Adagrad) is a gradient-based optimization algorithm. By incorporating knowledge from previous observations, the learning rate is component-wise adapted to the parameters. The learning rate setting has a much lower impact on Adagrad. The learning rate option is usually set to 0.01 by default. The fact that the components of s are all strictly nondecreasing is Adagrad’s primary problem. The accumulated sum causes the effective learning rate to decrease during training, often becoming infinitesimally small before convergence.


RMSprop and Adadelta have both been developed independently around the same time stemming from the need to resolve Adagrad’s radically diminishing learning rates. RMSprop in fact is identical to the first update vector of Adadelta. RMSProp maintains a decaying average of squared gradients.


Adadelta is another method for overcoming Adagrad’s monotonically decreasing learning rate. After independently deriving the RMSProp update, the authors noticed that the units in the update equations for gradient descent, momentum, and Adagrad do not match. To fix this, they use an exponentially decaying average of the square updates which eliminates the learning rate parameter entirely.


The adaptive moment estimation method, or Adam,  also adapts learning rates to each parameter. It stores both an exponentially decaying squared gradient-like RMSProp and Adadelta, but also an exponentially decaying gradient-like momentum. Initializing the gradient and squared gradient to zero introduces a bias. A bias correction step helps alleviate the issue.

Hypergradient Descent 

The accelerated descent methods are either extremely sensitive to the learning rate or go to great lengths to adapt the learning rate during execution. The learning rate dictates how sensitive the method is to the gradient signal. A rate that is too high or too low often drastically affects performance. 

Hypergradient descent was developed with the understanding that the derivative of the learning rate should be useful for improving optimizer performance. A hyper gradient is a derivative taken with respect to a hyperparameter. Hypergradient algorithms reduce the sensitivity to the hyperparameter, allowing it to adapt more quickly. Hypergradient descent applies gradient descent to the learning rate of an underlying descent method. The method requires the partial derivative of the objective function with respect to the learning rate. 

Second-Order Methods  

Newton’s Method  

Knowing the function value and gradient for a design point can help determine the direction to travel, but this first-order information does not directly help determine how far to step to reach a local minimum. Second-order information, on the other hand, allows us to make a quadratic approximation of the objective function and approximate the right step size to reach a local minimum. 

Newton’s method is a root-finding method that leverages second-order information to quickly descend to a local minimum.

Newton's Method

Image Source:  Algorithms for Optimization 

A comparison of first-order and second-order approximations. Bowl-shaped quadratic approximations have unique locations where the derivative is zero.

Secant Method  

Newton’s method for univariate function minimization requires the first and second derivatives f ′ and f ′′. In many cases, f ′ is known but the second derivative is not. The secant method (algorithm 6.2) applies Newton’s method using estimates of the second derivative and thus only requires f ′. This property makes the secant method more convenient to use in practice. 

The secant method and quasi-Newton methods approximate Newton’s method when the second-order information is not directly available.