Machine Learning: Gradient Descent I

Welcome back! In the previous post, I talked about univariate linear regression, and defined the cost function as:

\begin{aligned} J( \theta_0 , \theta_1) &= \frac 1 {2m} \sum_{i=1}^m (h_\theta(x^{(i)})~-~y^{(i)})^2 \\ where~h_\theta(x) &= \theta_0~+~\theta_1x \end{aligned}

Here m denotes the number of training examples, (x(i), y(i)) denotes the i’th training example, and θ0 and θ1 are the parameters of the model. As we discussed earlier, we need to find the values of θ0 and θ1 so that J is minimized. But how do we actually do this? One algorithm that serves the purpose is gradient descent. The idea behind gradient descent is:

  • Start with some initial θ0 and θ1 (say, θ0 = θ1 = 0)
  • Repeatedly change θ0 and θ1 so as to decrease J(θ0, θ1) gradually, until we reach a minimum
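To make the cost function above concrete before diving into the algorithm, here is a minimal Python sketch of it (the names h, J, xs and ys are mine, not part of the original definition):

```python
# A minimal sketch of the cost function defined above. The training set is assumed
# to be given as two equal-length lists, xs and ys, holding the x^(i) and y^(i) values.

def h(theta0, theta1, x):
    """Hypothesis: h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

def J(theta0, theta1, xs, ys):
    """Cost: J(theta0, theta1) = (1 / 2m) * sum over i of (h_theta(x^(i)) - y^(i))^2."""
    m = len(xs)
    return sum((h(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# For example, for the points (1, 1), (2, 2), (3, 3), the line with theta0 = 0 and
# theta1 = 1 fits the data perfectly, so J(0, 1, [1, 2, 3], [1, 2, 3]) returns 0.0.
```

Gradient descent is simply a procedure for repeatedly nudging θ0 and θ1 so that this number keeps going down.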

Mathematical form

The mathematical form of the gradient descent algorithm is:

\begin{aligned} re&peat~until~convergence~\{ \\ &\theta_j := \theta_j~-~\alpha \frac \partial {\partial\theta_j} J(\theta_0,\theta_1) \\ &(j = 0,1) \\ \}~~&~ \end{aligned}

Let’s walk through this slowly. The := symbol denotes an assignment (a := b means ‘variable a is assigned the value b’) rather than an equality test (a = b means ‘are the values of a and b equal?’). The term:

\begin{aligned} \frac \partial {\partial\theta_j} J(\theta_0,\theta_1) \end{aligned}

is called a partial derivative. These terms (for j = 0 and j = 1) measure the ‘slope’ of the surface in a 3D plot of the function J versus θ0 and θ1 at a particular point, taken along the θ0 and θ1 directions respectively (such a plot might look like the one in the figure below). (More on this later.) The parameter α (alpha) is called the learning rate. It controls how big a step θ0 and θ1 take at every iteration, and is always positive.
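The update rule translates into code almost line by line. Here is a minimal sketch of my own (names like gradient_descent, alpha and tol are made up, not from the post); purely to make the ‘slope’ idea concrete, the partial derivatives are approximated numerically with finite differences rather than derived analytically, since the analytical derivatives for the linear regression cost belong to the next post:

```python
# A sketch of the update rule, assuming a generic cost function J(theta0, theta1).
# The partial derivatives are approximated with central finite differences here,
# just to make the 'slope' interpretation concrete.

def partial_derivative(J, theta0, theta1, j, eps=1e-6):
    """Numerically approximate dJ/dtheta_j at the point (theta0, theta1)."""
    if j == 0:
        return (J(theta0 + eps, theta1) - J(theta0 - eps, theta1)) / (2 * eps)
    return (J(theta0, theta1 + eps) - J(theta0, theta1 - eps)) / (2 * eps)

def gradient_descent(J, alpha=0.1, tol=1e-9, max_iters=10_000):
    theta0, theta1 = 0.0, 0.0                  # start with theta0 = theta1 = 0
    for _ in range(max_iters):                 # "repeat until convergence"
        grad0 = partial_derivative(J, theta0, theta1, j=0)
        grad1 = partial_derivative(J, theta0, theta1, j=1)
        new0 = theta0 - alpha * grad0          # theta_j := theta_j - alpha * dJ/dtheta_j
        new1 = theta1 - alpha * grad1
        if abs(new0 - theta0) < tol and abs(new1 - theta1) < tol:
            break                              # the parameters have stopped moving
        theta0, theta1 = new0, new1
    return theta0, theta1

# Example: a simple bowl-shaped cost with its minimum at (1, -2).
# gradient_descent(lambda t0, t1: (t0 - 1) ** 2 + (t1 + 2) ** 2) returns roughly (1.0, -2.0).
```

Note that this loop computes both gradients before overwriting either parameter, which is exactly the point of the next section.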

Cost function plot
(Source: Andrew Ng’s “Machine Learning” course on Coursera)

Simultaneous Updates

Another important point is that in the above algorithm, θ0 and θ1 must be updated simultaneously. To see what this means, consider the two pseudocode fragments:

\begin{aligned} Fragm&ent~1: \\ t0 :&= \theta_0~-~\alpha \frac \partial {\partial\theta_0} J(\theta_0,\theta_1)\\ t1 :&= \theta_1~-~\alpha \frac \partial {\partial\theta_1} J(\theta_0,\theta_1)\\ \theta_0 :&= t0 \\ \theta_1 :&= t1 \\ Fragm&ent~2: \\ t0 :&= \theta_0~-~\alpha \frac \partial {\partial\theta_0} J(\theta_0,\theta_1)\\ \theta_0 :&= t0 \\ t1 :&= \theta_1~-~\alpha \frac \partial {\partial\theta_1} J(\theta_0,\theta_1)\\ \theta_1 :&= t1 \\ \end{aligned}

Fragment 1 updates θ0 and θ1 simultaneously: the original values of θ0 and θ1 are used to calculate both partial derivative terms, and only then are the parameters overwritten. Fragment 2, however, is not equivalent to fragment 1. Why? In fragment 2, θ0 is updated first, and this updated value is then used to evaluate the partial derivative in the third line. This is not what we want; the original values of θ0 and θ1 must be used to compute both partial derivatives, which is exactly what a simultaneous update guarantees.
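In code, the difference between the two fragments is easy to see. Below is a rough Python rendering (a sketch only; dJ_dtheta0 and dJ_dtheta1 are hypothetical helpers that return the two partial derivatives of J at the given point):

```python
def update_simultaneous(theta0, theta1, alpha, dJ_dtheta0, dJ_dtheta1):
    # Fragment 1: both derivatives are evaluated at the ORIGINAL (theta0, theta1).
    t0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
    t1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
    theta0 = t0
    theta1 = t1
    return theta0, theta1

def update_sequential(theta0, theta1, alpha, dJ_dtheta0, dJ_dtheta1):
    # Fragment 2: theta0 is overwritten before the second derivative is computed,
    # so dJ_dtheta1 sees the NEW theta0, which is not the simultaneous update we want.
    t0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
    theta0 = t0
    t1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
    theta1 = t1
    return theta0, theta1
```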

Intuition

After reading the previous sections, one may ask, “But what is the algorithm actually doing? How do we know that it minimizes the cost function?” Let us build some intuition. For simplicity, we can reduce the problem to one dimension, so that the cost function depends on only one parameter, θ. The algorithm now reads:

\begin{aligned} re&peat~until~convergence~\{ \\ &\theta := \theta~-~\alpha \frac d {d\theta} J(\theta) \\ \}~~&~ \end{aligned}

(In one variable, the partial derivative is replaced by an ordinary derivative, d/dθ.) The derivative of the cost function at a point is equal to the slope of the tangent to the curve at that point. Now assume that the plot of J(θ) vs θ looks like the one in the figure below:

Cost function (one variable)

First assume that the initial value of θ is at point A in the figure. At this point, the slope of the graph (and hence the derivative) is positive. So the update on θ would look like:

\begin{aligned} \theta := \theta~-~\alpha(positive~number) \end{aligned}

which decreases θ (as α is always positive). If we look at the figure, this means we move closer to the minimum, which is what we want. Similarly, if we were initially at point B in the figure (negative slope), the update would then look like:

\begin{aligned} \theta :&= \theta~-~\alpha(negative~number)\\ &= \theta~+~\alpha(positive~number) \end{aligned}

which increases θ, again moving it closer to the minimum.
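A tiny numerical sketch (my own toy example, not taken from the figure) confirms this behaviour. With the one-parameter cost J(θ) = (θ - 3)², whose minimum is at θ = 3, the same update pulls θ towards the minimum from either side:

```python
# A tiny numerical check of the argument above, using the made-up cost
# J(theta) = (theta - 3)**2, whose minimum is at theta = 3.

def dJ(theta):
    return 2 * (theta - 3)                      # slope of the tangent at theta

def run_descent(theta, alpha=0.1, steps=50):
    for _ in range(steps):
        theta = theta - alpha * dJ(theta)       # theta := theta - alpha * dJ/dtheta
    return theta

print(run_descent(theta=8.0))    # positive slope at the start (like point A): theta decreases towards 3
print(run_descent(theta=-2.0))   # negative slope at the start (like point B): theta increases towards 3
```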

Some additional points

Two additional points worth noting are:

  • In the two-variable case, the partial derivatives measure the ‘slopes’ of the 3D plot of the cost function along two perpendicular directions (the θ0 and θ1 axes), so the same reasoning applies: each update moves the parameters ‘downhill’ towards a minimum.
  • If the cost function has multiple local minima, it is not possible to predict in advance which one the gradient descent algorithm will converge to; the result depends on where the parameters start, as the short sketch below illustrates.
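Here is a small sketch of that second point, using a made-up non-convex cost J(θ) = (θ² - 1)², which has two local minima (at θ = -1 and θ = +1); which one gradient descent lands in depends on the starting value:

```python
# Illustration of the dependence on the starting point, using the made-up
# non-convex cost J(theta) = (theta**2 - 1)**2 with minima at theta = -1 and +1.

def dJ_nonconvex(theta):
    return 4 * theta * (theta**2 - 1)           # derivative of (theta^2 - 1)^2

def run_descent(theta, alpha=0.05, steps=200):
    for _ in range(steps):
        theta = theta - alpha * dJ_nonconvex(theta)
    return theta

print(run_descent(theta=0.5))    # settles near the minimum at +1
print(run_descent(theta=-0.5))   # the same algorithm, started elsewhere, settles near -1
```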

I have introduced the idea of gradient descent in this post. In the next post, I will continue talking about gradient descent and show how it can be applied to linear regression – please do stay tuned!
