Machine Learning: Gradient Descent I

Welcome back! In the previous post, I talked about univariate linear regression, and defined the cost function as:

\begin{aligned} J( \theta_0 , \theta_1) &= \frac 1 {2m} \sum_{i=1}^m (h_\theta(x^{(i)})~-~y^{(i)})^2 \\ where~h_\theta(x) &= \theta_0~+~\theta_1x \end{aligned}

Here m denotes the number of training examples, (x(i), y(i)) denotes the i’th training example, and θ0 and θ1 are the parameters of the model. As we discussed earlier, we need to find the values of θ0 and θ1 so that J is minimized. But how do we actually do this? One algorithm that serves the purpose is gradient descent. The idea behind gradient descent is:

  • Start with some initial θ0 and θ1 (say, θ0 = θ1 = 0)
  • Repeatedly change θ0 and θ1 so as to decrease J(θ0, θ1) gradually, until we reach a minimum
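To make the cost function above concrete before diving into the algorithm, here is a minimal Python sketch of it (the names h, J, xs and ys are mine, not part of the original definition):

```python
# A minimal sketch of the cost function defined above. The training set is assumed
# to be given as two equal-length lists, xs and ys, holding the x^(i) and y^(i) values.

def h(theta0, theta1, x):
    """Hypothesis: h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

def J(theta0, theta1, xs, ys):
    """Cost: J(theta0, theta1) = (1 / 2m) * sum over i of (h_theta(x^(i)) - y^(i))^2."""
    m = len(xs)
    return sum((h(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# For example, for the points (1, 1), (2, 2), (3, 3), the line with theta0 = 0 and
# theta1 = 1 fits the data perfectly, so J(0, 1, [1, 2, 3], [1, 2, 3]) returns 0.0.
```

Gradient descent is simply a procedure for repeatedly nudging θ0 and θ1 so that this number keeps going down.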

Mathematical form

The mathematical form of the gradient descent algorithm is:

\begin{aligned} re&peat~until~convergence~\{ \\ &\theta_j := \theta_j~-~\alpha \frac \partial {\partial\theta_j} J(\theta_0,\theta_1) \\ &(j = 0,1) \\ \}~~&~ \end{aligned}

Let’s walk through this slowly. The := symbol denotes an assignment (a := b means ‘variable a is assigned the value b’) rather than an equality test (a = b means ‘are the values of a and b equal?’). The term:

\begin{aligned} \frac \partial {\partial\theta_j} J(\theta_0,\theta_1) \end{aligned}

is called a partial derivative. These terms (for j = 0 and j = 1) measure the ‘slope’ of the surface in a 3D plot of the function J versus θ0 and θ1 at a particular point, taken along the θ0 and θ1 directions respectively (such a plot might look like the one in the figure below). (More on this later.) The parameter α (alpha) is called the learning rate. It controls how big a step θ0 and θ1 take at every iteration, and is always positive.
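The update rule translates into code almost line by line. Here is a minimal sketch of my own (names like gradient_descent, alpha and tol are made up, not from the post); purely to make the ‘slope’ idea concrete, the partial derivatives are approximated numerically with finite differences rather than derived analytically, since the analytical derivatives for the linear regression cost belong to the next post:

```python
# A sketch of the update rule, assuming a generic cost function J(theta0, theta1).
# The partial derivatives are approximated with central finite differences here,
# just to make the 'slope' interpretation concrete.

def partial_derivative(J, theta0, theta1, j, eps=1e-6):
    """Numerically approximate dJ/dtheta_j at the point (theta0, theta1)."""
    if j == 0:
        return (J(theta0 + eps, theta1) - J(theta0 - eps, theta1)) / (2 * eps)
    return (J(theta0, theta1 + eps) - J(theta0, theta1 - eps)) / (2 * eps)

def gradient_descent(J, alpha=0.1, tol=1e-9, max_iters=10_000):
    theta0, theta1 = 0.0, 0.0                  # start with theta0 = theta1 = 0
    for _ in range(max_iters):                 # "repeat until convergence"
        grad0 = partial_derivative(J, theta0, theta1, j=0)
        grad1 = partial_derivative(J, theta0, theta1, j=1)
        new0 = theta0 - alpha * grad0          # theta_j := theta_j - alpha * dJ/dtheta_j
        new1 = theta1 - alpha * grad1
        if abs(new0 - theta0) < tol and abs(new1 - theta1) < tol:
            break                              # the parameters have stopped moving
        theta0, theta1 = new0, new1
    return theta0, theta1

# Example: a simple bowl-shaped cost with its minimum at (1, -2).
# gradient_descent(lambda t0, t1: (t0 - 1) ** 2 + (t1 + 2) ** 2) returns roughly (1.0, -2.0).
```

Note that this loop computes both gradients before overwriting either parameter, which is exactly the point of the next section.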

Cost function plot
(Source: Andrew Ng’s “Machine Learning” course on Coursera)

Simultaneous Updates

Another important point is that in the above algorithm, θ0 and θ1 must be updated simultaneously. To see what this means, consider the two pseudocode fragments:

\begin{aligned} Fragm&ent~1: \\ t0 :&= \theta_0~-~\alpha \frac \partial {\partial\theta_0} J(\theta_0,\theta_1)\\ t1 :&= \theta_1~-~\alpha \frac \partial {\partial\theta_1} J(\theta_0,\theta_1)\\ \theta_0 :&= t0 \\ \theta_1 :&= t1 \\ Fragm&ent~2: \\ t0 :&= \theta_0~-~\alpha \frac \partial {\partial\theta_0} J(\theta_0,\theta_1)\\ \theta_0 :&= t0 \\ t1 :&= \theta_1~-~\alpha \frac \partial {\partial\theta_1} J(\theta_0,\theta_1)\\ \theta_1 :&= t1 \\ \end{aligned}

Fragment 1 updates θ0 and θ1 simultaneously: the original values of θ0 and θ1 are used to calculate both partial derivative terms, and only then are the parameters overwritten. Fragment 2, however, is not equivalent to fragment 1. Why? In fragment 2, θ0 is updated first, and this updated value is then used to evaluate the partial derivative in the third line. This is not what we want; the original values of θ0 and θ1 must be used to compute both partial derivatives, which is exactly what a simultaneous update guarantees.
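In code, the difference between the two fragments is easy to see. Below is a rough Python rendering (a sketch only; dJ_dtheta0 and dJ_dtheta1 are hypothetical helpers that return the two partial derivatives of J at the given point):

```python
def update_simultaneous(theta0, theta1, alpha, dJ_dtheta0, dJ_dtheta1):
    # Fragment 1: both derivatives are evaluated at the ORIGINAL (theta0, theta1).
    t0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
    t1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
    theta0 = t0
    theta1 = t1
    return theta0, theta1

def update_sequential(theta0, theta1, alpha, dJ_dtheta0, dJ_dtheta1):
    # Fragment 2: theta0 is overwritten before the second derivative is computed,
    # so dJ_dtheta1 sees the NEW theta0, which is not the simultaneous update we want.
    t0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
    theta0 = t0
    t1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
    theta1 = t1
    return theta0, theta1
```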

Intuition

After reading the previous sections, one may ask, “But what is the algorithm actually doing? How do we know that it minimizes the cost function?” Let us build some intuition. For simplicity, we can reduce the problem to one dimension, so that the cost function depends on only one parameter, θ. The algorithm now reads:

\begin{aligned} re&peat~until~convergence~\{ \\ &\theta := \theta~-~\alpha \frac d {d\theta} J(\theta) \\ \}~~&~ \end{aligned}

(In one variable, the partial derivative is replaced by an ordinary derivative, d/dθ.) The derivative of the cost function at a point is equal to the slope of the tangent to the curve at that point. Now assume that the plot of J(θ) vs θ looks like the one in the figure below:

Cost function (one variable)

First assume that the initial value of θ is at point A in the figure. At this point, the slope of the graph (and hence the derivative) is positive. So the update on θ would look like:

\begin{aligned} \theta := \theta~-~\alpha(positive~number) \end{aligned}

which decreases θ (as α is always positive). If we look at the figure, this means we move closer to the minimum, which is what we want. Similarly, if we were initially at point B in the figure (negative slope), the update would then look like:

\begin{aligned} \theta :&= \theta~-~\alpha(negative~number)\\ &= \theta~+~\alpha(positive~number) \end{aligned}

which increases θ, again moving it closer to the minimum.
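A tiny numerical sketch (my own toy example, not taken from the figure) confirms this behaviour. With the one-parameter cost J(θ) = (θ - 3)², whose minimum is at θ = 3, the same update pulls θ towards the minimum from either side:

```python
# A tiny numerical check of the argument above, using the made-up cost
# J(theta) = (theta - 3)**2, whose minimum is at theta = 3.

def dJ(theta):
    return 2 * (theta - 3)                      # slope of the tangent at theta

def run_descent(theta, alpha=0.1, steps=50):
    for _ in range(steps):
        theta = theta - alpha * dJ(theta)       # theta := theta - alpha * dJ/dtheta
    return theta

print(run_descent(theta=8.0))    # positive slope at the start (like point A): theta decreases towards 3
print(run_descent(theta=-2.0))   # negative slope at the start (like point B): theta increases towards 3
```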

Some additional points

Two additional points worth noting are:

  • In the two-variable case, the partial derivatives measure the ‘slopes’ of the 3D plot of the cost function along two perpendicular directions (the θ0 and θ1 axes), so the same reasoning applies: each update moves the parameters ‘downhill’ towards a minimum.
  • If the cost function has multiple local minima, it is not possible to predict in advance which one the gradient descent algorithm will converge to; the result depends on where the parameters start, as the short sketch below illustrates.
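Here is a small sketch of that second point, using a made-up non-convex cost J(θ) = (θ² - 1)², which has two local minima (at θ = -1 and θ = +1); which one gradient descent lands in depends on the starting value:

```python
# Illustration of the dependence on the starting point, using the made-up
# non-convex cost J(theta) = (theta**2 - 1)**2 with minima at theta = -1 and +1.

def dJ_nonconvex(theta):
    return 4 * theta * (theta**2 - 1)           # derivative of (theta^2 - 1)^2

def run_descent(theta, alpha=0.05, steps=200):
    for _ in range(steps):
        theta = theta - alpha * dJ_nonconvex(theta)
    return theta

print(run_descent(theta=0.5))    # settles near the minimum at +1
print(run_descent(theta=-0.5))   # the same algorithm, started elsewhere, settles near -1
```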

I have introduced the idea of gradient descent in this post. In the next post, I will continue talking about gradient descent and show how it can be applied to linear regression – please do stay tuned!
