Machine Learning: Gradient Descent II

Listen to this article

Welcome again! This is a continuation of my previous post about gradient descent. In this post, I will talk about the learning rate, and show how the algorithm can be applied to univariate linear regression. So let’s get started!

The learning rate

As I had mentioned in the previous post, the learning rate α controls the extent to which θ0 and θ1 change at every iteration. How to choose this parameter? The following two factors govern its selection:

  • If α is too small, the changes in θ0 and θ1 at every iteration will also be very small. As a result, gradient descent will be slow to converge.
  • If α is too large, the changes in θ0 and θ1 at every iteration will be so large that you could overshoot the minimum. Because of this, gradient descent may fail to converge, or even diverge.

With these two points in mind, one way to choose α is to choose some initial value of α (say, 0.1), and run gradient descent. As gradient descent runs, plot the cost function J versus the number of iterations. (You must calculate J at every iteration to do this.) If you think the algorithm converges too slowly, increase α; if it is not converging at all, decrease α. Then repeat with the new value of α.

Another point to be noted is that gradient descent can converge to a local minimum, even with the learning rate fixed. As we approach a minimum, the slope of the graph becomes closer and closer to zero, which automatically reduces the magnitude of the ‘step size’. So there is no need to decrease α over time.

Gradient descent for linear regression

Having learnt about gradient descent, let’s see how it is implemented in univariate linear regression. The cost function for this problem is:

\begin{aligned} J( \theta_0 , \theta_1) &= \frac 1 {2m} \sum_{i=1}^m (h_\theta(x^{(i)})~-~y^{(i)})^2 \\ where~h_\theta(x) &= \theta_0~+~\theta_1x \end{aligned}

and the gradient descent algorithm is:

\begin{aligned} re&peat~until~convergence~\{ \\ &\theta_j := \theta_j~-~\alpha \frac \partial {\partial\theta_j} J(\theta_0,\theta_1) \\ &(j = 0,1) \\ \}~~&~ \end{aligned}

The computation of the partial derivatives requires knowledge of calculus, and I will skip the derivation here (if you know calculus, you can verify the results yourself). It can be shown that:

\begin{aligned} 1.~~\frac \partial {\partial\theta_0}J(\theta_0,\theta_1) &= \frac 1 m \sum_{i=1}^m(h_\theta(x^{(i)})~-~y^{(i)})\\ 2.~~\frac \partial {\partial\theta_1}J(\theta_0,\theta_1) &= \frac 1 m \sum_{i=1}^m(h_\theta(x^{(i)})~-~y^{(i)}).x^{(i)} \end{aligned}

Thus, the algorithm can be now written as:

\begin{aligned} re&peat~until~convergence~\{ \\ &\theta_0 := \theta_0~-~\frac \alpha m \sum_{i=1}^m(h_\theta(x^{(i)})~-~y^{(i)})\\ &\theta_1 := \theta_1~-~\frac \alpha m \sum_{i=1}^m(h_\theta(x^{(i)})~-~y^{(i)}).x^{(i)}\\ &(simultaneously~update~\theta_0,~\theta_1)\\ \}~~&~ \end{aligned}

(we have just substituted the values of the partial derivatives into the algorithm). This is the gradient descent algorithm for univariate linear regression, which minimizes the cost function, and hence the squared error of prediction (as discussed earlier).

Points to remember

Here are some additional points to note:

  1. The algorithm above says to ‘repeat until convergence’. How do you check whether the algorithm is converging? Just calculate the value of the cost function J at every iteration, and declare convergence if J decreases by less than, say, 10-5 after each iteration.
  2. As I mentioned in the previous post, gradient descent can converge to any one of the local minima, if there are many of them. However, in the case of univariate linear regression, this is not a problem, since the cost function has only one local minimum, which is also the global minimum.

With this we come to the end of the topic. In later posts, I will explore some more aspects of machine learning – please do stay tuned!

(Visited 179 times, 1 visits today)

Related Posts

Leave a Reply