Week 1 - Machine Learning - Guocheng's Space

Week 1 - Machine Learning

Notes of Week 1 Machine Learning Coursera Course by Stanford University

June 27, 2019 Guocheng Wei

3 minute read

The Hypothesis Function

$$\hat{y} = h_\theta(x) = \theta_0 + \theta_1 x$$

Cost Function

To measure the accuracy of the hypothesis function. This takes an average (actually a fancier version of an average) of all the results of the hypothesis with inputs from x’s compared to the actual output y’s.

$$J(\theta_0, \theta_1) = \dfrac{1}{2m} \displaystyle \sum _{i=1}^m \left( \hat{y}_i- y_i \right)^2 = \dfrac{1}{2m} \displaystyle \sum _{i=1}^m \left (h _\theta(x_i) - y_i \right)^2$$

Break it apart:

$h_\theta (x_i) - y_i$ is the difference between the predicted value and the actual value.

$\bar{x}$ is the mean of all $\left (h_\theta (x_i) - y_i \right)^2$

$J(\theta_0, \theta_1) = \frac{1}{2}\bar{x} $

The function is also called Square Error Function, or Mean squared error

The mean is halfed as a convenience for the computation of the gradient descent, as the derivative of the square function will cancel out the $\frac{1}{2}$ term

Intuition Ⅰ

To simplify the visualization of the cost function, assume $\theta_0$ is 0, which means the cost function is $J(\theta_1)$

Intuition of theta to find the minimal cost function

Intuition Ⅱ

If the $\theta_0$ is not 0, then the contour plot will look like this:

3D Contour Plot

If projected onto a 2d plot

2D Contour Plot

Key features:

Every point on the same ‘circle’ has the same $J(\theta_0, \theta_1)$
The center of the inner most ‘circle’ is the minimal cost function value.

Gradient Descent

To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point.

Example of Gradient Descent

The way we do this is by taking the derivative (the tangential line to a function) of our cost function. The slope of the tangent is the derivative at that point and it will give us a direction to move towards. We make steps down the cost function in the direction with the steepest descent. The size of each step is determined by the parameter α, which is called the learning rate.

As shown above, two points are next to each other, but their local minimal cost function value are different.

The gradient descent algorithm is:

repear until convergence: $$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$$ where $j = 0,1$ represents the feature index number

At each iteration j, one should simultaneously update the parameters $\theta_0, \theta_1, \theta_2, …$. (Simultaneous Update)

Intuition Ⅰ

Simplied gradient descent with $\theta_0$ is 0:

repear until convergence: $$\theta_1 := \theta_1 - \alpha \frac{\partial}{\partial \theta_1} J(\theta_1)$$

When the slope is negative, the value of $\theta_1$ increases.

When it is positive, the value of $\theta_1$ decreases.

How slope affects the movement of theta

Intuition Ⅱ

We should adjust our parameter $\alpha$ to ensure that the gradient descent algorithm converges in a reasonable time.

Failure to converge or too much time to obtain the minimum value imply that our step size is wrong.

Step size affects the convergence

Intuition Ⅲ

Gradient descent can converge to a local minimum, even with the learning rate is fixed.

When approaching the local minimum, slope value is smaller, and so the step size is smaller.

Gradient Descent For Linear Regression

\begin{aligned} repeat until convergence: { \\ θ_{0} := & θ_{0} - α \frac{1}{m} \sum_{i = 1}^{m} (h_{θ} (x_{i}) - y_{i}) \\ θ_{1} := & θ_{1} - α \frac{1}{m} \sum_{i = 1}^{m} ((h_{θ} (x_{i}) - y_{i}) x_{i}) \\ } \end{aligned}

Batch gradient descent: looks at every example in the entire training set on every step

Guocheng Wei