Week 2 - Machine Learning
Multivariate Linear Regression, Normal Equation, Octave Tutorial

Multiple Features
Linear regression with multiple variables is also known as multivariate linear regression.
Notation for the equations:
$$ x_j^{(i)} = \text{value of feature } j \text{ in the }i^{th}\text{ training example} $$
$$ x^{(i)} = \text{the input (features) of the }i^{th}\text{ training example} $$
$$ m = \text{the number of training examples} $$
$$ n = \text{the number of features} $$
The multivariable form of the hypothesis function:
$$ h_\theta (x) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \cdots + \theta_n x_n $$
Assume $$ x_{0}^{(i)} = 1 \text{ for } (i \in \{1, \dots, m\}) $$
Then the multivariable hypothesis function can be concisely represented as:
$$ h_\theta(x) =\begin{bmatrix}\theta_0 \hspace{2em} \theta_1 \hspace{2em} … \hspace{2em} \theta_n\end{bmatrix}\begin{bmatrix}x_0 \newline x_1 \newline \vdots \newline x_n\end{bmatrix}= \theta^T x $$
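As a rough Octave illustration (the sample X, theta, and values below are made up for this sketch), the hypothesis can be evaluated for one example or for the whole training set at once:

```octave
% Hypothetical training set: m = 3 examples, n = 2 features,
% with the extra column of ones for x0 = 1.
X = [1 2104 5;
     1 1416 3;
     1 1534 3];
theta = [80; 0.1; 25];      % column vector [theta0; theta1; theta2]

h_one = theta' * X(1,:)';   % hypothesis for the first example: theta^T x
h_all = X * theta;          % hypotheses for all m examples at once
```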
Gradient Descent for Multiple Variables
Repeat until convergence: { $$\theta_j := \theta_j - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)} \qquad \text{for } j := 0, \dots, n$$ }
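A minimal vectorized Octave sketch of this update, assuming the design matrix X already includes the $x_0 = 1$ column; the function name gradient_descent and its parameters are illustrative, not taken from the course code:

```octave
function theta = gradient_descent(X, y, theta, alpha, num_iters)
  % X: m x (n+1) design matrix (first column all ones), y: m x 1 targets,
  % theta: (n+1) x 1 initial parameters, alpha: learning rate.
  m = length(y);
  for iter = 1:num_iters
    err = X * theta - y;                       % h_theta(x^(i)) - y^(i) for all i
    theta = theta - (alpha / m) * (X' * err);  % simultaneous update of all theta_j
  end
end
```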
Feature Scaling and Mean Normalization
We can speed up gradient descent by having each of our input values in roughly the same range. This is because $\theta$ will descend quickly on small ranges and slowly on large ranges, and so will oscillate inefficiently down to the optimum when the variables are very uneven.
Feature scaling involves dividing the input values by the range (i.e. the maximum value minus the minimum value or the standard deviation) of the input variable, resulting in a new range of just 1.
Mean normalization involves subtracting the average value for an input variable from the values for that input variable resulting in a new average value for the input variable of just zero.
So:
$$x_{i} := \frac{x_{i} - \mu_{i}}{s_{i}}$$
where $\mu_{i}$ is the average of all the values for feature $i$, and $s_{i}$ is the range of values (max − min) or, alternatively, the standard deviation.
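A minimal Octave sketch combining both steps (the function name feature_normalize is made up for this example; std(X) could replace the range in the denominator):

```octave
function [X_norm, mu, s] = feature_normalize(X)
  % Mean-normalize and scale each feature (column) of X.
  % mu: row vector of per-feature means, s: per-feature ranges (max - min).
  mu = mean(X);
  s  = max(X) - min(X);          % or std(X) for the standard deviation variant
  X_norm = (X - mu) ./ s;        % broadcasting subtracts/divides column-wise
end
```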
Debugging Gradient Descent with the Learning Rate
Make a plot with number of iterations on the x-axis. Now plot the cost function, $J(\theta)$ over the number of iterations of gradient descent. If $J(\theta)$ ever increases, then you probably need to decrease $\alpha$.
If $\alpha$ is too small: slow convergence.
If $\alpha$ is too large: may not decrease on every iteration and thus may not converge.
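As a rough Octave sketch of such a convergence plot, with a tiny made-up data set and hand-picked values for alpha and num_iters:

```octave
% Hypothetical small data set, just to illustrate the convergence plot.
X = [1 1; 1 2; 1 3; 1 4];
y = [2; 4; 6; 8];
theta = zeros(2, 1);
alpha = 0.05;
num_iters = 400;
m = length(y);

J_history = zeros(num_iters, 1);
for iter = 1:num_iters
  theta = theta - (alpha / m) * (X' * (X * theta - y));
  J_history(iter) = (1 / (2 * m)) * sum((X * theta - y) .^ 2);  % cost after the update
end

plot(1:num_iters, J_history);
xlabel('Number of iterations');
ylabel('J(\theta)');   % if this curve ever rises, decrease alpha
```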
Polynomial Regression
We can change the behavior or curve of our hypothesis function by making it a quadratic, cubic or square root function (or any other form).
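For example, with a single feature $x_1$ we can create new features such as $x_1^2$, $x_1^3$, or $\sqrt{x_1}$. A small Octave sketch of building such design matrices (the sample values are made up):

```octave
% Hypothetical single feature x (e.g. house size); build quadratic, cubic,
% and square-root versions of it as extra columns of the design matrix.
x = [100; 200; 300; 400];
X_poly = [ones(size(x)), x, x .^ 2, x .^ 3];   % h = t0 + t1*x + t2*x^2 + t3*x^3
X_sqrt = [ones(size(x)), x, sqrt(x)];          % h = t0 + t1*x + t2*sqrt(x)
% Note: powers of x make the feature ranges differ wildly, so feature
% scaling (previous section) becomes important here.
```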
Normal Equation
$$\theta = (X^TX)^{-1}X^Ty$$
A method of finding the optimal $\theta$ analytically, without iteration.
No need to do feature scaling with the normal equation.
| Gradient Descent | Normal Equation |
|---|---|
| Need to choose alpha | No need to choose alpha |
| Needs many iterations | No need to iterate |
| $O(kn^2)$ | $O(n^3)$, need to calculate $(X^TX)^{-1}$ |
| Works well when n is large | Slow if n is very large |
Noninvertibility
If $X^TX$ is noninvertible, the common causes are:
- Redundant features, where two features are very closely related (i.e. they are linearly dependent)
- Too many features (e.g. m ≤ n). In this case, delete some features or use “regularization” (to be explained in a later lesson).
When implementing the normal equation in Octave, we want to use the ‘pinv’ function rather than ‘inv’.
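A minimal Octave sketch of the normal equation using pinv (the sample X and y values are made up for illustration):

```octave
% Hypothetical design matrix X (with the x0 = 1 column) and target vector y.
X = [1 2104 5;
     1 1416 3;
     1 1534 3;
     1  852 2];
y = [460; 232; 315; 178];

theta = pinv(X' * X) * X' * y;   % normal equation; pinv stays robust even if X'X is singular
```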