Week 2 - Machine Learning
Multivariate Linear Regression, Normal Equation, Octave Tutorial

Multiple Features
Linear regression with multiple variables is also known as multivariate linear regression.
The notation for equations:
- $x_j^{(i)}$ = value of feature $j$ in the $i$th training example
- $x^{(i)}$ = the input (features) of the $i$th training example
- $m$ = the number of training examples
- $n$ = the number of features
The multivariable form of the hypothesis function:
$h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \cdots + \theta_n x_n$
Assume $x_0^{(i)} = 1$ for $i \in 1, \dots, m$.
Then the multivariable hypothesis function can be concisely represented as:
$h_\theta(x) = \begin{bmatrix} \theta_0 & \theta_1 & \cdots & \theta_n \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix} = \theta^T x$
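In Octave this is a one-line matrix product; here theta and x are assumed to be (n+1)-element column vectors, and X an m x (n+1) design matrix whose rows are the training examples:

```octave
h = theta' * x;     % prediction h_theta(x) for a single example x
h_all = X * theta;  % m x 1 vector of predictions for every training example
```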
Gradient Descent for Multiple Variables
Repeat until convergence: {
$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) \cdot x_j^{(i)}$ for $j := 0 \dots n$
}
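A minimal vectorized sketch of this update in Octave, assuming X already has a leading column of ones and that alpha and num_iters are chosen by hand (the function name is illustrative):

```octave
% Gradient descent for multivariate linear regression.
% X: m x (n+1) design matrix (first column all ones), y: m x 1 targets.
function theta = gradient_descent(X, y, theta, alpha, num_iters)
  m = length(y);
  for iter = 1:num_iters
    % X * theta - y is the m x 1 vector of prediction errors; multiplying by
    % X' sums error * x_j^(i) over all examples for every j at once.
    theta = theta - (alpha / m) * (X' * (X * theta - y));
  end
end
```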
Feature Scaling and Mean Normalization
We can speed up gradient descent by having each of our input values in roughly the same range. This is because θ will descend quickly on small ranges and slowly on large ranges, and so will oscillate inefficiently down to the optimum when the variables are very uneven.
Feature scaling involves dividing the input values by the range (i.e. the maximum value minus the minimum value or the standard deviation) of the input variable, resulting in a new range of just 1.
Mean normalization involves subtracting the average value for an input variable from the values for that input variable resulting in a new average value for the input variable of just zero.
So:
$x_i := \dfrac{x_i - \mu_i}{s_i}$
where $\mu_i$ is the average of all the values for feature $i$, and $s_i$ is either the range of values (max − min) or the standard deviation.
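A short Octave sketch of mean normalization, using the standard deviation for $s_i$ (that choice, and the function name, are assumptions for illustration):

```octave
% Normalize each feature (column) of X: subtract its mean, divide by its
% standard deviation. mu and sigma are returned so new examples can be
% scaled the same way before making predictions.
function [X_norm, mu, sigma] = feature_normalize(X)
  mu     = mean(X);             % 1 x n row vector of per-feature means
  sigma  = std(X);              % 1 x n row vector of per-feature std devs
  X_norm = (X - mu) ./ sigma;   % Octave broadcasts row vectors column-wise
end
```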
Debugging Gradient Descent by Learning Rate
Make a plot with the number of iterations on the x-axis and the cost function J(θ) on the y-axis. If J(θ) ever increases over the iterations of gradient descent, then you probably need to decrease α.
If α is too small: slow convergence.
If α is too large: may not decrease on every iteration and thus may not converge.
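One way to make that diagnostic plot in Octave, assuming X, y, theta, alpha, num_iters, and m are already set up as above:

```octave
% Record the squared-error cost J(theta) at every iteration, then plot it
% against the iteration number.
J_history = zeros(num_iters, 1);
for iter = 1:num_iters
  theta = theta - (alpha / m) * (X' * (X * theta - y));
  J_history(iter) = (1 / (2 * m)) * sum((X * theta - y) .^ 2);  % cost J(theta)
end
plot(1:num_iters, J_history);
xlabel('Number of iterations');
ylabel('Cost J(\theta)');   % a curve that flattens out indicates convergence
```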
Polynomial Regression
We can change the behavior or curve of our hypothesis function by making it a quadratic, cubic or square root function (or any other form).
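For example, with a single feature $x_1$ we can add $x_1^2$ and $x_1^3$ as extra columns. A small Octave sketch (reusing the feature_normalize sketch above, since the higher powers make feature scaling especially important):

```octave
% Build cubic features from a single input column x (m x 1), giving the
% hypothesis h_theta(x) = theta0 + theta1*x + theta2*x^2 + theta3*x^3.
X_poly = [x, x .^ 2, x .^ 3];
% The ranges of x, x^2, and x^3 differ wildly, so normalize the features
% before adding the column of ones and running gradient descent.
[X_poly, mu, sigma] = feature_normalize(X_poly);
X_poly = [ones(length(x), 1), X_poly];
```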
Normal Equation
$\theta = (X^T X)^{-1} X^T y$
A method of finding the optimum theta without iteration.
No need to do feature scaling with the normal equation.
| Gradient Descent | Normal Equation |
|---|---|
| Need to choose alpha | No need to choose alpha |
| Needs many iterations | No need to iterate |
| $O(kn^2)$ | $O(n^3)$, need to calculate $(X^T X)^{-1}$ |
| Works well when $n$ is large | Slow if $n$ is very large |
Noninvertibility
If $X^T X$ is noninvertible, the common causes might be:
- Redundant features, where two features are very closely related (i.e. they are linearly dependent)
- Too many features (e.g. m ≤ n). In this case, delete some features or use “regularization” (to be explained in a later lesson).
When implementing the normal equation in Octave, we want to use the `pinv` function rather than `inv`.
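A one-line sketch of the whole method, assuming X is the m x (n+1) design matrix with a leading column of ones and y is the m x 1 target vector:

```octave
% Normal equation: closed-form theta with no iteration and no feature scaling.
% pinv computes the pseudoinverse, so it still returns a usable theta even
% when X' * X is noninvertible, which is why it is preferred over inv here.
theta = pinv(X' * X) * X' * y;
```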