Week 2 - Machine Learning

Multivariate Linear Regression, Normal Equation, Octave Tutorial

Multiple Features

Linear regression with multiple variables is also known as multivariate linear regression.

The notation for equations:

$x_j^{(i)}$ = value of feature $j$ in the $i^{th}$ training example

$x^{(i)}$ = the input (features) of the $i^{th}$ training example

$m$ = the number of training examples

$n$ = the number of features

The multivariable form of the hypothesis function:

$$h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \cdots + \theta_n x_n$$

Assume $x_0^{(i)} = 1$ for $i \in \{1, \ldots, m\}$.

Then the multivariable hypothesis function can be concisely represented as:

$$h_\theta(x) = \begin{bmatrix} \theta_0 & \theta_1 & \cdots & \theta_n \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix} = \theta^T x$$
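A quick Octave sketch of this vectorized form (the numbers are made up; only the shapes matter):

```octave
% Hypothetical numbers; only the shapes matter here
X = [1 2104 5;
     1 1416 3;
     1 1534 3];          % m = 3 examples, each row is [x0 x1 x2] with x0 = 1
theta = [89; 0.1; 20];   % (n+1) x 1 column vector of parameters

predictions = X * theta; % m x 1 vector, one h_theta(x^(i)) per training example
```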


Gradient Descent for Multiple Variables

Repeat until convergence: {

$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \quad \text{for } j := 0 \ldots n$$

}
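A minimal vectorized sketch of this update rule in Octave (the function name gradientDescent and its arguments are my own, not the course's starter code):

```octave
% Vectorized gradient descent for linear regression.
% X: m x (n+1) design matrix, y: m x 1 targets, theta: (n+1) x 1 parameters.
function theta = gradientDescent(X, y, theta, alpha, num_iters)
  m = length(y);
  for iter = 1:num_iters
    errors = X * theta - y;                       % h_theta(x^(i)) - y^(i) for all i
    theta = theta - (alpha / m) * (X' * errors);  % simultaneous update of every theta_j
  end
end
```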


Feature Scaling and Mean Normalization

We can speed up gradient descent by having each of our input values in roughly the same range. This is because θ will descend quickly on small ranges and slowly on large ranges, and so will oscillate inefficiently down to the optimum when the variables are very uneven.

Feature scaling involves dividing the input values by the range (i.e. the maximum value minus the minimum value or the standard deviation) of the input variable, resulting in a new range of just 1.

Mean normalization involves subtracting the average value for an input variable from the values for that input variable resulting in a new average value for the input variable of just zero.

So:

$$x_i := \frac{x_i - \mu_i}{s_i}$$

where $\mu_i$ is the average of all values for feature $i$, and $s_i$ is either the range of values (max - min) or the standard deviation.
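A small Octave sketch of mean normalization, using the standard deviation as $s_i$ (the max - min range would work the same way); here X holds only the raw feature columns, without the $x_0 = 1$ column:

```octave
X = [2104 5; 1416 3; 1534 3; 852 2];  % hypothetical raw features (size, bedrooms)

mu = mean(X);                  % 1 x n row of feature averages (mu_i)
sigma = std(X);                % 1 x n row of standard deviations (s_i)
X_norm = (X - mu) ./ sigma;    % each column now has mean 0 and comparable scale
```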


Debugging Gradient Descent by Learning Rate

Make a plot with the number of iterations on the x-axis and the cost function $J(\theta)$ on the y-axis. If $J(\theta)$ ever increases, you probably need to decrease $\alpha$.

[Figure: $J(\theta)$ plotted against the number of iterations]

  • If $\alpha$ is too small: slow convergence.
  • If $\alpha$ is too large: $J(\theta)$ may not decrease on every iteration and thus may not converge.
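For example, you can record $J(\theta)$ after every iteration and plot it. The sketch below reuses the X_norm matrix from the feature scaling example; the target values, alpha, and the iteration count are illustrative only:

```octave
X = [ones(size(X_norm, 1), 1) X_norm];  % prepend the x0 = 1 column
y = [460; 232; 315; 178];               % hypothetical target values
theta = zeros(size(X, 2), 1);
alpha = 0.1;
num_iters = 400;
m = length(y);

J_history = zeros(num_iters, 1);
for iter = 1:num_iters
  theta = theta - (alpha / m) * (X' * (X * theta - y));
  J_history(iter) = (1 / (2 * m)) * sum((X * theta - y) .^ 2);  % cost J(theta)
end

plot(1:num_iters, J_history);
xlabel('Number of iterations');
ylabel('J(\theta)');
```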


Polynomial Regression

We can change the behavior or curve of our hypothesis function by making it a quadratic, cubic or square root function (or any other form).
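For instance, to fit a cubic in a single feature you can add its powers as extra columns of the design matrix (these new features grow quickly, so feature scaling becomes even more important):

```octave
x1 = [2104; 1416; 1534; 852];              % hypothetical single feature (house size)

% h_theta(x) = theta0 + theta1*x1 + theta2*x1^2 + theta3*x1^3
X_poly = [ones(size(x1)) x1 x1.^2 x1.^3];  % m x 4 design matrix of polynomial features
```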


Normal Equation

$$\theta = (X^T X)^{-1} X^T y$$

The normal equation is a method of finding the optimal $\theta$ analytically, without iteration. There is no need to do feature scaling with the normal equation.
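In Octave this is a one-liner, assuming X and y are already built as above (pinv is used instead of inv, as noted under noninvertibility below):

```octave
% X: m x (n+1) design matrix (with the x0 = 1 column), y: m x 1 targets
theta = pinv(X' * X) * X' * y;
```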

Proof of the normal equation

| Gradient Descent | Normal Equation |
| --- | --- |
| Need to choose $\alpha$ | No need to choose $\alpha$ |
| Needs many iterations | No need to iterate |
| $O(kn^2)$ | $O(n^3)$, need to calculate $(X^T X)^{-1}$ |
| Works well when $n$ is large | Slow if $n$ is very large |

Noninvertibility

If $X^T X$ is noninvertible, the common causes are:

  • Redundant features, where two features are very closely related (i.e. they are linearly dependent)
  • Too many features (e.g. m ≤ n). In this case, delete some features or use “regularization” (to be explained in a later lesson).

When implementing the normal equation in Octave, we want to use the ‘pinv’ function rather than ‘inv’; pinv computes the pseudo-inverse and still returns a value of $\theta$ even when $X^T X$ is noninvertible.
