Week 6 - Machine Learning - Guocheng's Space

Week 6 - Machine Learning

Model Selection, ML Diagnostic Method, DNN Diagnostic Method, Error Metrics for Skewed Classes, Vanishing / Exploding Gradient

July 6, 2019 Guocheng Wei

4 minute read

Deciding What to Try Next

Errors in your predictions can be troubleshooted by:

Getting more training examples
Trying smaller sets of features
Trying additional features
Trying polynomial features
Increasing or decreasing $\lambda$

Model Selection and Train/Validation/Test Sets

Test Error

$$ J_{test}(\Theta) = \dfrac{1}{2m_{test}} \sum_{i=1}^{m_{test}}(h_\Theta(x^{(i)}_{test}) - y^{(i)}_{test})^2 $$

Just because a learning algorithm fits a training set well, that does not mean it is a good hypothesis.
The error of your hypothesis as measured on the data set with which you trained the parameters will be lower than any other data set.

Use of the CV set

To solve this, we can introduce a third set, the Cross Validation Set, to serve as an intermediate set that we can train d with. Then our test set will give us an accurate, non-optimistic error.

Model selection:

Optimize the parameters in $\Theta$ using the training set for each polynomial degree.
Find the polynomial degree d with the least error using the cross validation set.
Estimate the generalization error using the test set with $J_{test}(\Theta^{(d)})$ (d = theta from polynomial with lower error);

CV Model

Diagnosing Bias vs. Variance

The training error will tend to decrease as we increase the degree d of the polynomial.

At the same time, the cross validation error will tend to decrease as we increase d up to a point, and then it will increase as d is increased, forming a convex curve.

Bias vs. Variance

Regularization and Bias/Variance

Large $\lambda$: High bias (underfitting)
Intermediate $\lambda$: just right
Small $\lambda$: High variance (overfitting)

Large/Small Lambda

In order to choose the model and the regularization $\lambda$, we need:

Create a list of lambdas (i.e. $\lambda \in\{0,0.01,0.02,0.04,0.08,0.16,0.32,0.64,1.28,2.56,5.12,10.24\}$);
Create a set of models with different degrees or any other variants.
Iterate through the $\lambda$s and for each $\lambda$ go through all the models to learn some $\Theta$.
Compute the cross validation error using the learned $\lambda$ (computed with $\lambda$) on the $J_{CV}(\Theta)$ without regularization or $\lambda = 0$.
Select the best combo that produces the lowest error on the cross validation set.
Using the best combo $\Theta$ and $\lambda$, apply it on $J_{test}(\Theta)$ to see if it has a good generalization of the problem.

Learning Curves

With high bias:

Low training set size: causes $J_{train}(\Theta)$ to be low and $J_{CV}(\Theta)$ to be high.
Large training set size: causes both $J_{train}(\Theta)$ and $J_{CV}(\Theta)$ to be high with $J_{train}(\Theta) \approx J_{CV}(\Theta)$.

If a learning algorithm is suffering from high bias, getting more training data will not (by itself) help much.

Learning Curve with High Bias

With high variance:

Low training set size: $J_{train}(\Theta)$ will be low and $J_{CV}(\Theta)$ will be high.
Large training set size: $J_{train}(\Theta)$ increases with training set size and $J_{CV}(\Theta)$ continues to decrease without leveling off. Also, $J_{train}(\Theta) < J_{CV}(\Theta)$ but the difference between them remains significant.

Learning Curve with High Variance

Deciding What to Do Next Revisited

Method	Usage
Getting more training examples	Fixes high variance
Trying smaller sets of features	Fixes high variance
Adding features	Fixes high bias
Adding polynomial features	Fixes high bias
Decreasing $\lambda$	Fixes high bias
Increasing $\lambda$	Fixes high variance

Diagnosing Deep Neural Networks

A neural network with fewer parameters is prone to underfitting. It is also computationally cheaper.
A large neural network with more parameters is prone to overfitting. It is also computationally expensive. In this case you can use regularization (increase $\lambda$) to address the overfitting.

Error Metrics for Skewed Classes

It is sometimes difficult to tell whether a reduction in error is actually an improvement of the algorithm.

For example: In predicting a cancer diagnoses where 0.5% of the examples have cancer, we find our learning algorithm has a 1% error. However, if we were to simply classify every single example as a 0, then our error would reduce to 0.5% even though we did not improve the algorithm.

This usually happens with skewed classes.

For this we can use Precision/Recall

Error Table

Precision: $$ \dfrac{\text{True Positives}}{\text{Total number of predicted positives}} = \dfrac{\text{True Positives}}{\text{True Positives}+\text{False Positives}} $$

Recall: $$ \dfrac{\text{True Positives}}{\text{Total number of actual positives}} = \dfrac{\text{True Positives}}{\text{True Positives}+\text{False Negatives}} $$

By setting the threshold higher (i.e, $h_\theta(x) \geq 0.7$ predict 1), you can get a confident prediction, higher precision but lower recall

By setting the threshold lower (i.e, $h_\theta(x) \geq 0.3$ predict 1), you can get a safe prediction, higher recall but lower precision

Use F score to leverage the two metrics: $$\text{F score} = 2 \frac{PR}{P+R}$$

Guocheng Wei