Evaluating machine learning algorithms: training set, cross validation set, test set, bias, variance, learning curves, and improving algorithm performance.

1. Evaluating a Learning Algorithm

I would like to give full credit to the respective authors, as these are my personal Python notebooks taken from deep learning courses by Andrew Ng, Data School and Udemy :) This is a simple Python notebook hosted generously through GitHub Pages, living in my main personal notes repository at https://github.com/ritchieng/ritchieng.github.io. The notes are meant for my personal review, but I have open-sourced the repository since a lot of people have found it useful.

1a. Deciding what to try next

  • Suppose you have implemented regularized linear regression to predict housing prices
    • However, when you test your hypothesis on a new set of houses, you find that it makes unacceptably large errors
      • You can do the following
        • Get more training data
        • Smaller set of features
        • Get additional features
        • Try adding polynomial features
        • Try decreasing lambda
        • Try increasing lambda
      • Typically, people pick one of these avenues at random and only later find out that it was not suitable
      • There is a simple technique to weed out unsuitable avenues
        • Machine Learning Diagnostic
          • A test that you can run to gain insight into what is or isn’t working with a learning algorithm, and guidance on how best to improve its performance
          • Diagnostics can take time to implement, but doing so can be a very good use of your time
          • It is worth the effort compared to spending months pursuing an unsuitable avenue

1b. Evaluating a hypothesis

  • In fitting parameters to your training data, you minimise the training error; however, a low training error alone does not tell you whether the hypothesis generalizes, because it may be overfitting
  • How to tell if over-fitting?
    • With few features, you can plot the hypothesis and inspect the fit
    • For many features: training/testing procedure
      • Split the data into 2 portions (typically 70% training / 30% test)
        • Training set
        • Test set
          • Randomly re-order data before splitting
  • Training/testing procedure: linear regression
    • Learn θ from the training set, then compute Jtest(θ), the average squared error on the test set
  • Training/testing procedure: logistic regression
    • Learn θ from the training set, then compute Jtest(θ) on the test set, or the misclassification (0/1) error
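
A minimal sketch of this training/testing procedure with scikit-learn; the synthetic data, the 70/30 split and the model choices below are illustrative assumptions, not part of the course material.

```python
# Sketch of the training/testing procedure (assumed synthetic data, 70/30 split).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error, accuracy_score

rng = np.random.RandomState(0)
X = rng.rand(100, 3)                                          # 100 examples, 3 features
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.randn(100)     # regression target
y_class = (y > y.mean()).astype(int)                          # binary target for logistic regression

# Randomly re-order and split the data: 70% training / 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Linear regression: Jtest(θ) is the average squared error on the test set
lin = LinearRegression().fit(X_train, y_train)
j_test = mean_squared_error(y_test, lin.predict(X_test)) / 2  # divide by 2 to mirror (1/2m)·Σ(hθ(x)−y)²
print("linear regression Jtest:", j_test)

# Logistic regression: report the misclassification (0/1) error on the test set
Xc_train, Xc_test, yc_train, yc_test = train_test_split(X, y_class, test_size=0.3, random_state=0)
log = LogisticRegression().fit(Xc_train, yc_train)
print("logistic regression 0/1 test error:", 1 - accuracy_score(yc_test, log.predict(Xc_test)))
```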

1c. Model selection and Train/Validation/Test Sets

  • Model selection
    • We can introduce an extra parameter d, the degree of the polynomial
    • For each d, fit the parameters θ on the training set and measure the test error of the resulting hypothesis
      • If you choose d = 5 and want to determine how well the model generalizes, you can report the test set error Jtest(θ5)
        • But there is a problem: since d was chosen using the test set, Jtest(θ5) is likely an optimistic estimate of the generalization error
  • To address the problem, we can do the following
    • Split data into 3 categories (typically 60% / 20% / 20%)
      • Training set
      • Cross validation set (also called validation set or CV set)
      • Test set
    • You would have the following 3 errors
      • Training error
      • Cross validation (CV) error
      • Test error
  • We then select the model using the cross validation set
    • Fit each candidate degree d on the training set and pick the hypothesis with the lowest CV error
    • Report the generalization error as the test set error of the chosen hypothesis
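
A minimal sketch of this model-selection procedure, assuming scikit-learn, a made-up 1-D dataset and a 60/20/20 train/CV/test split; the candidate degrees are assumptions for illustration.

```python
# Sketch: choose the polynomial degree d with the lowest CV error, then report
# the generalization error on the held-out test set (assumed data and splits).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(1)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X).ravel() + 0.1 * rng.randn(200)

# 60% training / 20% cross validation / 20% test
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=1)
X_cv, X_test, y_cv, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=1)

models, cv_error = {}, {}
for d in range(1, 11):                                    # candidate degrees d = 1..10
    model = make_pipeline(PolynomialFeatures(degree=d), LinearRegression())
    models[d] = model.fit(X_train, y_train)               # fit θ on the training set only
    cv_error[d] = mean_squared_error(y_cv, model.predict(X_cv))

best_d = min(cv_error, key=cv_error.get)                  # hypothesis with the lowest CV error
print("best d:", best_d,
      "Jtest:", mean_squared_error(y_test, models[best_d].predict(X_test)))
```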

2. Bias vs Variance

2a. Diagnosing Bias vs Variance

  • When you run an algorithm and it doesn’t do as well as you hope, it typically has a high bias or high variance issue
    • High bias (underfitting)
    • High variance (overfitting)
  • Plot error against degree of polynomial, d
    • As you increase the degree of the polynomial, d
      • Training error keeps decreasing as the model moves from underfitting to overfitting
      • Cross validation (CV) error example
        • d = 1: underfitting, high CV error
        • d = 2: lower CV error due to better fit
        • d = 4: overfitting, high CV error
  • How do we distinguish between a high bias or a high variance issue?
    • High bias (underfitting)
      • High Jtrain(θ)
      • Jcv(θ) ≈ Jtrain(θ)
    • High variance (overfitting)
      • Low Jtrain(θ)
      • Jcv(θ) ≫ Jtrain(θ)
        • Much greater, as seen on the right of the plot of error against d
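
A minimal sketch of this diagnostic plot using numpy's polyfit on made-up 1-D data (the dataset and split are assumptions): both curves high and close together signal high bias, a large gap signals high variance.

```python
# Sketch: plot Jtrain and Jcv against polynomial degree d (assumed synthetic data).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.RandomState(2)
x = rng.uniform(-1, 1, 150)
y = np.sin(3 * x) + 0.2 * rng.randn(150)

idx = rng.permutation(len(x))                             # randomly re-order before splitting
split = int(0.7 * len(x))
x_train, y_train = x[idx[:split]], y[idx[:split]]
x_cv, y_cv = x[idx[split:]], y[idx[split:]]

degrees = range(1, 11)
j_train, j_cv = [], []
for d in degrees:
    theta = np.polyfit(x_train, y_train, d)               # fit θ on the training set
    j_train.append(np.mean((np.polyval(theta, x_train) - y_train) ** 2) / 2)
    j_cv.append(np.mean((np.polyval(theta, x_cv) - y_cv) ** 2) / 2)

plt.plot(degrees, j_train, marker="o", label="Jtrain")
plt.plot(degrees, j_cv, marker="o", label="Jcv")
plt.xlabel("degree of polynomial, d")
plt.ylabel("error")
plt.legend()
plt.show()
```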

2b. Regularization and Bias/Variance

  • Linear regression with regularization
    • Large λ
      • High bias (underfit)
    • Small λ
      • High variance (overfit)
  • So how do we choose a good value of λ?
    • hθ(x): hypothesis
    • J(θ): cost function; optimization objective
    • Jtrain(θ), Jcv(θ), Jtest(θ): optimization objectives without regularization terms
    • Steps (see the sketch at the end of this section)
      • Try a range of λ values, e.g. 0, 0.01, 0.02, 0.04, ..., 10, roughly doubling each time
      • For each λ, minimise the regularized J(θ) to obtain the fitted parameters θ
      • For each fitted θ, evaluate the unregularized Jcv(θ)
      • Choose the θ with the lowest Jcv(θ), θ_low
        • Where θ_low is θ_5 in the example since Jcv(θ_5) is the lowest
      • Report the generalization error Jtest(θ_low)
  • How do Jtrain(θ) and Jcv(θ) vary as we vary λ?
    • Jtrain(θ)
      • Small λ
        • Regularization term is small
        • Hypothesis fits better to the data
        • Low Jtrain(θ)
      • Large λ
        • Regularization term is large
        • Hypothesis does not fit well to the data
        • High Jtrain(θ)
    • Jcv(θ)
      • Large λ
        • Regularization term is large
        • High bias (underfitting)
        • Large Jcv(θ)
      • Small λ
        • Regularization term is small
        • High variance (overfitting)
        • Large Jcv(θ), because the overfitted hypothesis does not generalize well
    • For a real dataset, the graph is messier, but the general trend is similar
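
A minimal sketch of the λ-selection steps above, using scikit-learn's Ridge where alpha plays the role of λ; the data, the 60/20/20 split and the candidate λ values are illustrative assumptions.

```python
# Sketch: for each λ, minimise the regularized cost, then evaluate the
# unregularized Jcv and pick the λ with the lowest CV error (assumed data).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(3)
X = rng.randn(300, 20)
y = X[:, 0] - 2 * X[:, 1] + 0.5 * rng.randn(300)

X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=3)
X_cv, X_test, y_cv, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=3)

lambdas = [0.01, 0.02, 0.04, 0.08, 0.16, 0.32, 0.64, 1.28, 2.56, 5.12, 10.24]
fitted, j_train, j_cv = {}, {}, {}
for lam in lambdas:
    fitted[lam] = Ridge(alpha=lam).fit(X_train, y_train)                      # minimise regularized J(θ)
    j_train[lam] = mean_squared_error(y_train, fitted[lam].predict(X_train))  # no regularization term
    j_cv[lam] = mean_squared_error(y_cv, fitted[lam].predict(X_cv))           # no regularization term

best_lam = min(j_cv, key=j_cv.get)                                            # lowest Jcv(θ)
print("best lambda:", best_lam,
      "Jtest:", mean_squared_error(y_test, fitted[best_lam].predict(X_test)))
# Plotting j_train and j_cv against lambdas reproduces the trends described above:
# Jtrain grows with λ, while Jcv is high for both very small and very large λ.
```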

2c. Learning Curves

  • What is the effect of m, number of training examples, on training error?
    • For m = 1, 2, 3 in the example
      • If the training set is small
      • Easier to fit every single training example perfectly
      • Your training error = 0 or small
    • For m = 4, 5, 6
      • If the training set grows larger
      • Harder to fit every single training example perfectly
      • Your training error increases
    • In general, when m increases, training error increases
  • What is the effect of m, number of training examples, on cross validation error?
    • The more data you have (larger m), the better the hypothesis generalizes
      • So your cross validation error decreases
  • High Bias (Underfit)
    • Poor performance on both the training and cross validation sets
    • Your cross validation error decreases, but it flattens out at a high value
      • Even with a large m, you are still fitting a straight line, which has high bias
      • So your cross validation error stays high
    • Your training error increases until it is close to the level of your cross validation error
    • If a learning algorithm is suffering from high bias, getting more training data will not (by itself) help much
      • As seen from the two graphs, even with a higher m, there’s no use collecting more data to decrease your cross validation error
  • High Variance (Overfit)
    • There is a gap between the errors: training error is low while cross validation error is high
    • Training error remains small
      • This happens when the hypothesis is very flexible, for example a high-order polynomial with a small λ
      • It still increases with m, because it becomes harder to fit every example perfectly
    • Cross validation error remains high
      • It decreases slowly as m grows, but a large gap to the training error persists
    • If a learning algorithm is suffering from high variance, getting more data is likely to help
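
A minimal sketch of a learning curve using scikit-learn's learning_curve helper; the estimator and data are assumptions, and the shape of the two curves (persistent gap vs. shared plateau) is what indicates high variance vs. high bias.

```python
# Sketch: plot Jtrain and Jcv as the number of training examples m grows.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.model_selection import learning_curve

rng = np.random.RandomState(5)
X = rng.randn(500, 10)
y = X[:, 0] - X[:, 1] + 0.5 * rng.randn(500)

sizes, train_scores, cv_scores = learning_curve(
    Ridge(alpha=1.0), X, y, cv=5,
    train_sizes=np.linspace(0.1, 1.0, 8),
    scoring="neg_mean_squared_error")

# Scores are negative MSE, so negate them to get errors and average over the folds
plt.plot(sizes, -train_scores.mean(axis=1), marker="o", label="Jtrain")
plt.plot(sizes, -cv_scores.mean(axis=1), marker="o", label="Jcv")
plt.xlabel("m (number of training examples)")
plt.ylabel("error")
plt.legend()
plt.show()
```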

2d. Improving Algorithm Performance

  • Suppose you have implemented regularized linear regression to predict housing prices
    • However, when you test your hypothesis on a new set of houses, you find that it makes unacceptably large errors
      • You can do the following
        • Get more training data
          • Fixes high variance
        • Smaller set of features
          • Fixes high variance
            • Helps when there are too many features, so the model is overly complex
        • Get additional features
          • Fixes high bias
            • Features are too simple
        • Try adding polynomial features
          • Fixes high bias
            • Too low d
        • Try decreasing lambda
          • Fixes high bias
            • Because the regularization term becomes smaller, letting the parameters fit the training data more closely
        • Try increasing lambda
          • Fixes high variance
            • Because the regularization term becomes larger, penalising large parameters and smoothing the hypothesis
  • Neural Networks and Overfitting
    • If you are fitting a neural network, you can use a small or a large neural network (see the sketch below)
      • Small neural network
        • 1 input layer, 1 hidden layer, 1 output layer
        • Fewer parameters, more prone to underfitting
        • Computationally cheaper
      • Large neural network
        • 1 input layer, multiple hidden layers, 1 output layer
        • More parameters, more prone to overfitting
        • Computationally more expensive
        • Use regularization (λ) to address overfitting; a larger, well-regularized network usually performs better than a smaller one
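
A minimal sketch (an assumption, not from the course) comparing a small and a large network with scikit-learn's MLPClassifier, where the alpha parameter is the L2 regularization strength used to keep the larger network from overfitting.

```python
# Sketch: small vs. large neural network, with regularization on the large one.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=6)
X_train, X_cv, y_train, y_cv = train_test_split(X, y, test_size=0.3, random_state=6)

# Small network: one hidden layer, computationally cheaper, more prone to underfitting
small = MLPClassifier(hidden_layer_sizes=(5,), max_iter=2000, random_state=6)
# Large network: several hidden layers, more prone to overfitting, so regularize via alpha (λ)
large = MLPClassifier(hidden_layer_sizes=(50, 50, 50), alpha=1.0, max_iter=2000, random_state=6)

for name, model in [("small", small), ("large + regularization", large)]:
    model.fit(X_train, y_train)
    print(name, "train acc:", model.score(X_train, y_train), "cv acc:", model.score(X_cv, y_cv))
```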