Evaluate bias and variance with a learning curve
Learning Curve

Learning Curve Theory

  • Graph that compares the performance of a model on training and testing data over a varying number of training instances
  • We should generally see performance improve as the number of training points increases
  • When we separate training and testing sets and graph them individually
    • We can get an idea of how well the model can generalize to new data
  • A learning curve allows us to verify when a model has learned as much as it can about the data
  • This occurs when
    1. The performances on the training and testing sets reach a plateau
    2. There is a consistent gap between the two error rates
  • The key is to find the sweet spot that minimizes bias and variance by finding the right level of model complexity
  • Of course, with more data almost any model can improve, and different models may be optimal at different training-set sizes
  • For more in-depth theoretical coverage of learning curves, you can view a guide by Andrew Ng that I have compiled here

Types of learning curves

  • Bad Learning Curve: High Bias
    • When training and testing errors converge and are high
      • No matter how much data we feed the model, the model cannot represent the underlying relationship and has high systematic errors
      • Poor fit
      • Poor generalization
  • Bad Learning Curve: High Variance
    • When there is a large gap between the errors
      • Requires more data to improve
      • Alternatively, simplify the model by using fewer or less complex features
  • Ideal Learning Curve
    • Model that generalizes to new data
    • Testing and training learning curves converge at similar values
    • The smaller the gap, the better our model generalizes (a quick diagnostic sketch follows below)
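To make these rules of thumb concrete, here is a minimal, hypothetical helper (not part of the original notebook; the 0.1 gap and 0.7 score thresholds are arbitrary illustrations) that labels the train_scores and test_scores arrays returned by scikit-learn's learning_curve according to the three cases above.

import numpy as np

def diagnose_from_curve(train_scores, test_scores, gap_threshold=0.1, low_score_threshold=0.7):
    # average over the cross-validation folds at the largest training-set size
    final_train = np.mean(train_scores, axis=1)[-1]
    final_test = np.mean(test_scores, axis=1)[-1]
    gap = final_train - final_test

    if gap > gap_threshold:
        return "large gap between training and cross-validation scores -> likely high variance"
    if final_test < low_score_threshold:
        return "scores converged but are low -> likely high bias"
    return "scores converged at a high value -> model generalizes well"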

Example 1: High Bias

  • In this example, we'll use a linear learner on quadratic data
  • The result is high bias
  • We'll get a low score (high error)
In [1]:
# imports
from sklearn.linear_model import LinearRegression
# learning_curve and KFold live in sklearn.model_selection in current versions of scikit-learn
from sklearn.model_selection import learning_curve, KFold
import matplotlib.pyplot as plt
from sklearn.metrics import explained_variance_score, make_scorer
import numpy as np
In [2]:
size = 1000
# 3-fold cross-validation with shuffling (the data itself is passed to learning_curve later)
cv = KFold(n_splits=3, shuffle=True)

Create X array

In [3]:
# np.reshape(array, new_shape)

# new shape (-1, 1)
# -1 tells numpy to infer that dimension from the array's size, hence 1000
# this creates a 1000 x 1 array
X = np.reshape(np.random.normal(scale=2,size=size),(-1,1))
X.shape
Out[3]:
(1000, 1)
In [4]:
# np.random.normal(scale=2, size=size) creates a flat array of shape (1000,)
# scale=2 is the standard deviation of the distribution
np.random.normal(scale=2,size=size).shape
Out[4]:
(1000,)

Create y array

In [5]:
# quadratic target: y = 1 - 2x + x^2 = (x - 1)^2
y = np.array([[1 - 2*x[0] + x[0]**2] for x in X])
y.shape
Out[5]:
(1000, 1)
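
As a quick sanity check (not part of the original notebook), we can fit a plain LinearRegression to this quadratic target and look at its R^2 score on the training data itself; a straight line cannot capture the curvature of y = (x - 1)^2, so even the training score stays well below 1, which is the high bias the learning curve will show.

from sklearn.linear_model import LinearRegression

# uses the X and y arrays created above
lin = LinearRegression()
lin.fit(X, y)

# R^2 on the data the model was trained on; the straight line still scores well below 1
print("training R^2 of the linear fit:", lin.score(X, y))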

Plot learning curve

In [6]:
def plot_curve():
    # instantiate
    lg = LinearRegression()

    # fit
    lg.fit(X, y)
    
    
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.

    title : string
        Title for the chart.

    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.

    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum yvalues plotted.

    cv : integer, cross-validation generator, optional
        If an integer is passed, it is the number of folds (defaults to 3).
        Specific cross-validation objects can be passed, see the
        sklearn.model_selection module for the list of possible objects

    n_jobs : integer, optional
        Number of jobs to run in parallel (default 1).
        
    x1 = np.linspace(0, 10, 8, endpoint=True) produces
        8 evenly spaced points in the range 0 to 10
    """
    
    train_sizes, train_scores, test_scores = learning_curve(lg, X, y, n_jobs=-1, cv=cv, train_sizes=np.linspace(.1, 1.0, 5), verbose=0)

    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    
    plt.figure()
    plt.title("RandomForestClassifier")
    plt.legend(loc="best")
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    plt.gca().invert_yaxis()
    
    # box-like grid
    plt.grid()
    
    # plot the std deviation as a transparent range at each training set size
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std, train_scores_mean + train_scores_std, alpha=0.1, color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std, alpha=0.1, color="g")
    
    # plot the average training and test score lines at each training set size
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r", label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g", label="Cross-validation score")
    plt.legend(loc="best")
    
    # fix the y-axis range for readability and display the plot
    # scores are shown from -0.1 to 1.1
    plt.ylim(-.1,1.1)
    plt.show()
In [8]:
%matplotlib inline
plot_curve()

Compared to the theory we covered, here our y-axis is 'score', not 'error', so the higher the score, the better the performance of the model.

  • Training score (red line) decreases and plateaus
    • Indicates underfitting
    • High bias
  • Cross-validation score (green line) stagnates throughout
    • The model is unable to learn from the data
  • Low scores (high errors)
    • We should tweak the model, perhaps by increasing model complexity (see the sketch below)
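
One way to add the missing complexity (a sketch, not part of the original notebook) is to expand the input with PolynomialFeatures before the linear fit; degree-2 features can represent the quadratic target exactly, so both curves should now converge at a high score.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# degree-2 polynomial expansion followed by the same linear model
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())

# reuse the learning-curve machinery from above on the more complex model
sizes_poly, train_scores_poly, test_scores_poly = learning_curve(
    poly_model, X, y, cv=cv, n_jobs=-1, train_sizes=np.linspace(.1, 1.0, 5))

print("final mean training score:", np.mean(train_scores_poly, axis=1)[-1])
print("final mean cross-validation score:", np.mean(test_scores_poly, axis=1)[-1])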

Example 2: High Variance

  • Noisy data and complex model


There are no inline notes here since the code is essentially the same as above and has already been explained.

In [12]:
from sklearn.tree import DecisionTreeRegressor

# 1000 x 2 feature matrix, rounded to 2 decimal places
X = np.round(np.reshape(np.random.normal(scale=5, size=2*size), (-1, 2)), 2)
# non-linear target built from both features
y = np.array([[np.sin(x[0] + np.sin(x[1]))] for x in X])

def plot_curve():
    # instantiate
    dt = DecisionTreeRegressor()

    # fit
    dt.fit(X, y)
    
    
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.

    title : string
        Title for the chart.

    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.

    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum yvalues plotted.

    cv : integer, cross-validation generator, optional
        If an integer is passed, it is the number of folds (defaults to 3).
        Specific cross-validation objects can be passed, see the
        sklearn.model_selection module for the list of possible objects

    n_jobs : integer, optional
        Number of jobs to run in parallel (default 1).
        
    x1 = np.linspace(0, 10, 8, endpoint=True) produces
        8 evenly spaced points in the range 0 to 10
    """
    
    train_sizes, train_scores, test_scores = learning_curve(dt, X, y, n_jobs=-1, cv=cv, train_sizes=np.linspace(.1, 1.0, 5), verbose=0)

    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    
    plt.figure()
    plt.title("RandomForestClassifier")
    plt.legend(loc="best")
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    plt.gca().invert_yaxis()
    
    # box-like grid
    plt.grid()
    
    # plot the std deviation as a transparent range at each training set size
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std, train_scores_mean + train_scores_std, alpha=0.1, color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std, alpha=0.1, color="g")
    
    # plot the average training and test score lines at each training set size
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r", label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g", label="Cross-validation score")
    plt.legend(loc="best")
    
    # fix the y-axis range for readability and display the plot
    # scores are shown from -0.1 to 1.1
    plt.ylim(-.1,1.1)
    plt.show()

plot_curve()

Compared to the theory we covered, here our y-axis is 'score', not 'error', so the higher the score, the better the performance of the model.

  • Training score (red line) is at its maximum regardless of the number of training examples
    • This shows severe overfitting
  • Cross-validation score (green line) increases with more training examples
  • The huge gap between the cross-validation score and the training score indicates a high-variance scenario
    • Reduce the complexity of the model or gather more data (see the sketch below)
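
As a sketch of the first remedy (not part of the original notebook; max_depth=5 is an arbitrary example value), limiting the depth of the decision tree reduces its complexity, which typically narrows the gap between training and cross-validation scores.

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# fully grown tree: memorizes the training data, so high variance
full_tree = DecisionTreeRegressor()

# depth-limited tree: a simpler model with lower variance
shallow_tree = DecisionTreeRegressor(max_depth=5)

# compare training fit against cross-validated performance for both models
for name, model in [("full tree", full_tree), ("max_depth=5 tree", shallow_tree)]:
    model.fit(X, y.ravel())
    train_r2 = model.score(X, y.ravel())
    cv_r2 = cross_val_score(model, X, y.ravel(), cv=cv).mean()
    print(name, "- training R^2:", round(train_r2, 3), "| mean cross-validation R^2:", round(cv_r2, 3))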