Machine Learning introduction by Data School
06_linear_regression

## Topics

1. Types of supervised learning
2. Reading data using pandas
3. Visualizing data using seaborn
4. Linear regression pros and cons
5. Form of linear regression
6. Preparing X and y using pandas
7. Splitting X and y into training and testing sets
8. Linear regression in scikit-learn
9. Interpreting model coefficients
10. Making predictions
11. Model evaluation metrics for regression
12. Computing the RMSE for our Sales predictions
13. Feature selection
14. Resources

This tutorial is derived from Data School's Machine Learning with scikit-learn tutorial. I added my own notes so anyone, including myself, can refer to this tutorial without watching the videos.

### 1. Types of supervised learning

• Classification: Predict a categorical response
• Regression: Predict a continuous response

### 2. Reading data using pandas

Pandas: popular Python library for data exploration, manipulation, and analysis

In [1]:
# conventional way to import pandas
import pandas as pd

In [2]:
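# read a CSV file into a DataFrame; read_csv accepts a local path or a URL
# to see the docstring in a notebook, put the cursor inside the parentheses and press Shift+Tab (twice)
# index_col=0 uses the first column of the file as the row index
# NOTE: the source location is not shown in these notes; 'Advertising.csv' is a placeholder path
data = pd.read_csv('Advertising.csv', index_col=0)

# display the first 5 rows
data.head()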

Out[2]:
      TV  Radio  Newspaper  Sales
1  230.1   37.8       69.2   22.1
2   44.5   39.3       45.1   10.4
3   17.2   45.9       69.3    9.3
4  151.5   41.3       58.5   18.5
5  180.8   10.8       58.4   12.9

Primary object types:

• DataFrame: rows and columns (like a spreadsheet or matrix)
    • The first row shown is always the column headers
    • The first column is the index
• Series: a single column (vector)
In [3]:
# display the last 5 rows
data.tail()

Out[3]:
        TV  Radio  Newspaper  Sales
196   38.2    3.7       13.8    7.6
197   94.2    4.9        8.1    9.7
198  177.0    9.3        6.4   12.8
199  283.6   42.0       66.2   25.5
200  232.1    8.6        8.7   13.4
In [4]:
# check the shape of the DataFrame (rows, columns)
# there are 200 rows x 4 columns
data.shape

Out[4]:
(200, 4)

What are the features?

• TV: advertising dollars spent on TV for a single product in a given market (in thousands of dollars)
• Radio: advertising dollars spent on Radio
• Newspaper: advertising dollars spent on Newspaper

What is the response?

• Sales: sales of a single product in a given market (in thousands of items)

What else do we know?

• Because the response variable is continuous, this is a regression problem.
• There are 200 observations (represented by the rows), and each observation is a single market.

### 3. Visualizing data using seaborn

Seaborn: Python library for statistical data visualization built on top of Matplotlib

In [5]:
# conventional way to import seaborn
import seaborn as sns

# allow plots to appear within the notebook
%matplotlib inline

In [6]:
# visualize the relationship between each feature and the response using scatterplots
# this produces one scatterplot per feature, with Sales on the y-axis
# use height= (called size= in seaborn versions before 0.9) and aspect= to control the size of each plot
# use kind='reg' to overlay a fitted regression line on each plot
sns.pairplot(data, x_vars=['TV', 'Radio', 'Newspaper'], y_vars='Sales', height=7, aspect=0.7, kind='reg')

Out[6]:
<seaborn.axisgrid.PairGrid at 0x119e68198>

Linear regression

• Strong relationship between TV ads and sales
• Very weak to no relationship between Newspaper ads and sales

### 4. Linear regression pros and cons

Pros:

• Fast
• No tuning required
• Highly interpretable
• Well-understood

Cons:

• Unlikely to produce the best predictive accuracy
• Presumes a linear relationship between the features and response
• If the true relationship is highly non-linear, as it is in many real-world scenarios, a linear model cannot capture it well and its predictions will be inaccurate
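
To make the last point concrete, here is a minimal sketch in plain Python (made-up data, not part of the original tutorial) that fits a straight line to a purely quadratic relationship. The helper name `fit_line` is my own and only handles a single feature.

```python
import math

def fit_line(xs, ys):
    """Least-squares fit of y = b0 + b1 * x for a single feature."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_mean - b1 * x_mean
    return b0, b1

# made-up data with a purely quadratic relationship: y = x^2
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

b0, b1 = fit_line(xs, ys)
preds = [b0 + b1 * x for x in xs]
rmse = math.sqrt(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))

print(b0, b1)  # the best straight line is flat: intercept 4, slope 0
print(rmse)    # about 3.46, even though y is fully determined by x
```

The fitted line ignores x entirely, so the model predicts the mean of y everywhere. This is the kind of failure the last bullet describes.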

### 5. Form of linear regression

$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n$

• $y$ is the response
• $\beta_0$ is the intercept
• $\beta_1$ is the coefficient for $x_1$ (the first feature)
• $\beta_n$ is the coefficient for $x_n$ (the nth feature)

In this case:

$y = \beta_0 + \beta_1 \times TV + \beta_2 \times Radio + \beta_3 \times Newspaper$

The $\beta$ values are called the model coefficients

• These values are "learned" during the model fitting step using the "least squares" criterion
• Then, the fitted model can be used to make predictions
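
As a rough illustration of the "least squares" criterion, the sketch below fits a single-feature line to the first five rows shown earlier (assuming the first column of that output is TV and the last is Sales) and checks that moving either coefficient away from the fitted value increases the sum of squared errors. It uses the closed-form solution for one feature, which is a simplification of what scikit-learn does for several features, so the numbers will not match the tutorial's coefficients.

```python
# first five (TV, Sales) pairs from the data shown earlier
tv = [230.1, 44.5, 17.2, 151.5, 180.8]
sales = [22.1, 10.4, 9.3, 18.5, 12.9]

def sum_squared_errors(b0, b1):
    """Sum of squared residuals for the line sales = b0 + b1 * tv."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(tv, sales))

# closed-form least-squares solution for a single feature
n = len(tv)
tv_mean = sum(tv) / n
sales_mean = sum(sales) / n
b1 = (sum((x - tv_mean) * (y - sales_mean) for x, y in zip(tv, sales))
      / sum((x - tv_mean) ** 2 for x in tv))
b0 = sales_mean - b1 * tv_mean

best = sum_squared_errors(b0, b1)
print(b0, b1, best)

# nudging either coefficient away from the fitted value increases the error
print(sum_squared_errors(b0 + 0.5, b1) > best)    # True
print(sum_squared_errors(b0, b1 + 0.01) > best)   # True
```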

### 6. Preparing X and y using pandas

• scikit-learn expects X (feature matrix) and y (response vector) to be NumPy arrays
• However, pandas is built on top of NumPy
• Thus, X can be a pandas DataFrame (matrix) and y can be a pandas Series (vector)
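
A small sketch of why this works, using a three-row stand-in for the advertising data (the real `data` object comes from the CSV file). `to_numpy()` exposes the underlying values as a NumPy array, which is roughly the conversion scikit-learn performs when it receives pandas objects.

```python
import numpy as np
import pandas as pd

# tiny stand-in for the advertising data (first three rows shown earlier)
data = pd.DataFrame(
    {'TV': [230.1, 44.5, 17.2],
     'Radio': [37.8, 39.3, 45.9],
     'Sales': [22.1, 10.4, 9.3]},
    index=[1, 2, 3],
)

X = data[['TV', 'Radio']]   # DataFrame: 2-dimensional
y = data['Sales']           # Series: 1-dimensional

# both expose their values as NumPy arrays
X_array = X.to_numpy()
y_array = y.to_numpy()

print(type(X_array), X_array.shape)
print(type(y_array), y_array.shape)
```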
In [7]:
# create a Python list of feature names
feature_cols = ['TV', 'Radio', 'Newspaper']

# use the list to select a subset of the original DataFrame
X = data[feature_cols]

# equivalent command to do this in one line using double square brackets
# inner brackets define a list
# outer brackets select a subset of the original DataFrame
X = data[['TV', 'Radio', 'Newspaper']]

# print the first 5 rows
X.head()

Out[7]:
      TV  Radio  Newspaper
1  230.1   37.8       69.2
2   44.5   39.3       45.1
3   17.2   45.9       69.3
4  151.5   41.3       58.5
5  180.8   10.8       58.4
In [8]:
# check the type and shape of X
print(type(X))
print(X.shape)

<class 'pandas.core.frame.DataFrame'>
(200, 3)

In [9]:
# select a Series from the DataFrame
y = data['Sales']

# equivalent command that works if there are no spaces in the column name
# you can select the Sales as an attribute of the DataFrame
y = data.Sales

# print the first 5 values
y.head()

Out[9]:
1    22.1
2    10.4
3     9.3
4    18.5
5    12.9
Name: Sales, dtype: float64
In [10]:
# check the type and shape of y
print(type(y))
print(y.shape)

<class 'pandas.core.series.Series'>
(200,)


### 7. Splitting X and y into training and testing sets

In [11]:
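# import train_test_split (sklearn.cross_validation was removed in scikit-learn 0.20; use sklearn.model_selection)
from sklearn.model_selection import train_test_split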
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [12]:
# default split is 75% for training and 25% for testing
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(150, 3)
(50, 3)
(150,)
(50,)
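
To show what the split does conceptually, here is a minimal plain-Python sketch. `simple_train_test_split` is a hypothetical helper of my own, not scikit-learn's implementation, and it will not reproduce the same shuffle as `train_test_split(..., random_state=1)`. It only mirrors the idea: shuffle the rows, then hold out 25% for testing.

```python
import math
import random

def simple_train_test_split(rows, test_fraction=0.25, seed=1):
    """Shuffle the row indices, then hold out a fraction of them for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(len(rows) * test_fraction)
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(1, 201))   # stand-in for the 200 market indices
train, test = simple_train_test_split(rows)

print(len(train), len(test))          # 150 50
print(set(train).isdisjoint(test))    # True
```

Fixing the seed plays the same role as `random_state=1`: the split is random, but repeatable.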


### 8. Linear regression in scikit-learn

In [13]:
# import model
from sklearn.linear_model import LinearRegression

# instantiate
linreg = LinearRegression()

# fit the model to the training data (learn the coefficients)
linreg.fit(X_train, y_train)

Out[13]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

### 9. Interpreting model coefficients

In [14]:
# print the intercept and coefficients
print(linreg.intercept_)
print(linreg.coef_)

2.87696662232
[ 0.04656457  0.17915812  0.00345046]

In [15]:
# pair the feature names with the coefficients
# the order is hard to remember, so we use Python's zip function to pair each feature name with its coefficient
# in Python 3, zip returns a lazy iterator, so wrap it in list() to display the pairs
zip(feature_cols, linreg.coef_)

Out[15]:
<zip at 0x11d372448>
$$y = 2.88 + 0.0466 \times TV + 0.179 \times Radio + 0.00345 \times Newspaper$$

How do we interpret the TV coefficient (0.0466)?

• For a given amount of Radio and Newspaper ad spending, a "unit" increase in TV ad spending is associated with a 0.0466 "unit" increase in Sales.
• Or more clearly: For a given amount of Radio and Newspaper ad spending, an additional \$1,000 spent on TV ads is associated with an increase in sales of 46.6 items.

Important notes:

• This is a statement of association, not causation.
• If an increase in TV ad spending were associated with a decrease in sales, $\beta_1$ would be negative.
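
The interpretation can be checked by hand. The sketch below plugs the rounded coefficients from the fitted equation into a small helper (`predict_sales` is a name of my own, not part of scikit-learn) and compares the prediction for the first market in the data with the prediction after one extra "unit" of TV spending.

```python
feature_cols = ['TV', 'Radio', 'Newspaper']
intercept = 2.88
coefs = [0.0466, 0.179, 0.00345]   # rounded values from the fitted equation

# pair each feature name with its coefficient
coef_by_feature = dict(zip(feature_cols, coefs))

def predict_sales(spend):
    """Apply the fitted equation to a dict of ad spending (thousands of dollars)."""
    return intercept + sum(coef_by_feature[name] * spend[name] for name in feature_cols)

market = {'TV': 230.1, 'Radio': 37.8, 'Newspaper': 69.2}   # first row of the data
more_tv = dict(market, TV=market['TV'] + 1)                # one extra "unit" of TV

base = predict_sales(market)
print(round(base, 2))                             # 20.61 (thousands of items)
print(round(predict_sales(more_tv) - base, 4))    # 0.0466, i.e. about 46.6 items
```

The actual Sales value for that market is 22.1, so the prediction is off by roughly 1.5 thousand items.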

### 10. Making predictions

In [16]:
# make predictions on the testing set
y_pred = linreg.predict(X_test)


We need an evaluation metric in order to compare our predictions with the actual values.

### 11. Model evaluation metrics for regression

Evaluation metrics for classification problems, such as accuracy, are not useful for regression problems. Instead, we need evaluation metrics designed for comparing continuous values.

Let's create some example numeric predictions, and calculate three common evaluation metrics for regression problems:

In [17]:
# define true and predicted response values
true = [100, 50, 30, 20]
pred = [90, 50, 50, 30]


Mean Absolute Error (MAE) is the mean of the absolute value of the errors:

$$\frac 1n\sum_{i=1}^n|y_i-\hat{y}_i|$$
In [18]:
# calculate MAE by hand
print((10 + 0 + 20 + 10) / 4)

# calculate MAE using scikit-learn
from sklearn import metrics
print(metrics.mean_absolute_error(true, pred))

10.0
10.0


Mean Squared Error (MSE) is the mean of the squared errors:

$$\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2$$
In [19]:
# calculate MSE by hand
print((10**2 + 0**2 + 20**2 + 10**2) / 4)

# calculate MSE using scikit-learn
print(metrics.mean_squared_error(true, pred))

150.0
150.0


Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors:

$$\sqrt{\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2}$$
In [20]:
# calculate RMSE by hand
import numpy as np
print(np.sqrt(((10**2 + 0**2 + 20**2 + 10**2) / 4)))

# calculate RMSE using scikit-learn
print(np.sqrt(metrics.mean_squared_error(true, pred)))

12.2474487139
12.2474487139


Comparing these metrics:

• MAE is the easiest to understand, because it's the average error.
• MSE is more popular than MAE, because MSE "punishes" larger errors.
• RMSE is even more popular than MSE, because RMSE is expressed in the same units as the response variable ("y" units), which makes it easier to put in context.
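
To see what "punishes larger errors" means, here is a plain-Python sketch of the three formulas (the function names are my own). It reuses the `true` and `pred` values from above, then compares two made-up prediction sets that have the same MAE but very different MSE.

```python
import math

def mae(true, pred):
    """Mean of the absolute errors."""
    return sum(abs(t - p) for t, p in zip(true, pred)) / len(true)

def mse(true, pred):
    """Mean of the squared errors."""
    return sum((t - p) ** 2 for t, p in zip(true, pred)) / len(true)

def rmse(true, pred):
    """Square root of the MSE, in the same units as the response."""
    return math.sqrt(mse(true, pred))

true = [100, 50, 30, 20]
pred = [90, 50, 50, 30]
print(mae(true, pred), mse(true, pred), rmse(true, pred))

# two made-up prediction sets with the same MAE but different error spread
spread_out = [90, 40, 20, 10]   # four errors of 10
one_big = [100, 50, 30, 60]     # a single error of 40
print(mae(true, spread_out), mae(true, one_big))   # 10.0 10.0
print(mse(true, spread_out), mse(true, one_big))   # 100.0 400.0
```

MAE treats the two cases as equally bad, while MSE (and therefore RMSE) rates the single large miss as worse.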

### 12. Computing the RMSE for our Sales predictions

In [21]:
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

1.40465142303


### 13. Feature selection

Does Newspaper "belong" in our model? In other words, does it improve the quality of our predictions?

Let's remove it from the model and check the RMSE!

In [22]:
# create a Python list of feature names (Newspaper removed)
feature_cols = ['TV', 'Radio']

# use the list to select a subset of the original DataFrame
X = data[feature_cols]

# select a Series from the DataFrame
y = data.Sales

# split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# fit the model to the training data (learn the coefficients)
linreg.fit(X_train, y_train)

# make predictions on the testing set
y_pred = linreg.predict(X_test)

# compute the RMSE of our predictions
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

1.38790346994


The RMSE decreased from 1.405 to 1.388 when we removed Newspaper from the model. (Error is something we want to minimize, so a lower RMSE is better.) Thus, Newspaper is unlikely to be useful for predicting Sales and should be removed from the model.
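
The same comparison can be wrapped in a helper so that several feature subsets are easy to try. This is only a sketch: `holdout_rmse` is a name of my own, and the data below is synthetic stand-in data (Sales built from TV and Radio plus noise, with Newspaper as pure noise), so the printed numbers will not match the tutorial's results.

```python
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def holdout_rmse(data, feature_cols, response='Sales'):
    """Fit on a training split and return the RMSE on the held-out split."""
    X = data[feature_cols]
    y = data[response]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
    linreg = LinearRegression().fit(X_train, y_train)
    y_pred = linreg.predict(X_test)
    return np.sqrt(metrics.mean_squared_error(y_test, y_pred))

# synthetic stand-in data: Sales depends on TV and Radio, Newspaper is pure noise
rng = np.random.default_rng(0)
data = pd.DataFrame({
    'TV': rng.uniform(0, 300, 200),
    'Radio': rng.uniform(0, 50, 200),
    'Newspaper': rng.uniform(0, 100, 200),
})
data['Sales'] = 3 + 0.05 * data['TV'] + 0.2 * data['Radio'] + rng.normal(0, 1.5, 200)

# compare the test RMSE for several candidate feature subsets
for cols in (['TV', 'Radio', 'Newspaper'], ['TV', 'Radio'], ['TV']):
    print(cols, round(holdout_rmse(data, cols), 3))
```

Keep in mind that a small RMSE difference on a single train/test split can be due to chance, so comparisons like this are more convincing when the gap is large or holds across several splits.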

### 14. Resources

Linear regression:

Pandas:

Seaborn:
