## Topics

- Types of supervised learning
- Reading data using pandas
- Visualizing data using seaborn
- Linear regression pros and cons
- Form of linear regression
- Preparing X and y using pandas
- Splitting X and y into training and testing sets
- Linear regression in scikit-learn
- Interpreting model coefficients
- Making predictions
- Model evaluation metrics for regression
- Computing the RMSE for our Sales predictions
- Feature selection
- Resources

*This tutorial is derived from Data School's Machine Learning with scikit-learn tutorial. I added my own notes so anyone, including myself, can refer to this tutorial without watching the videos.*

### 1. Types of supervised learning

- **Classification:** Predict a categorical response
- **Regression:** Predict a continuous response

### 2. Reading data using pandas

**Pandas:** popular Python library for data exploration, manipulation, and analysis

- Anaconda users: pandas is already installed
- Other users: installation instructions

```
# conventional way to import pandas
import pandas as pd
```

```
# read the CSV file directly from a URL and save the results
# pd.read_csv accepts either a local file path or a URL
# to find out more about this function in a notebook, place the cursor inside the call and press Shift+Tab (twice)
# index_col=0 uses the first column of the file as the row index
data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)
# display the first 5 rows
data.head()
```

Primary object types:

**DataFrame:** rows and columns (like a spreadsheet or matrix)

- First row will always be the column headers
- First column is an index

**Series:** a single column (vector)
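
To make the two object types concrete, here is a minimal sketch using a tiny hand-made DataFrame. The name `toy` and its values are made up for illustration and are not taken from the advertising data.

```python
import pandas as pd

# hypothetical toy data, not the Advertising dataset
toy = pd.DataFrame({'TV': [10.0, 20.0, 30.0], 'Sales': [1.0, 2.0, 3.0]})

# selecting one column by name gives a Series (one dimension)
col = toy['Sales']
# selecting a list of columns gives a DataFrame (two dimensions)
sub = toy[['Sales']]

print(type(toy).__name__, toy.shape)
print(type(col).__name__, col.shape)
print(type(sub).__name__, sub.shape)
```

The shapes show the difference: the Series has a single dimension, while the one-column DataFrame still has rows and columns.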

```
# display the last 5 rows
data.tail()
```

```
# check the shape of the DataFrame (rows, columns)
# there are 200 rows x 4 columns
data.shape
```

What are the features?

- **TV:** advertising dollars spent on TV for a single product in a given market (in thousands of dollars)
- **Radio:** advertising dollars spent on Radio
- **Newspaper:** advertising dollars spent on Newspaper

What is the response?

- **Sales:** sales of a single product in a given market (in thousands of items)

What else do we know?

- Because the response variable is continuous, this is a **regression** problem.
- There are 200 **observations** (represented by the rows), and each observation is a single market.

### 3. Visualizing data using seaborn

**Seaborn:** Python library for statistical data visualization built on top of Matplotlib

- Anaconda users: run `conda install seaborn` from the command line

- Other users: installation instructions

```
# conventional way to import seaborn
import seaborn as sns
# allow plots to appear within the notebook
%matplotlib inline
```

```
# visualize the relationship between the features and the response using scatterplots
# this produces one scatterplot per feature, each plotted against Sales
# use height= and aspect= to control the size of the plots
# (seaborn 0.9 and later use height= in place of the older size= argument)
# use kind='reg' to add a fitted regression line to each plot
sns.pairplot(data, x_vars=['TV', 'Radio', 'Newspaper'], y_vars='Sales', height=7, aspect=0.7, kind='reg')
```

What the fitted regression lines suggest:

- Strong relationship between TV ads and sales
- Weak relationship between Radio ads and sales
- Very weak to no relationship between Newspaper ads and sales

### 4. Linear regression pros and cons

**Pros:**

- Fast
- No tuning required
- Highly interpretable
- Well-understood

**Cons:**

- Unlikely to produce the best predictive accuracy
- Presumes a linear relationship between the features and response
- If the relationship is highly non-linear, as it is in many scenarios, a linear model will not capture it and its predictions will be inaccurate
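
The last point can be illustrated with a small sketch on synthetic data. The quadratic response below is made up for the purpose, and `np.polyfit` stands in for a full regression model; this is not part of the original tutorial.

```python
import numpy as np

# synthetic, made-up data: the response is quadratic in x
x = np.linspace(-3, 3, 50)
y_quad = x ** 2

# fit a straight line (degree 1) and a quadratic (degree 2) by least squares
line = np.polyfit(x, y_quad, deg=1)
quad = np.polyfit(x, y_quad, deg=2)

# compare how far each fit is from the true response
rmse_line = np.sqrt(np.mean((np.polyval(line, x) - y_quad) ** 2))
rmse_quad = np.sqrt(np.mean((np.polyval(quad, x) - y_quad) ** 2))
print(rmse_line, rmse_quad)
```

The straight line leaves a large error because no line can follow the curve, while the quadratic fit recovers it almost exactly.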

### 5. Form of linear regression

$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n$

- $y$ is the response
- $\beta_0$ is the intercept
- $\beta_1$ is the coefficient for $x_1$ (the first feature)
- $\beta_n$ is the coefficient for $x_n$ (the nth feature)

In this case:

$y = \beta_0 + \beta_1 \times TV + \beta_2 \times Radio + \beta_3 \times Newspaper$

The $\beta$ values are called the **model coefficients**

- These values are "learned" during the model fitting step using the "least squares" criterion
- Then, the fitted model can be used to make predictions
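
As a rough sketch of what "learning" the coefficients by least squares means, the example below recovers known coefficients from synthetic data using NumPy's `np.linalg.lstsq`. The data and the names `X_demo`, `y_demo`, and `X_design` are assumptions made for illustration; scikit-learn is used for the real fit later in this tutorial.

```python
import numpy as np

# synthetic data generated from y = 2 + 3*x1 - 1*x2 (no noise), for illustration only
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(50, 2))
y_demo = 2 + 3 * X_demo[:, 0] - 1 * X_demo[:, 1]

# add a column of ones so the first fitted value is the intercept beta_0
X_design = np.column_stack([np.ones(len(X_demo)), X_demo])

# least squares: choose beta to minimize the sum of squared residuals
beta, *_ = np.linalg.lstsq(X_design, y_demo, rcond=None)
print(beta)
```

Because the synthetic response has no noise, the fitted values match the intercept and coefficients used to generate it, up to floating-point error.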

### 6. Preparing X and y using pandas

- scikit-learn expects X (feature matrix) and y (response vector) to be NumPy arrays
- However, pandas is built on top of NumPy
- Thus, X can be a pandas DataFrame (matrix) and y can be a pandas Series (vector)

```
# create a Python list of feature names
feature_cols = ['TV', 'Radio', 'Newspaper']
# use the list to select a subset of the original DataFrame
X = data[feature_cols]
# equivalent command to do this in one line using double square brackets
# the inner brackets define a Python list
# the outer brackets select a subset of the original DataFrame
X = data[['TV', 'Radio', 'Newspaper']]
# print the first 5 rows
X.head()
```

```
# check the type and shape of X
print(type(X))
print(X.shape)
```

```
# select a Series from the DataFrame
y = data['Sales']
# equivalent command that works if there are no spaces in the column name
# you can select the Sales as an attribute of the DataFrame
y = data.Sales
# print the first 5 values
y.head()
```

```
# check the type and shape of y
print(type(y))
print(y.shape)
```

### 7. Splitting X and y into training and testing sets

```
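# import (train_test_split lives in sklearn.model_selection;
# the older sklearn.cross_validation module was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split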
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
```

```
# default split is 75% for training and 25% for testing
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)
```
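
The split proportions can be changed with the `test_size` parameter. The sketch below uses made-up arrays with 200 rows (the names `X_demo` and `y_demo` are placeholders) and assumes a scikit-learn version that provides `sklearn.model_selection`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 200 made-up rows, matching the number of observations in the dataset
X_demo = np.arange(400).reshape(200, 2)
y_demo = np.arange(200)

# default split: 75% training, 25% testing
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)
print(X_tr.shape, X_te.shape)

# explicit 80/20 split via test_size
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)
print(X_tr2.shape, X_te2.shape)
```

Fixing `random_state` makes the split reproducible, which matters when comparing models on the same testing set.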

### 8. Linear regression in scikit-learn

```
# import model
from sklearn.linear_model import LinearRegression
# instantiate
linreg = LinearRegression()
# fit the model to the training data (learn the coefficients)
linreg.fit(X_train, y_train)
```

### 9. Interpreting model coefficients

```
# print the intercept and coefficients
print(linreg.intercept_)
print(linreg.coef_)
```

```
# pair the feature names with the coefficients
# the order is hard to remember, so we use Python's zip function to pair them up
# (in Python 3, zip returns an iterator, so wrap it in list() to display the pairs)
list(zip(feature_cols, linreg.coef_))
```

How do we interpret the **TV coefficient** (0.0466)?

- For a given amount of Radio and Newspaper ad spending, **a "unit" increase in TV ad spending** is associated with a **0.0466 "unit" increase in Sales**.
- Or more clearly: For a given amount of Radio and Newspaper ad spending, **an additional $1,000 spent on TV ads** is associated with an **increase in sales of 46.6 items**.

Important notes:

- This is a statement of **association**, not **causation**.
- If an increase in TV ad spending was associated with a **decrease** in sales, $\beta_1$ would be **negative**.
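
The unit conversion behind the "46.6 items" figure can be checked with a few lines of arithmetic. The variable names are made up for illustration; the coefficient is the one quoted in the text.

```python
# TV coefficient quoted in the text
tv_coef = 0.0466

# TV spending is measured in thousands of dollars,
# and Sales is measured in thousands of items
extra_tv_spend = 1                               # one unit = $1,000
sales_change_units = tv_coef * extra_tv_spend    # in thousands of items
sales_change_items = sales_change_units * 1000   # in items

print(round(sales_change_items, 1))
```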

### 10. Making predictions

```
# make predictions on the testing set
y_pred = linreg.predict(X_test)
```

We need an **evaluation metric** in order to compare our predictions with the actual values.

### 11. Model evaluation metrics for regression

Evaluation metrics for classification problems, such as **accuracy**, are not useful for regression problems. Instead, we need evaluation metrics designed for comparing continuous values.

Let's create some example numeric predictions, and calculate **three common evaluation metrics** for regression problems:

```
# define true and predicted response values
true = [100, 50, 30, 20]
pred = [90, 50, 50, 30]
```
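
For reference, the three metrics can be written as follows, where $n$ is the number of observations, $y_i$ are the true values, and $\hat{y}_i$ are the predictions. This notation is an assumption added here; the tutorial itself describes the metrics in words.

$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$

$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$

$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$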

**Mean Absolute Error** (MAE) is the mean of the absolute value of the errors:

```
# calculate MAE by hand
print((10 + 0 + 20 + 10) / 4)
# calculate MAE using scikit-learn
from sklearn import metrics
print(metrics.mean_absolute_error(true, pred))
```

**Mean Squared Error** (MSE) is the mean of the squared errors:

```
# calculate MSE by hand
import numpy as np
print((10**2 + 0**2 + 20**2 + 10**2) / 4)
# calculate MSE using scikit-learn
print(metrics.mean_squared_error(true, pred))
```

**Root Mean Squared Error** (RMSE) is the square root of the mean of the squared errors:

```
# calculate RMSE by hand
import numpy as np
print(np.sqrt(((10**2 + 0**2 + 20**2 + 10**2) / 4)))
# calculate RMSE using scikit-learn
print(np.sqrt(metrics.mean_squared_error(true, pred)))
```

Comparing these metrics:

- **MAE** is the easiest to understand, because it's the average error.
- **MSE** is more popular than MAE, because MSE "punishes" larger errors.
- **RMSE** is even more popular than MSE, because RMSE is interpretable in the "y" units.
    - Easier to put in context as it's the same units as our response variable
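
To see how squaring "punishes" larger errors, the sketch below compares two made-up sets of errors with the same total absolute error. The arrays and the helper functions `mae` and `rmse` are hypothetical and written only for this illustration.

```python
import numpy as np

# made-up errors: both sets have the same total absolute error (40)
errors_even = np.array([10, 10, 10, 10])
errors_spiky = np.array([0, 0, 0, 40])

def mae(errors):
    """Mean of the absolute errors."""
    return np.mean(np.abs(errors))

def rmse(errors):
    """Square root of the mean of the squared errors."""
    return np.sqrt(np.mean(errors ** 2))

print(mae(errors_even), rmse(errors_even))
print(mae(errors_spiky), rmse(errors_spiky))
```

Both sets have the same MAE, but the RMSE of the second set is twice as large because its error is concentrated in a single large miss.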

### 12. Computing the RMSE for our Sales predictions

```
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
```

### 13. Feature selection

Does **Newspaper** "belong" in our model? In other words, does it improve the quality of our predictions?

Let's **remove it** from the model and check the RMSE!

```
# create a Python list of feature names
feature_cols = ['TV', 'Radio']
# use the list to select a subset of the original DataFrame
X = data[feature_cols]
# select a Series from the DataFrame
y = data.Sales
# split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# fit the model to the training data (learn the coefficients)
linreg.fit(X_train, y_train)
# make predictions on the testing set
y_pred = linreg.predict(X_test)
# compute the RMSE of our predictions
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
```

The RMSE **decreased** when we removed Newspaper from the model. (Error is something we want to minimize, so **a lower number for RMSE is better**.) Thus, this feature is unlikely to be useful for predicting Sales and should be removed from the model.
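
When comparing several feature sets, the steps above can be wrapped in a function. The sketch below is one possible way to do that: `train_test_rmse` is a hypothetical helper, and `demo` is a synthetic stand-in for the advertising data with made-up numbers, so its RMSE values will not match the ones from the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def train_test_rmse(df, feature_cols, response='Sales'):
    """Hypothetical helper: fit on a training split, return RMSE on the testing split."""
    X = df[feature_cols]
    y = df[response]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
    linreg = LinearRegression()
    linreg.fit(X_train, y_train)
    y_pred = linreg.predict(X_test)
    return np.sqrt(metrics.mean_squared_error(y_test, y_pred))

# synthetic stand-in for the advertising data (made-up numbers)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'TV': rng.uniform(0, 300, 200),
    'Radio': rng.uniform(0, 50, 200),
    'Newspaper': rng.uniform(0, 100, 200),
})
# in this synthetic data, Sales depends on TV and Radio only
demo['Sales'] = 3 + 0.05 * demo['TV'] + 0.2 * demo['Radio'] + rng.normal(0, 1, 200)

# compare the testing RMSE with and without Newspaper
print(train_test_rmse(demo, ['TV', 'Radio', 'Newspaper']))
print(train_test_rmse(demo, ['TV', 'Radio']))
```

Using the same `random_state` inside the helper keeps the training and testing rows identical across calls, so any difference in RMSE comes from the choice of features.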

### 14. Resources

Linear regression:

- Longer notebook on linear regression by Data School
- Chapter 3 of An Introduction to Statistical Learning and related videos by Hastie and Tibshirani (Stanford)
- Quick reference guide to applying and interpreting linear regression by Data School
- Introduction to linear regression by Robert Nau (Duke)

Pandas:

- Three-part pandas tutorial by Greg Reda
- read_csv and read_table documentation

Seaborn: