## Contents

- Introduction
- Libraries
- Example: Advertising Data
- Questions About the Advertising Data
- Simple Linear Regression
- Estimating ("Learning") Model Coefficients
- Interpreting Model Coefficients
- Using the Model for Prediction
- Plotting the Least Squares Line
- Confidence in our Model
- Hypothesis Testing and p-values
- How Well Does the Model Fit the Data?
- Multiple Linear Regression
- Feature Selection
- Model Evaluation Metrics for Regression
- Model Evaluation Using Train/Test Split
- Handling Categorical Features with Two Categories
- Handling Categorical Features with More than Two Categories
- What Didn't We Cover?
- Resources

*This tutorial is derived from Kevin Markham's tutorial on Linear Regression but modified for compatibility with Python 3.*

## 1. Introduction

**Regression problems** are supervised learning problems in which the response is continuous. **Linear regression** is a technique that is useful for regression problems.

**Classification problems** are supervised learning problems in which the response is categorical.

**Benefits** of linear regression

- widely used
- runs fast
- easy to use (not a lot of tuning required)
- highly interpretable
- basis for many other methods

## 2. Libraries

```
# imports
import pandas as pd
import seaborn as sns
import statsmodels.formula.api as smf
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from sklearn.model_selection import train_test_split
import numpy as np
# allow plots to appear directly in the notebook
%matplotlib inline
```

## 3. Example: Advertising Data

Let's take a look at some data, ask some questions about that data, and then use linear regression to answer those questions!

```
# read data into a DataFrame
data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)
data.head()
```

```
# shape of the DataFrame
data.shape
```

```
# visualize the relationship between the features and the response using scatterplots
sns.pairplot(data, x_vars=['TV','Radio','Newspaper'], y_vars='Sales', height=7, aspect=0.7)
```

## 4. Questions About the Advertising Data

- Let's pretend you work for the company that manufactures and markets this widget
- The company might ask you the following: On the basis of this data, how should we spend our advertising money in the future?
- This general question might lead you to more specific questions:
  - Is there a relationship between ads and sales?
  - How strong is that relationship?
  - Which ad types contribute to sales?
  - What is the effect of each ad type on sales?
  - Given ad spending in a particular market, can sales be predicted?

**We will explore these questions below.**

## 5. Simple Linear Regression

- Simple linear regression is an approach for predicting a **quantitative response** using a **single feature** (or "predictor" or "input variable")
- It takes the following form:

$y = \beta_0 + \beta_1x$

What does each term represent?

- $y$ is the response
- $x$ is the feature
- $\beta_0$ is the intercept
- $\beta_1$ is the coefficient for $x$

$\beta_0$ and $\beta_1$ are called the **model coefficients**.

To create your model, you must "learn" the values of these coefficients. Once we've learned these coefficients, we can use the model to predict Sales.

## 6. Estimating ("Learning") Model Coefficients

- Coefficients are estimated using the **least squares criterion**
- In other words, we find the line (mathematically) which minimizes the **sum of squared residuals** (or "sum of squared errors"):
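$$RSS = \sum_{i=1}^{n}\left(y_i - (\beta_0 + \beta_1 x_i)\right)^2$$

Minimizing this quantity over $\beta_0$ and $\beta_1$ is what "least squares" refers to.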

What elements are present in the diagram?

- The black dots are the **observed values** of x and y
- The blue line is our **least squares line**
- The red lines are the **residuals**, which are the distances between the observed values and the least squares line

How do the model coefficients relate to the least squares line?

- $\beta_0$ is the **intercept** (the value of $y$ when $x$=0)
- $\beta_1$ is the **slope** (the change in $y$ divided by the change in $x$)
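How are these coefficients actually calculated? For simple linear regression, minimizing the sum of squared residuals has a standard closed-form solution:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}$$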


**Let's estimate the model coefficients for the advertising data**

```
### STATSMODELS ###
# create a fitted model
lm1 = smf.ols(formula='Sales ~ TV', data=data).fit()
# print the coefficients
lm1.params
```

```
### SCIKIT-LEARN ###
# create X and y
feature_cols = ['TV']
X = data[feature_cols]
y = data.Sales
# instantiate and fit
lm2 = LinearRegression()
lm2.fit(X, y)
# print the coefficients
print(lm2.intercept_)
print(lm2.coef_)
```

## 7. Interpreting Model Coefficients

Interpreting the TV coefficient ($\beta_1$)

- A "unit" increase in TV ad spending is
**associated with**a 0.047537 "unit" increase in Sales - Or more clearly: An additional $1,000 spent on TV ads is
**associated with**an increase in sales of 47.537 widgets - Note here that the coefficients represent associations, not causations

## 8. Using the Model for Prediction

Let's say that there was a new market where the TV advertising spend was **$50,000**. What would we predict for the Sales in that market?

*We use 50 rather than 50,000 because spending in the original data is recorded in thousands of dollars.*

**8a. Manual Prediction**

```
# manually calculate the prediction
7.032594 + 0.047537*50
```
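This works out to 7.032594 + 2.37685 ≈ 9.409. Since Sales is recorded in thousands of units, that corresponds to roughly 9,409 widgets.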

**8b. Statsmodels Prediction**

```
### STATSMODELS ###
# you have to create a DataFrame since the Statsmodels formula interface expects it
X_new = pd.DataFrame({'TV': [50]})
# predict for a new observation
lm1.predict(X_new)
```

**8c. Scikit-learn Prediction**

```
### SCIKIT-LEARN ###
# predict for a new observation (scikit-learn expects a 2D array: one row, one feature)
lm2.predict([[50]])
```

Thus, we would predict Sales of **9,409 widgets** in that market.

## 9. Plotting the Least Squares Line

```
sns.pairplot(data, x_vars=['TV','Radio','Newspaper'], y_vars='Sales', height=7, aspect=0.7, kind='reg')
```

## 10. Confidence in our Model

**Question:** Is linear regression a high variance/low bias model, or a low variance/high bias model?

**Answer:**

- Low variance/high bias
- Under repeated sampling, the line will stay roughly in the same place (low variance)
- But the average of those models won't do a great job capturing the true relationship (high bias)

- Note that low variance is a useful characteristic when you don't have a lot of training data

A closely related concept is **confidence intervals**

- Statsmodels calculates 95% confidence intervals for our model coefficients, which are interpreted as follows:
  - If the population from which this sample was drawn was **sampled 100 times**
  - Approximately **95 of those confidence intervals** would contain the "true" coefficient

```
### STATSMODELS ###
# print the confidence intervals for the model coefficients
lm1.conf_int()
```

- We only have a **single sample of data**, and not the **entire population of data**
- The "true" coefficient is either within this interval or it isn't, but there's no way to actually know
- We estimate the coefficient with the data we do have, and we show uncertainty about that estimate by giving a range that the coefficient is **probably** within

Note that using 95% confidence intervals is just a convention

- You can create 90% confidence intervals (which will be narrower)
- or 99% confidence intervals (which will be wider)
- or whatever intervals you like (an example follows below)
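In statsmodels, the confidence level is controlled by the `alpha` argument of `conf_int`, where `alpha` is the total tail probability (so `alpha=0.10` yields a 90% interval):

```
### STATSMODELS ###
# 90% confidence intervals (narrower than the default 95%)
lm1.conf_int(alpha=0.10)
# 99% confidence intervals (wider)
lm1.conf_int(alpha=0.01)
```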

## 11. Hypothesis Testing and p-values

**Steps for Hypothesis Testing**

- Start with a **null hypothesis** and an **alternative hypothesis** (that is opposite the null)
- Then, you check whether the data supports **rejecting the null hypothesis** or **failing to reject the null hypothesis**
  - "Failing to reject" the null is not the same as "accepting" the null hypothesis
  - The alternative hypothesis may indeed be true, except that you just don't have enough data to show that

**Conventional hypothesis test**

- **Null hypothesis:**
  - There is no relationship between TV ads and Sales
  - $\beta_1$ equals zero
- **Alternative hypothesis:**
  - There is a relationship between TV ads and Sales
  - $\beta_1$ is not equal to zero

**Testing the hypothesis**

- Reject the null if the 95% confidence interval **does not include zero**
  - There is a relationship
- Fail to reject the null if the 95% confidence interval **includes zero**
  - We cannot conclude that there is a relationship

```
### STATSMODELS ###
# print the p-values for the model coefficients
lm1.pvalues
```

**p-value**

- Loosely speaking, it represents the probability of observing a coefficient estimate at least this far from zero if the true coefficient were actually zero

**Interpreting p-values**

- If the 95% confidence interval **does not include zero**
  - The p-value will be **less than 0.05**
  - Reject the null
  - There is a relationship
- If the 95% confidence interval **includes zero**
  - The p-value for that coefficient will be **greater than 0.05**
  - Fail to reject the null
  - We cannot conclude that there is a relationship

**Notes**

- A p-value less than 0.05 is one way to decide whether there is likely a relationship between the feature and the response
- In this case, the p-value for TV is far less than 0.05
  - There is a low probability that the coefficient is actually zero
  - Reject the null hypothesis
  - **Believe** that there is a relationship between TV ads and Sales
- We generally ignore the p-value for the intercept

## 12. How Well Does the Model Fit the Data?

To evaluate the overall fit of a linear model, we use the **R-squared** value

- R-squared is the **proportion of variance explained** (the formula follows below)
  - It is the proportion of variance in the observed data that is explained by the model, or the reduction in error over the **null model**
  - The null model just predicts the mean of the observed response, and thus it has an intercept and no slope
- R-squared is between 0 and 1
- Higher values are better because they mean that more variance is explained by the model
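In symbols, using $\hat{y}_i$ for the model's predictions and $\bar{y}$ for the mean response (the null model's prediction), the standard definition is:

$$R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$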

Here's an example of what R-squared "looks like":

**Diagram explanation**

- **Blue line** explains some of the variance in the data (R-squared=0.54)
- **Green line** explains more of the variance (R-squared=0.64)
- **Red line** fits the training data even more closely (R-squared=0.66)

Let's calculate the R-squared value for our simple linear model:

```
### STATSMODELS ###
# print the R-squared value for the model
lm1.rsquared
```

```
### SCIKIT-LEARN ###
# print the R-squared value for the model
lm2.score(X, y)
```

**Is that a "good" R-squared value?**

- It's hard to say
- The threshold for a good R-squared value varies widely by domain
- Therefore, it's most useful as a tool for **comparing different models**

## 13. Multiple Linear Regression

Simple linear regression can easily be extended to include multiple features. This is called **multiple linear regression**:

$y = \beta_0 + \beta_1x_1 + ... + \beta_nx_n$

Each $x$ represents a different feature, and each feature has its own coefficient. In this case:

$y = \beta_0 + \beta_1 \times TV + \beta_2 \times Radio + \beta_3 \times Newspaper$

Let's estimate these coefficients:

```
### STATSMODELS ###
# create a fitted model with all three features
lm1 = smf.ols(formula='Sales ~ TV + Radio + Newspaper', data=data).fit()
# print the coefficients
lm1.params
```

```
### SCIKIT-LEARN ###
# create X and y
feature_cols = ['TV', 'Radio', 'Newspaper']
X = data[feature_cols]
y = data.Sales
# instantiate and fit
lm2 = LinearRegression()
lm2.fit(X, y)
# print the coefficients
print(lm2.intercept_)
print(lm2.coef_)
```

```
# pair the feature names with the coefficients
list(zip(feature_cols, lm2.coef_))
```

**Interpreting coefficients**

- For a given amount of Radio and Newspaper ad spending, an **increase of $1,000 in TV ad spending** is associated with an **increase in Sales of 45.765 widgets**.

A lot of the information we have been reviewing piece-by-piece is available in the Statsmodels model summary output:

```
### STATSMODELS ###
# print a summary of the fitted model
lm1.summary()
```

What are a few key things we learn from this output?

- TV and Radio have small **p-values**, whereas Newspaper has a large p-value
  - Reject the null hypothesis for TV and Radio: there is an association between those features and Sales
  - Fail to reject the null hypothesis for Newspaper: there is no detectable association
- TV and Radio ad spending are both **positively associated** with Sales
  - Newspaper ad spending is **slightly negatively associated** with Sales
  - However, this is irrelevant since we failed to reject the null hypothesis for Newspaper
- This model has a higher **R-squared** (0.897) than the previous model
  - This model provides a better fit to the data than a model that only includes TV

## 14. Feature Selection

Deciding **which features to include** in a linear model

- Try different models
- Keep features in the model if they have small p-values
  - Reject the null hypothesis
  - A relationship exists
- Check whether the R-squared value goes up when you add new features

**Drawbacks** to this approach?

- Linear models rely upon a lot of **assumptions**
  - such as features being independent
  - If assumptions are violated (which they usually are), R-squared and p-values are less reliable
- Using a p-value cutoff of 0.05 means that if you add 100 features to a model that are **pure noise**, 5 of them (on average) will still be counted as significant
- R-squared is susceptible to **overfitting**, and thus there is no guarantee that a model with a high R-squared value will generalize. Below is an example:

```
### STATSMODELS ###
# only include TV and Radio in the model
# instantiate and fit model
lm1 = smf.ols(formula='Sales ~ TV + Radio', data=data).fit()
# calculate r-square
lm1.rsquared
```

```
# add Newspaper to the model (which we believe has no association with Sales)
lm1 = smf.ols(formula='Sales ~ TV + Radio + Newspaper', data=data).fit()
lm1.rsquared
```

**Issues with R-squared**

- **R-squared will always increase as you add more features to the model**, even if they are unrelated to the response
- Selecting the model with the highest R-squared is not a reliable approach for choosing the best linear model

**Solution**

- **Adjusted R-squared**
  - Penalizes model complexity (to control for overfitting), but it generally under-penalizes complexity (a statsmodels example follows below)
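A fitted statsmodels model exposes this as the `rsquared_adj` attribute, so you can compare it across the two fits above:

```
### STATSMODELS ###
# adjusted R-squared for the current model (TV + Radio + Newspaper)
lm1.rsquared_adj
```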

**Better Solution**

- **Train/test split** or **cross-validation**
  - More reliable estimate of out-of-sample error
  - Better for choosing which of your models will best **generalize** to out-of-sample data
- There is extensive functionality for cross-validation in scikit-learn, including automated methods for searching different sets of parameters and different models
- Importantly, cross-validation can be applied to **any model**, whereas the methods described above only apply to **linear models**

## 15. Model Evaluation Metrics for Regression

For classification problems, we have only used classification accuracy as our evaluation metric. What metrics can we use for regression problems?

**Mean Absolute Error** (MAE) is the mean of the absolute value of the errors:

$$MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$$

**Mean Squared Error** (MSE) is the mean of the squared errors:

$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

**Root Mean Squared Error** (RMSE) is the square root of the mean of the squared errors:

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$$

Let's calculate these by hand, to get an intuitive sense for the results:

```
# define true and predicted response values
y_true = [100, 50, 30, 20]
y_pred = [90, 50, 50, 30]
# calculate MAE, MSE, RMSE
print(metrics.mean_absolute_error(y_true, y_pred))
print(metrics.mean_squared_error(y_true, y_pred))
print(np.sqrt(metrics.mean_squared_error(y_true, y_pred)))
```
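Working these out by hand: the absolute errors are 10, 0, 20, and 10, so MAE = (10 + 0 + 20 + 10)/4 = 10; the squared errors are 100, 0, 400, and 100, so MSE = 600/4 = 150; and RMSE = $\sqrt{150} \approx 12.25$.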

MSE is more popular than MAE because MSE "punishes" larger errors. But RMSE is even more popular than MSE because RMSE is interpretable in the units of the response variable.

## 16. Model Evaluation Using Train/Test Split

Let's use train/test split with RMSE to see whether Newspaper should be kept in the model:

```
# include Newspaper
X = data[['TV', 'Radio', 'Newspaper']]
y = data.Sales
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# Instantiate model
lm2 = LinearRegression()
# Fit Model
lm2.fit(X_train, y_train)
# Predict
y_pred = lm2.predict(X_test)
# RMSE
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
```

```
# exclude Newspaper
X = data[['TV', 'Radio']]
y = data.Sales
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
# Instantiate model
lm2 = LinearRegression()
# Fit model
lm2.fit(X_train, y_train)
# Predict
y_pred = lm2.predict(X_test)
# RMSE
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
```
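You should find that the RMSE is slightly lower when Newspaper is excluded. Since error is something we want to minimize, the lower RMSE supports removing Newspaper from the model.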

## 17. Handling Categorical Features with Two Categories

Up to now, all of our features have been numeric. What if one of our features was categorical?

Let's create a new feature called **Size**, and randomly assign observations to be **small or large**:

```
# set a seed for reproducibility
np.random.seed(12345)
# create a Series of booleans in which roughly half are True
nums = np.random.rand(len(data))
mask_large = nums > 0.5
# initially set Size to small, then change roughly half to be large
data['Size'] = 'small'
# .loc is a purely label-based indexer: set Size to 'large' for the masked rows
data.loc[mask_large, 'Size'] = 'large'
data.head()
```

For scikit-learn, we need to represent all data **numerically**

- If the feature only has two categories, we can simply create a **dummy variable** that represents the categories as a binary value:

```
# create a new Series called Size_large
data['Size_large'] = data.Size.map({'small':0, 'large':1})
data.head()
```

Let's redo the multiple linear regression and include the **Size_large** feature:

```
# create X and y
feature_cols = ['TV', 'Radio', 'Newspaper', 'Size_large']
X = data[feature_cols]
y = data.Sales
# instantiate
lm2 = LinearRegression()
# fit
lm2.fit(X, y)
# print coefficients
list(zip(feature_cols, lm2.coef_))
```

Interpreting the **Size_large coefficient**

- For a given amount of TV/Radio/Newspaper ad spending, being a large market is associated with an average **increase** in Sales of 57.42 widgets (as compared to a small market, which is called the **baseline level**)
- What if we had reversed the 0/1 coding and created the feature 'Size_small' instead?
  - The coefficient would be the same, except it would be **negative instead of positive** (the quick check below demonstrates this)
  - As such, your choice of category for the baseline does not matter; all that changes is your **interpretation** of the coefficient
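Here is a quick check of that claim (a sketch; `Size_small` is a hypothetical reversed dummy, not part of the original data):

```
# hypothetical reversed coding: 1 for small, 0 for large
data['Size_small'] = data.Size.map({'small':1, 'large':0})
# refit with Size_small in place of Size_large
lm3 = LinearRegression()
lm3.fit(data[['TV', 'Radio', 'Newspaper', 'Size_small']], data.Sales)
# the Size_small coefficient matches Size_large with the sign flipped
list(zip(['TV', 'Radio', 'Newspaper', 'Size_small'], lm3.coef_))
```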

## 18. Handling Categorical Features with More than Two Categories

Let's create a new feature called **Area**, and randomly assign observations to be **rural, suburban, or urban**:

```
# set a seed for reproducibility
np.random.seed(123456)
# assign roughly one third of observations to each group
nums = np.random.rand(len(data))
mask_suburban = (nums > 0.33) & (nums < 0.66)
mask_urban = nums > 0.66
data['Area'] = 'rural'
# .loc is a purely label-based indexer: assign Area for the masked rows
data.loc[mask_suburban, 'Area'] = 'suburban'
data.loc[mask_urban, 'Area'] = 'urban'
data.head()
```

**Ordered vs Unordered Categories**

- We have to represent Area numerically
  - We cannot code it as 0=rural, 1=suburban, 2=urban because that would imply an **ordered relationship** between suburban and urban
  - Urban would be somehow "twice" the suburban category
- Ordered categories
  - e.g., strongly disagree, disagree, neutral, agree, strongly agree
  - Can use a single dummy variable and represent the categories numerically (such as 1, 2, 3, 4, 5)

Our Area feature is unordered, so we have to create **additional dummy variables**. Let's explore how to do this using pandas:

```
# create three dummy variables using get_dummies
pd.get_dummies(data.Area, prefix='Area').head()
```

However, we actually only need **two dummy variables, not three**. Why? Because two dummies capture all of the "information" about the Area feature, and implicitly define rural as the "baseline level".

Let's see what that looks like:

```
# create three dummy variables using get_dummies, then exclude the first dummy column
area_dummies = pd.get_dummies(data.Area, prefix='Area').iloc[:, 1:]
area_dummies.head()
```

Here is how we interpret the coding:

- **rural** is coded as Area_suburban=0 and Area_urban=0
- **suburban** is coded as Area_suburban=1 and Area_urban=0
- **urban** is coded as Area_suburban=0 and Area_urban=1

If this is confusing, think about why we only needed one dummy variable for Size (Size_large), not two dummy variables (Size_small and Size_large). In general, if you have a categorical feature with k "levels", you create k-1 dummy variables.
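As a side note, pandas can produce the k-1 encoding in a single step via the `drop_first` argument of `get_dummies`, which drops the first level (rural, in this case) for you:

```
# equivalent k-1 encoding in one step
pd.get_dummies(data.Area, prefix='Area', drop_first=True).head()
```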

Anyway, let's add these two new dummy variables onto the original DataFrame, and then include them in the linear regression model:

```
# concatenate the dummy variable columns onto the DataFrame (axis=0 means rows, axis=1 means columns)
data = pd.concat([data, area_dummies], axis=1)
data.head()
```

```
# create X and y
feature_cols = ['TV', 'Radio', 'Newspaper', 'Size_large', 'Area_suburban', 'Area_urban']
X = data[feature_cols]
y = data.Sales
# instantiate and fit
lm2 = LinearRegression()
lm2.fit(X, y)
# print the coefficients
list(zip(feature_cols, lm2.coef_))
```

How do we interpret the coefficients?

- Holding all other variables fixed, being a **suburban** area is associated with an average **decrease** in Sales of 106.56 widgets (as compared to the baseline level, which is rural).
- Being an **urban** area is associated with an average **increase** in Sales of 268.13 widgets (as compared to rural).

## 19. What Didn't We Cover?

- Detecting collinearity
- Diagnosing model fit
- Transforming features to fit non-linear relationships
- Interaction terms
- Assumptions of linear regression
- And so much more!

**Notes**

- You could certainly go very deep into linear regression, and learn how to apply it really, really well
- It's an excellent way to **start your modeling process** when working on a regression problem
- However, it is limited by the fact that it can only make good predictions if there is a **linear relationship** between the features and the response, which is why more complex methods (with higher variance and lower bias) will often outperform linear regression

## 20. Resources

- To go much more in-depth on linear regression, read Chapter 3 of An Introduction to Statistical Learning, from which this lesson was adapted. Alternatively, watch the related videos or read my quick reference guide to the key points in that chapter.
- To learn more about Statsmodels and how to interpret the output, DataRobot has some decent posts on simple linear regression and multiple linear regression.
- This introduction to linear regression is much more detailed and mathematically thorough, and includes lots of good advice.
- This is a relatively quick post on the assumptions of linear regression.