Decision Trees, ID3, Entropy, Information Gain, and more.

## Decision Trees Classification Model

- Supervised learning
- Classification
    - Taking an input X
    - Mapping it to a discrete label y

**Classification learning terms**

- Instances
    - Inputs, e.g., pictures or credit score examples
- Concept
    - A function (a mapping from objects to members of a set)
    - Takes an input and maps it to an output
- Target concept
    - The answer we want
    - The function we want
- Hypothesis class
    - The set of all concepts you're willing to entertain
    - All functions we care about (here, classification only)
- Sample
    - Training set
- Candidate
    - A concept that you think might be the target concept
    - The hope is that candidate concept = target concept
- Testing set
    - Take the candidate concept and test whether it does a good job

**Decision Trees Introduction**

- A decision tree simply asks a series of questions
- You'll have decision nodes
    - Pick an attribute and ask a question (e.g., is sex male?)
    - Values = edges (lines)
        - Yes: ask a different question (sub-node)
        - No: ask a different question (sub-node)
- You can continue to ask more questions with more nodes down the tree
- An example tree can be built from the Titanic survivors dataset
- Intuition
    - Features' headers = questions
        - sex_male
        - age_>_9.5
        - sibsp_>_2.5
    - Features' values = answers

**Decision Trees: Learning**

1. Pick the best attribute
2. Ask the question
3. Follow the answer path
4. Go to step 1
5. Repeat until you get an answer (a leaf)
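
The loop above can be sketched in plain Python over a hypothetical dict-based tree (the feature names follow the Titanic example in these notes; the outcome labels here are made up for illustration):

```python
# Walk a fitted decision tree to classify one example.
# Internal nodes are dicts mapping an attribute to {answer: subtree};
# leaves are plain labels.
def predict(tree, example):
    # Repeat until you get an answer (a leaf)
    while isinstance(tree, dict):
        attribute = next(iter(tree))      # the node's chosen attribute
        answer = example[attribute]       # ask the question
        tree = tree[attribute][answer]    # follow the answer path
    return tree

# Hypothetical Titanic-style tree (labels assumed for illustration)
tree = {"sex_male": {True: {"age_>_9.5": {True: "died", False: "survived"}},
                     False: "survived"}}
print(predict(tree, {"sex_male": True, "age_>_9.5": False}))  # survived
```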

**Decision Trees: Expressiveness**

- n-OR (ANY): true if any attribute is true
    - You need n (a linear number of) nodes
    - Easy!
- n-XOR (ODD PARITY): true if the number of attributes that are true is odd; otherwise false
    - You need 2^n (an exponential number of) nodes
    - Hard!
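
The parity target concept itself is trivial to state, which is what makes the tree's exponential size striking; a one-line version:

```python
# Odd parity (n-XOR): true when an odd number of boolean attributes are True.
# A decision tree has to branch on every attribute to compute this,
# which is why it needs on the order of 2^n nodes.
def odd_parity(attributes):
    """Return True if an odd number of the boolean attributes are True."""
    return sum(attributes) % 2 == 1

print(odd_parity([True, False, False]))  # True  (one attribute true)
print(odd_parity([True, True, False]))   # False (two attributes true)
```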

**ID3 Algorithm**

- A: best attribute
- Assign A as decision attribute for node
- For each value of A, create a descendant of node
- Sort training examples to leaves
- If examples perfectly classified, stop
- Else, iterate over leaves
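
The steps above can be sketched in plain Python. This is a minimal illustration (not scikit-learn's implementation): it handles only discrete attributes, does no pruning, and uses the small "grade" example from the information-gain section of these notes:

```python
# Minimal ID3 sketch: pick the attribute with the highest information
# gain, split on it, and recurse until examples are perfectly classified.
from collections import Counter
from math import log2

def entropy(labels):
    """H = -sum over classes of p(i) * log2(p(i))."""
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """entropy(parent) minus the weighted entropy of the children."""
    children = {}
    for row, label in zip(rows, labels):
        children.setdefault(row[attribute], []).append(label)
    weighted = sum(len(c) / len(labels) * entropy(c)
                   for c in children.values())
    return entropy(labels) - weighted

def id3(rows, labels, attributes):
    # If examples are perfectly classified, stop with a leaf
    if len(set(labels)) == 1:
        return labels[0]
    # No attributes left: fall back to the majority class
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # A: the best attribute
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    tree = {best: {}}
    # For each value of A, create a descendant and sort examples to it
    for value in {row[best] for row in rows}:
        pairs = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        tree[best][value] = id3([r for r, _ in pairs],
                                [l for _, l in pairs],
                                [a for a in attributes if a != best])
    return tree

# Toy data: the "grade" feature from the information-gain example
rows = [{"grade": "steep"}, {"grade": "steep"},
        {"grade": "flat"}, {"grade": "steep"}]
labels = ["slow", "slow", "fast", "fast"]
print(id3(rows, labels, ["grade"]))
```

The steep branch still holds a mixed set (slow, slow, fast) with no attributes left, so the sketch falls back to the majority label there.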

**ID3: Bias**

- Restriction bias
    - Restricted to a hypothesis set
    - Example: only those hypotheses representable by a decision tree
- Preference bias
    - Which hypotheses from that set we prefer
- Inductive bias of ID3
    - Prefers trees that have good splits near the top
    - Prefers correct over incorrect
    - Prefers shorter trees

**Choosing Best Attributes: Information Gain**

- Goal: choose the attribute that minimizes the children's entropy, which maximizes information gain
- Intuition
    - If all samples go to one side of the tree
        - No information gain from using that attribute
        - Entropy is unchanged
    - If samples go to each side of the tree with no clear split (all mixed up)
        - No information gain
        - Entropy is unchanged
    - If samples split distinctly into each side of the tree
        - Maximum information gain
        - Entropy fell

**Decision Trees: Other Considerations**

- What happens if we have continuous attributes?
    - Age, weight, distance, etc.
    - Create an attribute with a range
        - e.g., 20 <= age < 30
        - This returns True or False
    - You can repeat a continuous attribute along a path
        - Ask 20 <= age < 30
        - Then ask age > 30, etc.
        - But it does not make sense to repeat a discrete attribute!
- When do we stop?
    - Avoid overfitting (when the tree gets too big)!
    - Solution: pruning
        - Hold out a validation set
        - Build the full tree
        - Cut leaves, then check against the validation set
        - If the error is low enough, stop expanding the tree

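
scikit-learn doesn't expose reduced-error pruning directly, but it offers cost-complexity pruning via the `ccp_alpha` parameter, and a held-out validation set can still pick the pruning strength; a sketch on the iris data:

```python
# Prune a decision tree by picking the cost-complexity alpha that
# scores best on a held-out validation set.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Candidate pruning strengths from the cost-complexity pruning path
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train)

best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    tree.fit(X_train, y_train)
    score = tree.score(X_val, y_val)  # check against the validation set
    if score > best_score:
        best_alpha, best_score = alpha, score

print(best_alpha, best_score)
```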

**Scikit-learn for Decision Trees**

In [1]:

```
# Import
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
```

In [2]:

```
# Load iris dataset
from sklearn.datasets import load_iris
# Instantiate
iris = load_iris()
# Create training and feature
X = iris.data
y = iris.target
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
```

In [24]:

```
# Same 3-step process
# 1. Instantiate
#    default criterion='gini'; you can swap to criterion='entropy'
dtc = DecisionTreeClassifier(random_state=0)
# 2. Fit
dtc.fit(X_train, y_train)
# 3. Predict (the iris dataset has 4 features)
y_pred_class = dtc.predict(X_test)
```

**Evaluating with Accuracy**

In [4]:

```
from sklearn import metrics
```

In [5]:

```
# Accuracy
metrics.accuracy_score(y_test, y_pred_class)
```


**Dealing with overfitting: min_samples_split**

- Default min_samples_split=2
    - A node with fewer than 2 samples cannot be split further
    - With small values the tree grows very deep and overfits your data
- If you increase min_samples_split
    - You decrease the depth of your tree
        - This is because nodes run out of samples to split
    - This reduces overfitting

In [11]:

```
# 1. Instantiate with min_samples_split=4
dtc = DecisionTreeClassifier(min_samples_split=4, random_state=0)
# 2. Fit
dtc.fit(X_train, y_train)
# 3. Predict
y_pred_class = dtc.predict(X_test)
# Accuracy
metrics.accuracy_score(y_test, y_pred_class)
```


**Using GridSearchCV for optimizing parameters**

In [9]:

```
# Import
from sklearn.model_selection import GridSearchCV
# Define the parameter values that should be searched
# (min_samples_split must be at least 2)
sample_split_range = list(range(2, 50))
# Create a parameter grid: map the parameter names to the values that should be searched
# Simply a python dictionary
# Key: parameter name
# Value: list of values that should be searched for that parameter
# Single key-value pair for param_grid
param_grid = dict(min_samples_split=sample_split_range)
# instantiate the grid
grid = GridSearchCV(dtc, param_grid, cv=10, scoring='accuracy')
# fit the grid with data
grid.fit(X_train, y_train)
```


In [10]:

```
# examine the best model
# Single best score achieved across all params (min_samples_split)
print(grid.best_score_)
# Dictionary containing the parameters (min_samples_split) used to generate that score
print(grid.best_params_)
# Actual model object fit with those best parameters
# Shows default parameters that we did not specify
print(grid.best_estimator_)
```

**Entropy**

- Controls how a decision tree decides where to split the data
- Definition
    - A measure of impurity in a bunch of examples
    - The opposite of purity
- Concept
    - Find a variable and split on it
    - Split such that the resulting sets contain the least impurity
    - Repeat the process until the sets are pure
- Values
    - For two classes, entropy ranges between 0 and 1.0
    - All examples in the same class
        - Entropy = 0
    - Examples evenly split amongst the classes
        - Entropy = 1.0
- Formula
    - entropy = -sum over all classes i of p(i) * log2(p(i))
    - p(i) is the fraction of examples in class i
- Example of the formula
    - Assume 2 classes at the node
        - slow (2 examples)
        - fast (2 examples)
    - p(slow) = fraction of slow over all examples = 2/4 = 0.5
    - p(fast) = fraction of fast over all examples = 2/4 = 0.5
    - entropy = -p(slow)*log2(p(slow)) - p(fast)*log2(p(fast))

In [17]:

```
# Calculating impurity from the example above
# Maximum state of impurity, 1.0
import numpy as np
-0.5*np.log2(0.5) + (-0.5*np.log2(0.5))
```


**Information gain**

- Formula
    - information gain = entropy(parent) - [weighted average] entropy(children)
    - weight = number of examples in the child / total number of examples
- Decision tree algorithms maximize information gain
- Example
    - Assume the feature "grade" has the following examples
        - Steep (Slow, S)
        - Steep (Slow, S)
        - Flat (Fast, F)
        - Steep (Fast, F)
    - If we split based on "grade": is it steep?
        - Child 1: Flat (Fast)
        - Child 2: Steep (Slow, Slow, Fast)

In [19]:

```
# Entropy of child 1 = 0
# Perfect split for this child
# Entropy of child 2 = 0.918
-(2/3)*np.log2(2/3) - (1/3)*np.log2(1/3)
```


In [22]:

```
# Weighted average of entropy(children)
(3/4)*(0.9184) + (1/4)*0
```


In [23]:

```
# Information gain = entropy(parent) - weighted average of entropy(children)
1 - ((3/4)*0.9184 + (1/4)*0)
```


**Decision Trees**

- Prone to overfitting
    - Especially if you have a lot of features
    - Stop the growth of the tree at the appropriate time
- You can build bigger classifiers out of decision trees
    - These are called ensemble classifiers
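
One common ensemble of decision trees is scikit-learn's `RandomForestClassifier`, which trains many trees on bootstrap samples and averages their votes; a quick sketch on the iris data used above:

```python
# Ensemble of decision trees: a random forest typically overfits
# less than a single deep tree.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rfc = RandomForestClassifier(n_estimators=100, random_state=0)
rfc.fit(X_train, y_train)
print(rfc.score(X_test, y_test))
```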