Feature engineering and scaling with scikit-learn.

## Feature Engineering: Scaling and Selection

**Feature Scaling**

- Formula (min-max rescaling, which maps every value of a feature into the range 0 to 1)
    - $$X' = \frac {X - X_{min}}{X_{max} - X_{min}}$$
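As a worked instance of the formula, take a feature with the three values 115, 140 and 175, so that $X_{min} = 115$ and $X_{max} = 175$. The middle value then rescales to:

$$X' = \frac{140 - 115}{175 - 115} = \frac{25}{60} \approx 0.417$$

The smallest value maps to 0 and the largest to 1.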

**Algorithms affected by feature rescaling**

- Algorithms whose outcome depends on two or more dimensions jointly, typically through a distance, are affected by rescaling
    - SVM with RBF kernel
        - When you maximize the distance, it is computed across 2 or more dimensions
            - Think of x and y axes with different scales: the distance between two points depends on both
    - K-means clustering
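A minimal sketch of why distance-based methods react to rescaling. The four points and the helper names `min_max_scale` and `nearest_to_first` are made up for illustration; the first column has a much larger numeric range than the second, so it dominates the raw Euclidean distance.

```python
import numpy as np

def min_max_scale(X):
    """Rescale every column of X to the range [0, 1]."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    return (X - col_min) / col_range

def nearest_to_first(X):
    """Return the row index (>= 1) of the point closest to row 0."""
    X = np.asarray(X, dtype=float)
    dists = np.linalg.norm(X[1:] - X[0], axis=1)
    return int(np.argmin(dists)) + 1

# Hypothetical data: column 0 spans about 1000 units, column 1 about 8 units
points = np.array([
    [1000.0, 1.0],
    [1100.0, 9.0],
    [1300.0, 1.5],
    [2000.0, 5.0],
])

raw_nearest = nearest_to_first(points)
scaled_nearest = nearest_to_first(min_max_scale(points))
print(raw_nearest, scaled_nearest)
```

Before scaling, row 1 is the nearest neighbour of row 0 because only the first column matters; after scaling, both columns contribute and row 2 becomes the nearest.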

**Feature Scaling Manually in Python**

In [21]:

```
### FYI, the most straightforward implementation might
### throw a divide-by-zero error, if the min and max values are the same
### but think about this for a second--that means that every
### data point has the same value for that feature!
### why would you rescale it? Or even use it at all?
def featureScaling(arr):
    max_num = max(arr)
    min_num = min(arr)
    lst = []
    for num in arr:
        X_prime = (num - min_num) / (max_num - min_num)
        lst.append(X_prime)
    return lst

# tests of your feature scaler--line below is input data
data = [115, 140, 175]
print(featureScaling(data))
```
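The comments in the cell above point out that a constant feature leads to a divide-by-zero. One possible way to make that case explicit is sketched below; the function name `feature_scaling_safe` and the choice to raise a `ValueError` are assumptions for illustration, not part of the original exercise.

```python
def feature_scaling_safe(arr):
    """Min-max rescale a list of numbers, rejecting constant features."""
    max_num = max(arr)
    min_num = min(arr)
    span = max_num - min_num
    if span == 0:
        # Every data point has the same value, so the feature carries no information
        raise ValueError("cannot rescale a constant feature")
    return [(num - min_num) / span for num in arr]

print(feature_scaling_safe([115, 140, 175]))
```

Returning a list of zeros would be another reasonable convention; raising an error simply forces the caller to decide whether to drop the feature.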

**Feature Scaling in Scikit-learn**

In [23]:

```
from sklearn.preprocessing import MinMaxScaler
import numpy as np
```

In [28]:

```
# 3 different training points for 1 feature
weights = np.array([[115], [140], [175]]).astype(float)
```

In [29]:

```
# Instantiate
scaler = MinMaxScaler()
```

In [33]:

```
# Rescale
rescaled_weights = scaler.fit_transform(weights)
rescaled_weights
```
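As a sanity check, the sketch below compares the `MinMaxScaler` result with the min-max formula applied by hand, and then reuses the fitted scaler on a new point. The value 150 is an arbitrary example, and the variable names `manual` and `new_point` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Same single-feature training data as above
weights = np.array([[115.0], [140.0], [175.0]])

scaler = MinMaxScaler()
rescaled = scaler.fit_transform(weights)

# The same rescaling written out with the min-max formula
manual = (weights - weights.min()) / (weights.max() - weights.min())

# A fitted scaler reuses the training min and max on unseen data
new_point = scaler.transform(np.array([[150.0]]))

print(np.allclose(rescaled, manual), new_point)
```

Calling `transform` rather than `fit_transform` on new data keeps the training minimum and maximum, so a later value outside that range would map outside 0 to 1.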


**Feature Selection**

- Why do we want to select features?
    - Knowledge discovery
        - Interpretability
        - Insight
    - Curse of dimensionality

**Feature Selection: Algorithms**

- How hard is the problem?
    - Exponential
        - $${n \choose m}$$
        - $$2^n$$
        - n choose m subsets when m is fixed, and 2^n subsets in total when m is free
            - Original number of features: n
            - New number of features: m
            - Where m <= n
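To get a feel for these counts, the short sketch below evaluates both expressions with Python's `math.comb`. The helper names and the example sizes (10 features, subsets of size 3) are arbitrary choices for illustration.

```python
import math

def subsets_of_size(n, m):
    """Number of ways to choose m features out of n."""
    return math.comb(n, m)

def all_subsets(n):
    """Number of feature subsets of any size, including the empty one."""
    return sum(math.comb(n, m) for m in range(n + 1))

n = 10
print(subsets_of_size(n, 3))   # subsets of exactly 3 features
print(all_subsets(n), 2 ** n)  # the two totals agree
```

Summing n choose m over every m from 0 to n gives 2^n, which is why an exhaustive search over feature subsets becomes infeasible quickly as n grows.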

**Filtering vs Wrapping**

- Disadvantages of **filtering**
    - No feedback
        - The learning algorithm cannot report back on the impact of changes in the features
        - The criterion is built into the search, with no reference to the learner
        - Ignores the learning problem
    - You look at features in isolation
- Advantages of **filtering**
    - Fast
- Disadvantages of **wrapping**
    - Slow
- Advantages of **wrapping**
    - Feedback
        - The criterion is built into the learner
        - Takes into account model bias and learning

**Filtering: Search Part**

- We can use a Decision Tree as the search function, then feed the selected features to a learner that is not good at filtering features on its own
    - This works because DTs are good at picking out the best features
    - If you insist on keeping all the features, you can easily overfit
- Other generic criteria
    - Information gain
    - Entropy
    - "Useful" features
    - Independent, non-redundant features
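A rough sketch of the decision-tree filter described above, under several assumptions that are not in the notes: the iris dataset as example data, an entropy-based tree as the search step, k-nearest neighbours as the downstream learner, and an arbitrary choice of keeping 2 features.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Search step: rank features with an entropy-based decision tree
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
importances = tree.feature_importances_

# Keep the k highest-ranked features (k = 2 is an arbitrary choice)
k = 2
top_k = np.sort(np.argsort(importances)[-k:])

# Learner step: k-NN is trained only on the features that survived the filter
knn = KNeighborsClassifier(n_neighbors=5)
score_all = cross_val_score(knn, X, y, cv=5).mean()
score_top = cross_val_score(knn, X[:, top_k], y, cv=5).mean()
print(top_k, round(score_all, 3), round(score_top, 3))
```

This counts as filtering because the selection criterion never consults the k-NN learner; the two cross-validation scores are only printed for comparison afterwards.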

**Wrapping: Search Part (have to deal with the exponential problem)**

- Hill climbing
- Randomized optimization
- Forward
    - Start with the n candidate features
    - Pass the features individually to the learning algorithm and get their scores
    - Pick the best feature, then grow the subset one feature at a time
        - 2 features
            - Choose the highest-scoring pair
        - 3 features
            - Choose the highest-scoring triple
            - If the score is lower than with 2 features, stop
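The forward search above can be sketched as a greedy loop. This is an illustrative version under assumptions of my own: the function name `forward_select`, cross-validated accuracy as the score, k-nearest neighbours as the learner, and the iris dataset as example data. It stops as soon as adding a feature no longer improves the score.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def forward_select(X, y, learner, cv=5):
    """Greedy forward selection driven by the learner's own CV score."""
    remaining = list(range(X.shape[1]))
    selected, best_score = [], -np.inf
    while remaining:
        # Score every candidate feature when added to the current subset
        scored = [
            (cross_val_score(learner, X[:, selected + [j]], y, cv=cv).mean(), j)
            for j in remaining
        ]
        score, j = max(scored)
        if score <= best_score:
            break  # the larger subset is no better, so stop
        selected.append(j)
        remaining.remove(j)
        best_score = score
    return selected, best_score

X, y = load_iris(return_X_y=True)
selected, score = forward_select(X, y, KNeighborsClassifier(n_neighbors=5))
print(selected, round(score, 3))
```

Each round trains the learner once per remaining feature, so the search costs at most on the order of n squared evaluations instead of the 2^n needed to try every subset.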

**Relevance: Measures Effect on BOC**

- A feature X_i is strongly relevant if removing it degrades the Bayes Optimal Classifier (BOC)
    - The BOC is the best you can do on average, if you can find it
- A feature X_i is weakly relevant if
    - It is not strongly relevant
    - There exists a subset of features S such that adding X_i to S improves the BOC
- Otherwise, X_i is irrelevant
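These definitions can be checked by brute force on a tiny, fully known distribution, where the best achievable accuracy for a feature subset is obtained by predicting the majority label for each combination of feature values. The sketch below is only an illustration: the toy features (`a`, `b`, `not_a`, `noise`), the label `a AND b`, and the helper names are all assumptions.

```python
from collections import Counter, defaultdict
from itertools import combinations, product

def bayes_optimal_accuracy(rows, labels, subset):
    """Best achievable accuracy using only the features in `subset`,
    assuming `rows` enumerates the whole distribution uniformly."""
    groups = defaultdict(Counter)
    for row, label in zip(rows, labels):
        key = tuple(row[i] for i in subset)
        groups[key][label] += 1
    correct = sum(max(counts.values()) for counts in groups.values())
    return correct / len(rows)

def relevance(rows, labels, i):
    """Classify feature i as 'strong', 'weak' or 'irrelevant'."""
    n = len(rows[0])
    others = [j for j in range(n) if j != i]
    full = bayes_optimal_accuracy(rows, labels, range(n))
    # Strongly relevant: removing the feature degrades the optimum
    if bayes_optimal_accuracy(rows, labels, others) < full:
        return "strong"
    # Weakly relevant: some subset S improves when the feature is added
    for size in range(len(others) + 1):
        for subset in combinations(others, size):
            without = bayes_optimal_accuracy(rows, labels, subset)
            with_i = bayes_optimal_accuracy(rows, labels, subset + (i,))
            if with_i > without:
                return "weak"
    return "irrelevant"

# Toy distribution: features (a, b, not_a, noise), label = a AND b
rows = [(a, b, 1 - a, z) for a, b, z in product([0, 1], repeat=3)]
labels = [a & b for a, b, _, _ in rows]
result = [relevance(rows, labels, i) for i in range(4)]
print(result)
```

On this toy table `b` comes out strongly relevant, `a` and `not_a` are only weakly relevant because each can stand in for the other, and `noise` is irrelevant.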

**Usefulness: Measures Effect on Particular Predictor**

- A feature is useful if it helps minimize error for a given model/learner
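One way to see how usefulness differs from relevance is a constant feature: it carries no information about the target, yet it can reduce the error of a particular learner. The sketch below assumes a linear model without an intercept, fitted with `numpy.linalg.lstsq`, on made-up data following `y = 2x + 5`; the helper name `least_squares_error` is illustrative.

```python
import numpy as np

def least_squares_error(X, y):
    """Mean squared error of the best linear fit with no intercept term."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.mean((X @ coef - y) ** 2))

# Made-up data with an offset that a line through the origin cannot capture
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 5.0

X_without = x.reshape(-1, 1)                    # the informative feature only
X_with = np.column_stack([x, np.ones_like(x)])  # plus a constant feature

err_without = least_squares_error(X_without, y)
err_with = least_squares_error(X_with, y)
print(err_without, err_with)
```

For this learner the constant column is useful because it lets the model represent the offset, even though it would not change what a Bayes Optimal Classifier could achieve.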