
Feature Engineering: Scaling and Selection

Feature Scaling

  • Min-max rescaling formula, which maps each value of a feature into [0, 1]
    • $$X' = \frac{X - X_{min}}{X_{max} - X_{min}}$$
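    • Worked example, using the weight values from the code cells below (min 115, max 175): the middle value 140 rescales to $$X' = \frac{140 - 115}{175 - 115} = \frac{25}{60} \approx 0.4167$$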

Algorithms affected by feature rescaling

  • Algorithms that trade two or more dimensions off against each other (for example through a distance calculation) are affected by rescaling; see the sketch after this list
    • SVM with RBF kernel
      • Maximizing the margin involves distances computed across two or more dimensions
        • Think of x and y axes on very different scales: the distance is dominated by the larger-scale axis
    • K-means clustering
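
A minimal sketch of why rescaling matters for these algorithms; the weight/height values are made up for illustration, and only NumPy is assumed:

In [ ]:
import numpy as np

# Hypothetical two-feature points: weight in pounds (large scale)
# and height in feet (small scale).
a = np.array([115.0, 5.0])
b = np.array([175.0, 6.5])

# Unscaled Euclidean distance: the weight axis dominates, so height
# barely influences an RBF-kernel SVM or k-means.
print(np.linalg.norm(a - b))        # ~60.02, almost entirely from weight

# If each feature is min-max scaled to [0, 1] (these points happen to be
# the min and max of both features), the axes contribute equally.
a_scaled = np.array([0.0, 0.0])
b_scaled = np.array([1.0, 1.0])
print(np.linalg.norm(a_scaled - b_scaled))   # ~1.41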

Feature Scaling Manually in Python

In [21]:
### FYI, the most straightforward implementation might 
### throw a divide-by-zero error, if the min and max values are the same
### but think about this for a second--that means that every
### data point has the same value for that feature!  
### why would you rescale it?  Or even use it at all?
def featureScaling(arr):

    max_num = max(arr)
    min_num = min(arr)

    lst = []

    for num in arr:
        X_prime = (num - min_num) / (max_num - min_num)
        lst.append(X_prime)

    return lst

# tests of your feature scaler--line below is input data
data = [115, 140, 175]
print(featureScaling(data))
[0.0, 0.4166666666666667, 1.0]

Feature Scaling in Scikit-learn

In [23]:
from sklearn.preprocessing import MinMaxScaler
import numpy as np
In [28]:
# 3 different training points for 1 feature
weights = np.array([[115], [140], [175]]).astype(float)
In [29]:
# Instantiate
scaler = MinMaxScaler()
In [33]:
# Rescale
rescaled_weights = scaler.fit_transform(weights)
rescaled_weights
Out[33]:
array([[ 0.        ],
       [ 0.41666667],
       [ 1.        ]])
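
Note that fit_transform learned the min and max from the training weights; to rescale a new point with those same parameters, a transform call like the one below can be used (the 150-lb value is just an illustrative example):

In [ ]:
# Apply the already-fitted scaler to a new data point:
# (150 - 115) / (175 - 115) = 35 / 60 ≈ 0.5833
new_weight = np.array([[150.]])
scaler.transform(new_weight)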

Feature Selection

  • Why do we want to select features?
    • Knowledge discovery
      • Interpretability
      • Insight
    • Curse of dimensionality (the amount of data needed grows exponentially with the number of features)

Feature Selection: Algorithms

  • How hard is the problem? Exponential (a quick numeric check follows this list)
    • Choosing a subset of exactly m features from the original n features gives $${n \choose m}$$ possibilities, where m <= n
    • If m is not fixed in advance, there are $$2^n$$ possible subsets to search
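
A quick sanity check of that combinatorics, assuming a hypothetical n = 20 original features (math.comb needs Python 3.8+):

In [ ]:
from math import comb

n = 20   # original number of features
m = 5    # size of one candidate subset (m <= n)

print(comb(n, m))   # 15504 subsets of exactly 5 features
print(2 ** n)       # 1048576 possible subsets when m is not fixed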

Filtering vs Wrapping

  • Disadvantages of filtering
    • No feedback
      • The learning algorithm cannot report back on the impact of the feature changes
      • The selection criteria are built into the search, with no reference to the learner
      • Ignores the learning problem
    • Features are looked at in isolation
  • Advantages of filtering
    • Fast
  • Disadvantages of Wrapping
    • Slow
  • Advantages of wrapping
    • Feedback
      • Criteria are built into the learner
      • Takes model bias and learning into account

Filtering: Search Part

  • We can use a decision tree for the search step and then feed the features it selects to a learner that does not filter features well on its own (a sketch follows this list)
    • This works because decision trees are good at picking out the most informative features
    • Keeping all of the features instead makes it easy to overfit
  • Other generic ways
    • Information gain
    • Entropy
    • "Useful" features
    • Independent, non-redundant
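
A minimal sketch of the decision-tree-as-filter idea above: fit a tree, keep the features it found informative, and hand only those to a learner that does not select features on its own. The iris data, the k-NN learner, and the 0.1 importance threshold are all illustrative choices, not part of the original notes:

In [ ]:
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Filter step: the decision tree ranks features by information gain.
tree = DecisionTreeClassifier(criterion="entropy", random_state=42).fit(X, y)

# Keep only features whose importance clears an (arbitrary) threshold.
keep = tree.feature_importances_ > 0.1
X_reduced = X[:, keep]

# Learner step: k-NN sees only the features the tree found informative.
knn = KNeighborsClassifier().fit(X_reduced, y)
print(X.shape, "->", X_reduced.shape)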

Wrapping: Search Part (have to deal with the exponential problem)

  • Hill climbing
  • Randomized optimization
  • Forward selection (sketched in code below)
    • Start with the individual features
    • Pass each feature on its own to the learning algorithm and record its score; keep the best one
    • Then grow the subset
      • 2 features
        • Try adding each remaining feature and keep the highest-scoring pair
      • 3 features
        • Keep the highest-scoring triple
        • If the score is lower than with 2 features, stop
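
A minimal sketch of that forward-selection wrapper; the iris data, the k-NN learner, and 5-fold cross-validation as the scoring criterion are illustrative choices:

In [ ]:
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
learner = KNeighborsClassifier()

selected, best_score = [], 0.0
remaining = list(range(X.shape[1]))

while remaining:
    # Score each candidate subset: the already-selected features plus one more.
    scores = {f: cross_val_score(learner, X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    best_feature = max(scores, key=scores.get)

    # Stop as soon as adding another feature no longer improves the score.
    if scores[best_feature] <= best_score:
        break
    best_score = scores[best_feature]
    selected.append(best_feature)
    remaining.remove(best_feature)

print("selected feature indices:", selected, "score:", round(best_score, 3))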

Relevance: Measures Effect on BOC

  • A feature X_i is strongly relevant if removing it degrades the Bayes Optimal Classifier (BOC)
    • The BOC is the best you can do on average, if you can find it
  • A feature X_i is weakly relevant if
    • It is not strongly relevant, and
    • There exists some subset of features S such that adding X_i to S improves the BOC
  • Otherwise, X_i is irrelevant

Usefulness: Measures Effect on Particular Predictor

  • A feature is useful if it helps minimize error for a particular model/learner (as opposed to relevance, which is defined with respect to the BOC)