Dimensionality reduction and feature transformation with scikit-learn.
Dimensionality Reduction and Feature Transformation

## Dimensionality Reduction: Feature Transformation

Feature Transformation

• The problem of pre-processing a set of m features to create a new set of n features while retaining as much information as possible
• Number of features m reduced to n where
• $$n < m$$
• Transformation operator P^T
• $$P^Tx$$
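As a minimal sketch of what applying $$P^Tx$$ looks like, assume a hypothetical projection matrix `P` of shape (m, n) with m = 3 and n = 2. The numbers are made up for illustration; a real `P` would come from an algorithm such as PCA.

```python
import numpy as np

# Hypothetical projection matrix P with shape (m, n) = (3, 2).
# Column 0 averages the first two original features, column 1 keeps the third.
P = np.array([[0.5, 0.0],
              [0.5, 0.0],
              [0.0, 1.0]])

x = np.array([2.0, -1.0, 5.0])  # one sample with m = 3 features

x_new = P.T @ x  # transformed sample with n = 2 features
print(x_new.shape, x_new)
```

Each new feature is a linear combination of the old ones, which is what separates feature transformation from simply dropping columns.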

• A lot of words
• Polysemy (a word with multiple meanings)
• Car
• Automobile
• First element in a cons cell in Lisp
• This would give
• False positives
• Synonymy (multiple words with the same meaning)
• False negatives
• We can combine words together for better indicators
• We can address this with feature-transformation algorithms such as
• Principal Components Analysis
• Independent Components Analysis

Measurable vs Latent Features

• Measurable features
• Square footage
• Number of rooms
• School ranking
• Neighborhood Safety
• Latent features: all the measurable features above are really indirect measurements of these 2 latent variables
• Size
• Neighborhood
• How do we condense our features while preserving information?
• We can use Scikit-learn
• SelectKBest (k = no. of features to keep)
• In this scenario we could use this, since we expect only 2 latent variables
• This discards every feature except the two with the best scores
• SelectPercentile
• You could also use this at 50% to keep 2 of the 4 features
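A minimal sketch of both selectors, assuming synthetic stand-ins for the four housing features and a made-up price target. The data, the `f_regression` scoring function and the coefficients are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_regression

rng = np.random.RandomState(0)
n = 200
size = rng.normal(size=n)          # latent "size"
neighborhood = rng.normal(size=n)  # latent "neighborhood"

# Four measurable features, each a noisy view of one latent variable
X = np.column_stack([
    size + 0.1 * rng.normal(size=n),          # square footage
    size + 0.5 * rng.normal(size=n),          # number of rooms
    neighborhood + 0.1 * rng.normal(size=n),  # school ranking
    neighborhood + 0.5 * rng.normal(size=n),  # neighborhood safety
])
y = 3 * size + 2 * neighborhood + 0.1 * rng.normal(size=n)  # hypothetical price

# Keep the 2 best-scoring columns, or equivalently the top 50% of 4 columns
X_k = SelectKBest(score_func=f_regression, k=2).fit_transform(X, y)
X_p = SelectPercentile(score_func=f_regression, percentile=50).fit_transform(X, y)
print(X_k.shape, X_p.shape)
```

Both selectors keep original columns unchanged; they do not build composite features, so nothing guarantees that the two survivors cover both latent variables.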

Principal Component

• We have many features, but we hypothesise a smaller number of features actually driving the patterns
• We can try making a composite feature (principal component) that more directly probes the underlying phenomenon
• Using PCA for dimensionality reduction
• Using PCA for unsupervised learning
• Example
• Measurable Features
1. Square Footage
2. Number of Rooms
• Latent Feature
1. Size
• We will project the points onto the principal component
• The data will then be one-dimensional

Determining the Principal Component

• Principal component of a dataset is the direction that has the largest variance
• Variance
• Retains the maximum amount of "information" in the original data
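One common way to find the direction of largest variance is to take the top eigenvector of the covariance matrix. The sketch below assumes synthetic 2-D data stretched along the diagonal; it is an illustration, not how scikit-learn computes PCA internally.

```python
import numpy as np

rng = np.random.RandomState(0)

# Synthetic 2-D data that varies mostly along the direction (1, 1)
t = rng.normal(size=500)
X = np.column_stack([t, t]) + 0.1 * rng.normal(size=(500, 2))

Xc = X - X.mean(axis=0)                 # center the data
cov = np.cov(Xc, rowvar=False)          # 2 x 2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order

pc1 = eigvecs[:, -1]                    # eigenvector with the largest eigenvalue
var_along_pc1 = np.var(Xc @ pc1, ddof=1)

print(pc1)                              # roughly +/-(0.707, 0.707)
print(var_along_pc1, eigvals[-1])       # variance along pc1 equals the top eigenvalue
```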

Maximal Variance and Information Loss

• Information loss
• Example
• The length of the yellow line
• Total information loss is the sum of the distances from each point to its projection on the principal component
• Projection onto the direction of maximal variance minimizes the distance from each old (higher-dimensional) data point to its new transformed value
• Minimizes information loss (sum of red lines instead of blue lines)
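A small numerical check of this claim, assuming synthetic data and measuring loss as the sum of squared distances (the quantity PCA formally minimizes). `reconstruction_loss` is a hypothetical helper written for this sketch.

```python
import numpy as np

def reconstruction_loss(Xc, direction):
    """Sum of squared distances from centered points to their projection on a line."""
    d = direction / np.linalg.norm(direction)
    projected = np.outer(Xc @ d, d)
    return np.sum((Xc - projected) ** 2)

rng = np.random.RandomState(1)
t = rng.normal(size=300)
X = np.column_stack([t, 0.5 * t]) + 0.2 * rng.normal(size=(300, 2))
Xc = X - X.mean(axis=0)

# First principal component = top eigenvector of the covariance matrix
_, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
pc1 = eigvecs[:, -1]

loss_pc1 = reconstruction_loss(Xc, pc1)
loss_x_axis = reconstruction_loss(Xc, np.array([1.0, 0.0]))
print(loss_pc1 < loss_x_axis)  # projecting on pc1 loses less than projecting on the x-axis
```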
• Algorithm 1
PCA as a General Algorithm for Feature Transformation

• When to use PCA

• Latent features driving the patterns in data
• Dimensionality reduction
• Visualize high-dimensional data
• You can easily draw scatterplots with 2-dimensional data
• Reduce noise
• You get rid of noise by throwing away less useful components
• Make other algorithms work better with fewer inputs
• Very high dimensionality might result in overfitting or take up a lot of computing power (time)
• A typical example is in eigenfaces
• You can use PCA to reduce the dimensionality to then use SVMs for example

PCA Review

• Systematized way to transform input features into principal components (PC)
• Use PCs as new features
• PCs are directions in data that maximize variance (minimize information loss) when you project/compress down onto them
• The more variance of data along a PC, the higher that PC is ranked
• Most variance, most information would be the first PC
• Second-most variance would be the second PC
• Max number of PCs = number of input features
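The ranking and the cap on the number of PCs can be checked directly. This sketch assumes the Iris data (4 features) and fits PCA without limiting the number of components.

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA

X = datasets.load_iris().data  # 150 samples, 4 features

# With n_components left unset, PCA keeps every component it can
pca = PCA().fit(X)
ratios = pca.explained_variance_ratio_

print(len(ratios))   # one PC per input feature
print(ratios)        # sorted from most to least variance
print(ratios.sum())  # all PCs together account for all the variance
```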

PCA with Scikit-learn

In :
# Import
from sklearn.decomposition import PCA
from sklearn import datasets

In :
# Create data
iris = datasets.load_iris()
X = iris.data
y = iris.target


Standardizing

• Whether to standardize the data before running PCA on the covariance matrix depends on the measurement scales of the original features
• Since PCA yields a feature subspace that maximizes the variance along the axes, it makes sense to standardize the data, especially if the features were measured on different scales. Although all features in the Iris dataset were measured in centimeters, let us still transform the data onto unit scale (mean=0 and variance=1), which many machine learning algorithms need to perform well.
In :
from sklearn.preprocessing import StandardScaler
X_std = StandardScaler().fit_transform(X)

In :
# Instantiate
pca = PCA(n_components=2)

# Fit and Apply dimensionality reduction on X
pca.fit_transform(X_std)

Out:
array([[ -2.26454173e+00,  -5.05703903e-01],
[ -2.08642550e+00,   6.55404729e-01],
[ -2.36795045e+00,   3.18477311e-01],
[ -2.30419716e+00,   5.75367713e-01],
[ -2.38877749e+00,  -6.74767397e-01],
[ -2.07053681e+00,  -1.51854856e+00],
[ -2.44571134e+00,  -7.45626750e-02],
[ -2.23384186e+00,  -2.47613932e-01],
[ -2.34195768e+00,   1.09514636e+00],
[ -2.18867576e+00,   4.48629048e-01],
[ -2.16348656e+00,  -1.07059558e+00],
[ -2.32737775e+00,  -1.58587455e-01],
[ -2.22408272e+00,   7.09118158e-01],
[ -2.63971626e+00,   9.38281982e-01],
[ -2.19229151e+00,  -1.88997851e+00],
[ -2.25146521e+00,  -2.72237108e+00],
[ -2.20275048e+00,  -1.51375028e+00],
[ -2.19017916e+00,  -5.14304308e-01],
[ -1.89407429e+00,  -1.43111071e+00],
[ -2.33994907e+00,  -1.15803343e+00],
[ -1.91455639e+00,  -4.30465163e-01],
[ -2.20464540e+00,  -9.52457317e-01],
[ -2.77416979e+00,  -4.89517027e-01],
[ -1.82041156e+00,  -1.06750793e-01],
[ -2.22821750e+00,  -1.62186163e-01],
[ -1.95702401e+00,   6.07892567e-01],
[ -2.05206331e+00,  -2.66014312e-01],
[ -2.16819365e+00,  -5.52016495e-01],
[ -2.14030596e+00,  -3.36640409e-01],
[ -2.26879019e+00,   3.14878603e-01],
[ -2.14455443e+00,   4.83942097e-01],
[ -1.83193810e+00,  -4.45266836e-01],
[ -2.60820287e+00,  -1.82847519e+00],
[ -2.43795086e+00,  -2.18539162e+00],
[ -2.18867576e+00,   4.48629048e-01],
[ -2.21111990e+00,   1.84337811e-01],
[ -2.04441652e+00,  -6.84956426e-01],
[ -2.18867576e+00,   4.48629048e-01],
[ -2.43595220e+00,   8.82169415e-01],
[ -2.17054720e+00,  -2.92726955e-01],
[ -2.28652724e+00,  -4.67991716e-01],
[ -1.87170722e+00,   2.32769161e+00],
[ -2.55783442e+00,   4.53816380e-01],
[ -1.96427929e+00,  -4.97391640e-01],
[ -2.13337283e+00,  -1.17143211e+00],
[ -2.07535759e+00,   6.91917347e-01],
[ -2.38125822e+00,  -1.15063259e+00],
[ -2.39819169e+00,   3.62390765e-01],
[ -2.22678121e+00,  -1.02548255e+00],
[ -2.20595417e+00,  -3.22378453e-02],
[  1.10399365e+00,  -8.63112446e-01],
[  7.32481440e-01,  -5.98635573e-01],
[  1.24210951e+00,  -6.14822450e-01],
[  3.97307283e-01,   1.75816895e+00],
[  1.07259395e+00,   2.11757903e-01],
[  3.84458146e-01,   5.91062469e-01],
[  7.48715076e-01,  -7.78698611e-01],
[ -4.97863388e-01,   1.84886877e+00],
[  9.26222368e-01,  -3.03308268e-02],
[  4.96802558e-03,   1.02940111e+00],
[ -1.24697461e-01,   2.65806268e+00],
[  4.38730118e-01,   5.88812850e-02],
[  5.51633981e-01,   1.77258156e+00],
[  7.17165066e-01,   1.85434315e-01],
[ -3.72583830e-02,   4.32795099e-01],
[  8.75890536e-01,  -5.09998151e-01],
[  3.48006402e-01,   1.90621647e-01],
[  1.53392545e-01,   7.90725456e-01],
[  1.21530321e+00,   1.63335564e+00],
[  1.56941176e-01,   1.30310327e+00],
[  7.38256104e-01,  -4.02470382e-01],
[  4.72369682e-01,   4.16608222e-01],
[  1.22798821e+00,   9.40914793e-01],
[  6.29381045e-01,   4.16811643e-01],
[  7.00472799e-01,   6.34939277e-02],
[  8.73536987e-01,  -2.50708611e-01],
[  1.25422219e+00,   8.26200998e-02],
[  1.35823985e+00,  -3.28820266e-01],
[  6.62126138e-01,   2.24346071e-01],
[ -4.72815133e-02,   1.05721241e+00],
[  1.21534209e-01,   1.56359238e+00],
[  1.41182261e-02,   1.57339235e+00],
[  2.36010837e-01,   7.75923784e-01],
[  1.05669143e+00,   6.36901284e-01],
[  2.21417088e-01,   2.80847693e-01],
[  4.31783161e-01,  -8.55136920e-01],
[  1.04941336e+00,  -5.22197265e-01],
[  1.03587821e+00,   1.39246648e+00],
[  6.70675999e-02,   2.12620735e-01],
[  2.75425066e-01,   1.32981591e+00],
[  2.72335066e-01,   1.11944152e+00],
[  6.23170540e-01,  -2.75426333e-02],
[  3.30005364e-01,   9.88900732e-01],
[ -3.73627623e-01,   2.01793227e+00],
[  2.82944343e-01,   8.53950717e-01],
[  8.90531103e-02,   1.74908548e-01],
[  2.24356783e-01,   3.80484659e-01],
[  5.73883486e-01,   1.53719974e-01],
[ -4.57012873e-01,   1.53946451e+00],
[  2.52244473e-01,   5.95860746e-01],
[  1.84767259e+00,  -8.71696662e-01],
[  1.15318981e+00,   7.01326114e-01],
[  2.20634950e+00,  -5.54470105e-01],
[  1.43868540e+00,   5.00105223e-02],
[  1.86789070e+00,  -2.91192802e-01],
[  2.75419671e+00,  -7.88432206e-01],
[  3.58374475e-01,   1.56009458e+00],
[  2.30300590e+00,  -4.09516695e-01],
[  2.00173530e+00,   7.23865359e-01],
[  2.26755460e+00,  -1.92144299e+00],
[  1.36590943e+00,  -6.93948040e-01],
[  1.59906459e+00,   4.28248836e-01],
[  1.88425185e+00,  -4.14332758e-01],
[  1.25308651e+00,   1.16739134e+00],
[  1.46406152e+00,   4.44147569e-01],
[  1.59180930e+00,  -6.77035372e-01],
[  1.47128019e+00,  -2.53192472e-01],
[  2.43737848e+00,  -2.55675734e+00],
[  3.30914118e+00,   2.36132010e-03],
[  1.25398099e+00,   1.71758384e+00],
[  2.04049626e+00,  -9.07398765e-01],
[  9.73915114e-01,   5.71174376e-01],
[  2.89806444e+00,  -3.97791359e-01],
[  1.32919369e+00,   4.86760542e-01],
[  1.70424071e+00,  -1.01414842e+00],
[  1.95772766e+00,  -1.00333452e+00],
[  1.17190451e+00,   3.18896617e-01],
[  1.01978105e+00,  -6.55429631e-02],
[  1.78600886e+00,   1.93272800e-01],
[  1.86477791e+00,  -5.55381532e-01],
[  2.43549739e+00,  -2.46654468e-01],
[  2.31608241e+00,  -2.62618387e+00],
[  1.86037143e+00,   1.84672394e-01],
[  1.11127173e+00,   2.95986102e-01],
[  1.19746916e+00,   8.17167742e-01],
[  2.80094940e+00,  -8.44748194e-01],
[  1.58015525e+00,  -1.07247450e+00],
[  1.34704442e+00,  -4.22255966e-01],
[  9.23432978e-01,  -1.92303705e-02],
[  1.85355198e+00,  -6.72422729e-01],
[  2.01615720e+00,  -6.10397038e-01],
[  1.90311686e+00,  -6.86024832e-01],
[  1.15318981e+00,   7.01326114e-01],
[  2.04330844e+00,  -8.64684880e-01],
[  2.00169097e+00,  -1.04855005e+00],
[  1.87052207e+00,  -3.82821838e-01],
[  1.55849189e+00,   9.05313601e-01],
[  1.52084506e+00,  -2.66794575e-01],
[  1.37639119e+00,  -1.01636193e+00],
[  9.59298576e-01,   2.22839447e-02]])
In :
# Fraction of the total variance explained by each component
# (the normalized eigenvalues of the covariance matrix)
# Here the first and second components explain about 73% and 23% respectively
pca.explained_variance_ratio_

Out:
array([ 0.72770452,  0.23030523])
In :
# Access components (one row of pca.components_ per principal component)
pc_1 = pca.components_[0]
print(pc_1)
pc_2 = pca.components_[1]
print(pc_2)

[ 0.52237162 -0.26335492  0.58125401  0.56561105]
[-0.37231836 -0.92555649 -0.02109478 -0.06541577]
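To make the role of `components_` concrete, the sketch below refits the same 2-component PCA on standardized Iris data and checks two properties: the rows are orthonormal, and `transform` is the centered data multiplied by the transposed components (this holds when `whiten` is left at its default of `False`).

```python
import numpy as np
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(datasets.load_iris().data)
pca = PCA(n_components=2).fit(X_std)

W = pca.components_          # shape (2, 4): one row per principal component
gram = W @ W.T               # identity matrix if the rows are orthonormal
print(np.round(gram, 6))

# Project by hand: subtract the fitted mean, then multiply by the components
manual = (X_std - pca.mean_) @ W.T
print(np.allclose(manual, pca.transform(X_std)))
```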


PCA for Facial Recognition

• Facial recognition is good for PCA because
• Pictures of faces generally have high input dimensionality
• Many pixels
• Faces have general patterns that could be captured in smaller number of dimensions
• A pair of eyes
• Mouth
• Chin

Scikit-learn: PCA for Facial Recognition

• We will be reducing 1850 input features (pixels) to 150 PCs
In :
from time import time
import logging
import matplotlib.pyplot as pl
# train_test_split and GridSearchCV now live in sklearn.model_selection
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.datasets import fetch_lfw_people
from sklearn.metrics import classification_report, confusion_matrix, f1_score
# RandomizedPCA was removed from scikit-learn; PCA(svd_solver='randomized') replaces it
from sklearn.decomposition import PCA
from sklearn.svm import SVC

In :
# Data of famous people's faces
faces = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
X = faces.data
y = faces.target
target_names = faces.target_names
n_classes = target_names.shape[0]

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In :
# Introspect the images arrays to find the shapes (for plotting)
n_samples, h, w = faces.images.shape

# For machine learning we use the data directly (as relative pixel
# position info is ignored by this model)
X = faces.data
n_features = X.shape[1]

# the label to predict is the id of the person
y = faces.target
target_names = faces.target_names
n_classes = target_names.shape[0]

print("Total dataset size:")
print("n_samples: %d" % n_samples)
print("n_features: %d" % n_features)
print("n_classes: %d" % n_classes)

Total dataset size:
n_samples: 1288
n_features: 1850
n_classes: 7

In :
# Compute a PCA (eigenfaces) on the face dataset
n_components = 150

print("Extracting the top {} eigenfaces from {} faces".format(n_components, X_train.shape[0]))
# PCA with the randomized SVD solver replaces the removed RandomizedPCA class
pca = PCA(n_components=n_components, svd_solver='randomized', whiten=True).fit(X_train)

# eigenfaces: the principal components, reshaped back into images
# We've gone from 1850 features to 150
eigenfaces = pca.components_.reshape((n_components, h, w))

# Transform data into principal components representation
print("Projecting the input data on the eigenfaces orthonormal basis")
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

Extracting the top 150 eigenfaces from 966 faces
Projecting the input data on the eigenfaces orthonormal basis

In :
# Train an SVM classification model

print("Fitting the classifier to the training set")

param_grid = {
'C': [1e3, 5e3, 5e4, 1e5],
'gamma':[0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1]
}

# Instantiate model
svm = SVC(kernel='rbf', class_weight='balanced', random_state=42)

# GridSearch
clf = GridSearchCV(svm, param_grid)
clf.fit(X_train_pca, y_train)
print(clf.best_estimator_)

Fitting the classifier to the training set
SVC(C=1000.0, cache_size=200, class_weight='balanced', coef0=0.0,
decision_function_shape=None, degree=3, gamma=0.001, kernel='rbf',
max_iter=-1, probability=False, random_state=42, shrinking=True,
tol=0.001, verbose=False)

In :
# Quantitative evaluation of the model quality on the test set
print("Predicting the people names on the test set")
y_pred = clf.predict(X_test_pca)

Predicting the people names on the test set

In :
# Classification report and confusion matrix
print(classification_report(y_test, y_pred, target_names=target_names))

                   precision    recall  f1-score   support

Ariel Sharon       0.47      0.54      0.50        13
Colin Powell       0.72      0.85      0.78        60
Donald Rumsfeld       0.67      0.52      0.58        27
George W Bush       0.84      0.86      0.85       146
Gerhard Schroeder       0.80      0.80      0.80        25
Hugo Chavez       0.88      0.47      0.61        15
Tony Blair       0.76      0.69      0.72        36

avg / total       0.78      0.77      0.77       322


In :
# F1-score average
# The F1 score is the harmonic mean of precision and recall
# It reaches its best value at 1 and its worst at 0
# Precision and recall contribute equally to the F1 score
# The formula for the F1 score is:
# F1 = 2 * (precision * recall) / (precision + recall)

print(f1_score(y_test, y_pred, average='weighted'))

0.769918515887

In :
# Confusion Matrix
print(confusion_matrix(y_test, y_pred, labels=range(n_classes)))

[[  7   2   3   0   1   0   0]
[  2  51   1   3   0   1   2]
[  3   2  14   8   0   0   0]
[  3  12   2 125   1   0   3]
[  0   1   0   3  20   0   1]
[  0   2   0   3   1   7   2]
[  0   1   1   7   2   0  25]]

In :
# Qualitative evaluation of the predictions using matplotlib
%matplotlib inline

def plot_gallery(images, titles, h, w, n_row=3, n_col=4):
    """Helper function to plot a gallery of portraits"""
    pl.figure(figsize=(1.8 * n_col, 2.4 * n_row))
    for i in range(n_row * n_col):
        pl.subplot(n_row, n_col, i + 1)
        pl.imshow(images[i].reshape((h, w)), cmap=pl.cm.gray)
        pl.title(titles[i], size=12)
        pl.xticks(())
        pl.yticks(())

# plot the result of the prediction on a portion of the test set

def title(y_pred, y_test, target_names, i):
    pred_name = target_names[y_pred[i]].rsplit(' ', 1)[-1]
    true_name = target_names[y_test[i]].rsplit(' ', 1)[-1]
    return 'predicted: %s\ntrue:      %s' % (pred_name, true_name)

prediction_titles = [title(y_pred, y_test, target_names, i)
                     for i in range(y_pred.shape[0])]

plot_gallery(X_test, prediction_titles, h, w)

# plot the gallery of the most significant eigenfaces

eigenface_titles = ["eigenface %d" % i for i in range(eigenfaces.shape[0])]
plot_gallery(eigenfaces, eigenface_titles, h, w)

pl.show()

Variance explained by each principal component

In :
# Variance explained by first component: 0.19346474
# Variance explained by second component: 0.15116931
pca.explained_variance_ratio_

Out:
array([ 0.19346474,  0.15116931,  0.07083688,  0.05952028,  0.05157574,
0.02887213,  0.02514474,  0.02176463,  0.0201937 ,  0.01902118,
0.01682174,  0.01580626,  0.01223351,  0.01087937,  0.01064428,
0.00979671,  0.00892415,  0.00854861,  0.00835728,  0.00722645,
0.0069658 ,  0.00653871,  0.00639547,  0.0056132 ,  0.00531102,
0.00520167,  0.00507469,  0.00484211,  0.00443586,  0.0041782 ,
0.00393684,  0.00381711,  0.00356077,  0.00351197,  0.00334554,
0.00329936,  0.00314637,  0.00296207,  0.00290131,  0.00284712,
0.00279984,  0.00267544,  0.00259903,  0.00258378,  0.00240921,
0.00238993,  0.0023542 ,  0.00222581,  0.00217505,  0.00216559,
0.00209064,  0.00205428,  0.00200421,  0.00197374,  0.00193836,
0.00188752,  0.00180161,  0.00178887,  0.00174822,  0.00173048,
0.00165642,  0.00162942,  0.00157415,  0.00153418,  0.00149966,
0.00147248,  0.00143912,  0.0014187 ,  0.00139686,  0.00138139,
0.00134001,  0.00133159,  0.00128796,  0.00125587,  0.00124233,
0.00121846,  0.00120938,  0.00118284,  0.00115081,  0.00113661,
0.00112612,  0.0011161 ,  0.00109366,  0.0010715 ,  0.00105619,
0.00104305,  0.00102377,  0.00101666,  0.00099757,  0.00096327,
0.00094085,  0.00091937,  0.00091261,  0.00089172,  0.00087085,
0.00086153,  0.00084233,  0.00083839,  0.0008275 ,  0.00080058,
0.00078658,  0.00077967,  0.00075576,  0.00074998,  0.0007463 ,
0.00073166,  0.00073109,  0.00071453,  0.00070159,  0.0006959 ,
0.0006665 ,  0.00066036,  0.00065491,  0.00063732,  0.00063183,
0.0006231 ,  0.00061431,  0.00060845,  0.00059761,  0.00059021,
0.00057756,  0.00056924,  0.00056306,  0.00055701,  0.00054258,
0.00054046,  0.000527  ,  0.00052035,  0.00050859,  0.0005063 ,
0.00050386,  0.00048574,  0.00048083,  0.0004726 ,  0.0004707 ,
0.00046785,  0.00045598,  0.00044888,  0.00043977,  0.00043656,
0.00043214,  0.00042051,  0.00041984,  0.00041423,  0.0004064 ,
0.00040368,  0.00039458,  0.00038438,  0.00037871,  0.00037464])
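A cumulative sum of these ratios shows how many PCs are needed to reach a chosen share of the variance. The sketch below assumes the bundled digits dataset (64 pixel features) as a stand-in for the faces data, and a 95% threshold picked only for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data  # 1797 samples, 64 pixel features

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the threshold
threshold = 0.95
k = int(np.searchsorted(cumulative, threshold) + 1)
print(k, cumulative[k - 1])
```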

F1 score variation as we change the number of principal components

• How do we select how many PCs to use?
• Train on different number of PCs
• See how accuracy responds
• Cut off when it becomes apparent that adding more PCs does not buy you much more discrimination
• DO NOT select features before performing PCA
• As you add more PCs
• It should give you additional signal to improve performance
• But it is also possible that we end up with greater complexity resulting in overfitting
In :
PC = [10, 15, 25, 50, 100, 250]
scores = []

for i in PC:
    # Loop through number of components
    n_components = i

    # Instantiate (PCA with the randomized solver replaces the removed RandomizedPCA)
    pca = PCA(n_components=n_components, svd_solver='randomized', whiten=True).fit(X_train)

    # Redefine training data
    X_train_pca = pca.transform(X_train)
    X_test_pca = pca.transform(X_test)

    # Set param_grid
    param_grid = {
        'C': [1e3, 5e3, 5e4, 1e5],
        'gamma': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.1]
    }

    # Instantiate model
    svm = SVC(kernel='rbf', class_weight='balanced', random_state=42)

    # GridSearch
    clf = GridSearchCV(svm, param_grid, n_jobs=-1)
    clf.fit(X_train_pca, y_train)
    # clf.best_estimator_

    # Predict
    y_pred = clf.predict(X_test_pca)

    # Score
    score = f1_score(y_test, y_pred, average='weighted')

    # Append score to list
    scores.append(score)

print(scores)

[0.47220748491808795, 0.65382531118443243, 0.73389997057170009, 0.80026156929718439, 0.84753652266061896, 0.80070914894695944]

In :
# Zip the data to compare
list(zip(scores, PC))

# As you can see, the F1 score rises as the number of PCs grows
# However, it starts to decrease when you have too many PCs
# You can see the decrease in f1_score from PCs=100 to PCs=250

Out:
[(0.47220748491808795, 10),
(0.65382531118443243, 15),
(0.73389997057170009, 25),
(0.80026156929718439, 50),
(0.84753652266061896, 100),
(0.80070914894695944, 250)]
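An alternative sketch for the same search: put PCA and the SVM in a `Pipeline` and let `GridSearchCV` pick the number of PCs by cross-validation on the training data, so the test set is only used once at the end. The bundled digits dataset, the three candidate values and `cv=3` are assumptions made to keep the example small and offline.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# PCA is refit inside every cross-validation fold, followed by the classifier
pipe = Pipeline([
    ("pca", PCA(whiten=True, random_state=42)),
    ("svm", SVC(kernel="rbf", class_weight="balanced")),
])

# Candidate numbers of principal components (illustrative values)
param_grid = {"pca__n_components": [5, 20, 40]}

search = GridSearchCV(pipe, param_grid, scoring="f1_weighted", cv=3)
search.fit(X_train, y_train)

best_n = search.best_params_["pca__n_components"]
test_f1 = search.score(X_test, y_test)  # weighted F1 on the held-out test set
print(best_n, test_f1)
```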

Algorithm 2
Independent Components Analysis (ICA)

• PCA
• Minimizing correlation by maximizing variance
• Means that a relatively small set of primitive constructs can be combined in a relatively small number of ways to build the control and data structures of the language
• Finding "common characteristics"
• Eigenfaces problem
• ICA
• Finding independence by converting (through a linear transformation) your input features into a new feature space such that
• New features are independent of one another
• Cocktail party problem
• Mixed up sounds from each of the microphones
• Once you use an ICA algorithm, you can split the 3 microphone recordings, each of which is a mix of everything, into 3 independent features, one per original source
• Police car
• Foreign language
• News
Property PCA ICA
Mutually Orthogonal
Mutually Independent
Maximal Variance
Maximal Mutual Information
Ordered Features
Bag of Features
Blind Source Separation Problem
Directional

ICA with Scikit-learn: Blind Source Separation
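A minimal sketch of the cocktail party setup with `FastICA`, assuming three synthetic sources (sine, square and sawtooth waves) and a made-up mixing matrix standing in for the three microphones.

```python
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)

# Three independent, non-Gaussian source signals (synthetic stand-ins)
S = np.column_stack([
    np.sin(2 * t),           # sine wave
    np.sign(np.sin(3 * t)),  # square wave
    2 * (t % 1) - 1,         # sawtooth wave
])

# Hypothetical mixing matrix: each microphone hears a blend of all sources
A = np.array([[1.0, 1.0, 1.0],
              [0.5, 2.0, 1.0],
              [1.5, 1.0, 2.0]])
X = S @ A.T  # observed microphone recordings, shape (2000, 3)

ica = FastICA(n_components=3, random_state=0)
S_est = ica.fit_transform(X)  # estimated independent components

# Absolute correlation between each true source and each recovered component
corr = np.abs(np.corrcoef(S.T, S_est.T)[:3, 3:])
print(np.round(corr.max(axis=1), 2))
```

ICA recovers sources only up to order, sign and scale, which is why the check uses absolute correlations rather than comparing the arrays directly.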

Other alternatives

1. Random Components Analysis (RCA)
• Generates random directions (projections)
• Advantage over PCA and ICA
• Fast
• Simple
2. Linear Discriminant Analysis (LDA)
• Finds a projection that discriminates based on the label
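Both alternatives have scikit-learn counterparts. The sketch below assumes `GaussianRandomProjection` as a stand-in for RCA and `LinearDiscriminantAnalysis` for LDA, applied to the Iris data.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.random_projection import GaussianRandomProjection

X, y = load_iris(return_X_y=True)

# Random projection: directions are drawn at random, labels are not used
X_rp = GaussianRandomProjection(n_components=2, random_state=0).fit_transform(X)

# LDA: supervised, so it needs y; at most (number of classes - 1) components
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

print(X_rp.shape, X_lda.shape)
```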