One-Hot Encoding in Scikit-learn

Convert categorical data into numerical data automatically

Intuition

  1. You will prepare your categorical data using LabelEncoder()
  2. You will apply OneHotEncoder() to the new DataFrame produced in step 1
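Before working with the Titanic data below, here is a minimal sketch of what step 1 does on a toy column (the color values are a hypothetical example, not part of the dataset): LabelEncoder learns the sorted set of distinct values and replaces each value with its integer index.

```python
from sklearn import preprocessing

# a toy categorical column (hypothetical example)
colors = ['red', 'green', 'blue', 'green']

# LabelEncoder maps each distinct string to an integer in [0, n_classes)
le = preprocessing.LabelEncoder()
encoded = le.fit_transform(colors)

print(list(le.classes_))  # classes are sorted alphabetically: ['blue', 'green', 'red']
print(list(encoded))      # [2, 1, 0, 1]
```

Note that the integers carry no ordinal meaning ('red' is not "greater than" 'blue'), which is exactly why step 2, one-hot encoding, is needed for most estimators.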
In [2]:
# import 
import numpy as np
import pandas as pd
In [4]:
# load dataset
X = pd.read_csv('titanic_data.csv')
X.head(3)
Out[4]:
   PassengerId  Survived  Pclass  Name                                                Sex     Age  SibSp  Parch  Ticket            Fare     Cabin  Embarked
0            1         0       3  Braund, Mr. Owen Harris                             male     22      1      0  A/5 21171          7.2500  NaN    S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female   38      1      0  PC 17599          71.2833  C85    C
2            3         1       3  Heikkinen, Miss. Laina                              female   26      0      0  STON/O2. 3101282   7.9250  NaN    S
In [6]:
# limit to categorical data using df.select_dtypes()
X = X.select_dtypes(include=[object])
X.head(3)
Out[6]:
   Name                                                Sex     Ticket            Cabin  Embarked
0  Braund, Mr. Owen Harris                             male    A/5 21171         NaN    S
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  PC 17599          C85    C
2  Heikkinen, Miss. Laina                              female  STON/O2. 3101282  NaN    S
In [49]:
# check original shape
X.shape
Out[49]:
(891, 5)
In [8]:
# import preprocessing from sklearn
from sklearn import preprocessing
In [21]:
# view columns using df.columns
X.columns
Out[21]:
Index([u'Name', u'Sex', u'Ticket', u'Cabin', u'Embarked'], dtype='object')
In [31]:
# TODO: create a LabelEncoder object and fit it to each feature in X


# 1. INSTANTIATE
# encode labels with value between 0 and n_classes-1.
le = preprocessing.LabelEncoder()


# 2/3. FIT AND TRANSFORM
# use df.apply() to apply le.fit_transform to all columns
X_2 = X.apply(le.fit_transform)
X_2.head()
Out[31]:
   Name  Sex  Ticket  Cabin  Embarked
0   108    1     523      0         3
1   190    0     596     82         1
2   353    0     669      0         3
3   272    0      49     56         3
4    15    1     472      0         3

OneHotEncoder

  • Encode categorical integer features using a one-hot (aka one-of-K) scheme.
  • The input to this transformer should be a matrix of integers, denoting the values taken on by categorical (discrete) features.
  • The output will be a sparse matrix where each column corresponds to one possible value of one feature.
  • It is assumed that input features take on values in the range [0, n_values).
  • This encoding is needed for feeding categorical data to many scikit-learn estimators, notably linear models and SVMs with the standard kernels.
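The bullet points above can be sketched on a tiny integer matrix before applying the encoder to the full Titanic frame (the toy values here are illustrative, not from the dataset): each input column expands into one output column per distinct value it takes.

```python
import numpy as np
from sklearn import preprocessing

# two integer-encoded features: the first takes values {0, 1},
# the second takes values {0, 1, 2}
X_int = np.array([[0, 2],
                  [1, 0],
                  [0, 1]])

enc = preprocessing.OneHotEncoder()
dense = enc.fit_transform(X_int).toarray()  # transform returns a sparse matrix by default

print(dense.shape)  # (3, 5): 2 columns for the first feature + 3 for the second
```

Each row of the result contains exactly one 1 per original feature, so every row sums to the number of input features (here, 2).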
In [50]:
# TODO: create a OneHotEncoder object, and fit it to all of X

# 1. INSTANTIATE
enc = preprocessing.OneHotEncoder()

# 2. FIT
enc.fit(X_2)

# 3. Transform
onehotlabels = enc.transform(X_2).toarray()
onehotlabels.shape

# as you can see, we still have the same 891 rows,
# but far more columns, since each distinct categorical value
# now gets its own binary column
Out[50]:
(891, 1726)
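Where do the 1,726 columns come from? The one-hot width is the total number of distinct values summed across the categorical features. A small self-contained sanity check (the toy frame below is a hypothetical stand-in for the integer-encoded X_2, not the real data):

```python
import pandas as pd
from sklearn import preprocessing

# toy stand-in for an integer-encoded frame like X_2 (hypothetical values)
df = pd.DataFrame({'Sex': [1, 0, 0, 1],        # 2 distinct values
                   'Embarked': [3, 1, 3, 0]})  # 3 distinct values

enc = preprocessing.OneHotEncoder()
wide = enc.fit_transform(df).toarray()

# output width equals the sum of per-column cardinalities: 2 + 3 = 5
print(df.nunique().sum(), wide.shape[1])
```

Applying the same check to X_2 in the notebook would show its per-column cardinalities summing to 1,726.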
In [47]:
onehotlabels
Out[47]:
array([[ 0.,  0.,  0., ...,  0.,  0.,  1.],
       [ 0.,  0.,  0., ...,  1.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  1.],
       ..., 
       [ 0.,  0.,  0., ...,  0.,  0.,  1.],
       [ 0.,  0.,  0., ...,  1.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  1.,  0.]])
In [48]:
type(onehotlabels)
Out[48]:
numpy.ndarray