One-Hot Encoding in Scikit-learn

Convert categorical data into numerical data automatically

Intuition

  1. You will prepare your categorical data using LabelEncoder()
  2. You will apply OneHotEncoder() to the new DataFrame produced in step 1
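Before working with the Titanic data below, here is a minimal sketch of what step 1 does on a toy column (the color values are a hypothetical example, not part of the dataset): LabelEncoder learns the sorted set of distinct values and replaces each value with its integer index.

```python
from sklearn import preprocessing

# a toy categorical column (hypothetical example)
colors = ['red', 'green', 'blue', 'green']

# LabelEncoder maps each distinct string to an integer in [0, n_classes)
le = preprocessing.LabelEncoder()
encoded = le.fit_transform(colors)

print(list(le.classes_))  # classes are sorted alphabetically: ['blue', 'green', 'red']
print(list(encoded))      # [2, 1, 0, 1]
```

Note that the integers carry no ordinal meaning ('red' is not "greater than" 'blue'), which is exactly why step 2, one-hot encoding, is needed for most estimators.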
In [2]:
# import 
import numpy as np
import pandas as pd
In [4]:
# load dataset
X = pd.read_csv('titanic_data.csv')
X.head(3)
Out[4]:
   PassengerId  Survived  Pclass  Name                                                Sex     Age  SibSp  Parch  Ticket            Fare     Cabin  Embarked
0            1         0       3  Braund, Mr. Owen Harris                             male     22      1      0  A/5 21171          7.2500  NaN    S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female   38      1      0  PC 17599          71.2833  C85    C
2            3         1       3  Heikkinen, Miss. Laina                              female   26      0      0  STON/O2. 3101282   7.9250  NaN    S
In [6]:
# limit to categorical data using df.select_dtypes()
X = X.select_dtypes(include=[object])
X.head(3)
Out[6]:
   Name                                                Sex     Ticket            Cabin  Embarked
0  Braund, Mr. Owen Harris                             male    A/5 21171         NaN    S
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  PC 17599          C85    C
2  Heikkinen, Miss. Laina                              female  STON/O2. 3101282  NaN    S
In [49]:
# check original shape
X.shape
Out[49]:
(891, 5)
In [8]:
# import preprocessing from sklearn
from sklearn import preprocessing
In [21]:
# view columns using df.columns
X.columns
Out[21]:
Index([u'Name', u'Sex', u'Ticket', u'Cabin', u'Embarked'], dtype='object')
In [31]:
# TODO: create a LabelEncoder object and fit it to each feature in X


# 1. INSTANTIATE
# encode labels with value between 0 and n_classes-1.
le = preprocessing.LabelEncoder()


# 2/3. FIT AND TRANSFORM
# use df.apply() to apply le.fit_transform to all columns
X_2 = X.apply(le.fit_transform)
X_2.head()
Out[31]:
   Name  Sex  Ticket  Cabin  Embarked
0   108    1     523      0         3
1   190    0     596     82         1
2   353    0     669      0         3
3   272    0      49     56         3
4    15    1     472      0         3

OneHotEncoder

  • Encode categorical integer features using a one-hot (aka one-of-K) scheme.
  • The input to this transformer should be a matrix of integers, denoting the values taken on by categorical (discrete) features.
  • The output will be a sparse matrix where each column corresponds to one possible value of one feature.
  • It is assumed that input features take on values in the range [0, n_values).
  • This encoding is needed for feeding categorical data to many scikit-learn estimators, notably linear models and SVMs with the standard kernels.
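The bullet points above can be sketched on a tiny integer matrix before applying the encoder to the full Titanic frame (the toy values here are illustrative, not from the dataset): each input column expands into one output column per distinct value it takes.

```python
import numpy as np
from sklearn import preprocessing

# two integer-encoded features: the first takes values {0, 1},
# the second takes values {0, 1, 2}
X_int = np.array([[0, 2],
                  [1, 0],
                  [0, 1]])

enc = preprocessing.OneHotEncoder()
dense = enc.fit_transform(X_int).toarray()  # transform returns a sparse matrix by default

print(dense.shape)  # (3, 5): 2 columns for the first feature + 3 for the second
```

Each row of the result contains exactly one 1 per original feature, so every row sums to the number of input features (here, 2).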
In [50]:
# TODO: create a OneHotEncoder object, and fit it to all of X

# 1. INSTANTIATE
enc = preprocessing.OneHotEncoder()

# 2. FIT
enc.fit(X_2)

# 3. Transform
onehotlabels = enc.transform(X_2).toarray()
onehotlabels.shape

# as you can see, we still have the same 891 rows,
# but far more columns, since each distinct categorical value
# now gets its own binary column
Out[50]:
(891, 1726)
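Where do the 1,726 columns come from? The one-hot width is the total number of distinct values summed across the categorical features. A small self-contained sanity check (the toy frame below is a hypothetical stand-in for the integer-encoded X_2, not the real data):

```python
import pandas as pd
from sklearn import preprocessing

# toy stand-in for an integer-encoded frame like X_2 (hypothetical values)
df = pd.DataFrame({'Sex': [1, 0, 0, 1],        # 2 distinct values
                   'Embarked': [3, 1, 3, 0]})  # 3 distinct values

enc = preprocessing.OneHotEncoder()
wide = enc.fit_transform(df).toarray()

# output width equals the sum of per-column cardinalities: 2 + 3 = 5
print(df.nunique().sum(), wide.shape[1])
```

Applying the same check to X_2 in the notebook would show its per-column cardinalities summing to 1,726.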
In [47]:
onehotlabels
Out[47]:
array([[ 0.,  0.,  0., ...,  0.,  0.,  1.],
       [ 0.,  0.,  0., ...,  1.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  1.],
       ..., 
       [ 0.,  0.,  0., ...,  0.,  0.,  1.],
       [ 0.,  0.,  0., ...,  1.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  1.,  0.]])
In [48]:
type(onehotlabels)
Out[48]:
numpy.ndarray