Using pandas with scikit-learn effectively

This introduction to pandas is derived from Data School's pandas Q&A with my own notes and code.

Using pandas with scikit-learn to create Kaggle submissions

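Kaggle is a popular platform for machine learning competitions. The example here uses the Titanic passenger data: fit a model on the labeled training set, predict survival for the unlabeled test set, and write the predictions to a CSV file for submission.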

In [1]:
import pandas as pd
In [3]:
url = ''
train = pd.read_csv(url)
In [4]:
train.head()
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S

Create X features (DataFrame)

In [5]:
# Pclass: passenger class
# Parch: parents and children
feature_cols = ['Pclass', 'Parch']
In [6]:
# select all rows (:) and only the columns listed in feature_cols
X = train.loc[:, feature_cols]
In [7]:
X.shape
(891, 2)
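
As a minimal sketch of the selection above, here is the same `.loc` pattern on a small made-up DataFrame. The `toy` frame and its values are illustrative assumptions, not the actual Titanic data.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic training set
toy = pd.DataFrame({
    'Pclass': [3, 1, 3, 1],
    'Parch': [0, 0, 1, 2],
    'Fare': [7.25, 71.28, 7.93, 53.10],
    'Survived': [0, 1, 1, 1],
})

feature_cols = ['Pclass', 'Parch']

# All rows (:), only the listed columns; the result is a DataFrame
X_toy = toy.loc[:, feature_cols]

print(type(X_toy).__name__)  # DataFrame
print(X_toy.shape)           # (4, 2)
```

Passing a list of column names keeps the result two-dimensional, which is the shape scikit-learn estimators expect for the feature matrix.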

Create y responses (Series)

In [8]:
# now we want to create our response vector
y = train.Survived
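
A short sketch of what the response vector looks like, again on made-up data (the `toy` frame is an assumption for illustration). Attribute access and bracket access return the same one-dimensional Series.

```python
import pandas as pd

# Hypothetical toy data; only the Survived column matters here
toy = pd.DataFrame({'Pclass': [3, 1, 3], 'Survived': [0, 1, 1]})

# Attribute access and bracket access select the same column
y_attr = toy.Survived
y_brkt = toy['Survived']

print(y_attr.equals(y_brkt))  # True
print(y_attr.shape)           # (3,)
```

Bracket access is the safer habit when a column name contains spaces or collides with an existing DataFrame attribute.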
In [9]:

Build scikit-learn model

In [10]:
# 1. import
from sklearn.linear_model import LogisticRegression

# 2. instantiate model
logreg = LogisticRegression()

# 3. fit
logreg.fit(X, y)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
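
The import, instantiate, fit pattern can be tried end to end on a few made-up rows. This is only a sketch under assumptions: the `toy` data is invented, and the model uses whatever defaults the installed scikit-learn ships with. The output above shows `solver='liblinear'`, which suggests an older release; newer releases use a different default solver, so fitted coefficients may differ slightly.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data with the same column names as the tutorial
toy = pd.DataFrame({
    'Pclass':   [1, 1, 2, 2, 3, 3, 3, 1],
    'Parch':    [0, 1, 0, 2, 0, 0, 1, 0],
    'Survived': [1, 1, 1, 0, 0, 0, 0, 1],
})
X_toy = toy[['Pclass', 'Parch']]  # features: DataFrame
y_toy = toy['Survived']           # response: Series

model = LogisticRegression()      # library defaults
model.fit(X_toy, y_toy)

print(model.classes_)     # class labels seen during fitting
print(model.coef_.shape)  # (1, 2): one coefficient per feature
```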
In [11]:
url_test = ''
test = pd.read_csv(url_test)
In [13]:
# missing Survived column because we are predicting
test.head()
   PassengerId  Pclass  Name                                          Sex     Age   SibSp  Parch  Ticket   Fare     Cabin  Embarked
0  892          3       Kelly, Mr. James                              male    34.5  0      0      330911   7.8292   NaN    Q
1  893          3       Wilkes, Mrs. James (Ellen Needs)              female  47.0  1      0      363272   7.0000   NaN    S
2  894          2       Myles, Mr. Thomas Francis                     male    62.0  0      0      240276   9.6875   NaN    Q
3  895          3       Wirz, Mr. Albert                              male    27.0  0      0      315154   8.6625   NaN    S
4  896          3       Hirvonen, Mrs. Alexander (Helga E Lindqvist)  female  22.0  1      1      3101298  12.2875  NaN    S

X_new features for test data

In [14]:
X_new = test.loc[:, feature_cols]
In [15]:
X_new.shape
(418, 2)
In [17]:
# 4. predict
new_pred_class = logreg.predict(X_new)
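
A sketch of the predict step on made-up data, showing why the test features are selected with the same `feature_cols` list: the model expects the same columns, in the same order, that it was fitted on. The `train_toy` and `test_toy` frames are assumptions for illustration.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

feature_cols = ['Pclass', 'Parch']

# Hypothetical training data
train_toy = pd.DataFrame({
    'Pclass':   [1, 1, 2, 2, 3, 3, 3, 1],
    'Parch':    [0, 1, 0, 2, 0, 0, 1, 0],
    'Survived': [1, 1, 1, 0, 0, 0, 0, 1],
})

# Hypothetical test data: no Survived column, columns in a different order
test_toy = pd.DataFrame({
    'PassengerId': [101, 102, 103],
    'Parch':  [0, 1, 0],
    'Pclass': [3, 1, 2],
})

model = LogisticRegression().fit(train_toy[feature_cols], train_toy['Survived'])

# Selecting with feature_cols restores the column order used for fitting
X_new_toy = test_toy.loc[:, feature_cols]
pred = model.predict(X_new_toy)

print(type(pred).__name__)  # ndarray
print(pred.shape)           # (3,)
```

`predict` returns a NumPy array with one label per row of the input, which is why it can be paired directly with the `PassengerId` column afterwards.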

Save DataFrame to csv

In [21]:
# Kaggle wants 2 columns:
#   PassengerId
#   Survived (our new_pred_class)

# building the DataFrame from a dict places the two columns side by side;
# to ensure PassengerId is written as the first column, make it the index with .set_index
kaggle_data = pd.DataFrame({'PassengerId':test.PassengerId, 'Survived':new_pred_class}).set_index('PassengerId')
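
The write step itself would typically use `DataFrame.to_csv`. The sketch below uses made-up IDs and predictions and a hypothetical file name (`submission.csv` in a temporary directory), since the file name is not specified here.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical stand-ins for test.PassengerId and new_pred_class
passenger_ids = pd.Series([892, 893, 894])
preds = np.array([0, 1, 0])

submission = pd.DataFrame(
    {'PassengerId': passenger_ids, 'Survived': preds}
).set_index('PassengerId')

# Assumed file name; the index is written as the first CSV column
path = os.path.join(tempfile.mkdtemp(), 'submission.csv')
submission.to_csv(path)

# Read the file back to confirm the layout
check = pd.read_csv(path)
print(list(check.columns))  # ['PassengerId', 'Survived']
```

Because `PassengerId` is the index, `to_csv` writes it first without adding a separate unnamed index column.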

Save DataFrame to disk and access it

In [22]:
# save train data to disk using pickle
In [23]:
# read data
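
The two steps above can be sketched with `DataFrame.to_pickle` and `pd.read_pickle`. The file name `train.pkl`, the temporary directory, and the `toy` frame are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data standing in for the train DataFrame
toy = pd.DataFrame({
    'Pclass': [3, 1],
    'Fare': [7.25, 71.2833],
    'Name': ['Braund', 'Cumings'],
})

# Assumed file name inside a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'train.pkl')

# Save to disk; pickle preserves dtypes and the index
toy.to_pickle(path)

# Read it back
restored = pd.read_pickle(path)
print(restored.equals(toy))  # True
```

Unlike CSV, a pickle round trip keeps the DataFrame exactly as it was, but pickle files should only be loaded from sources you trust.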
