Identifying Customer Segments (Unsupervised Learning)

Project 3: Creating Customer Segments

Unsupervised Learning

Machine Learning Engineer Nanodegree

This notebook contains extensive answers and tips that go beyond what was taught and what is required, but the extra material is very useful for future projects. Feel free to fork my repository on GitHub.

Getting Started

In this project, you will analyze a dataset containing data on various customers' annual spending amounts (reported in monetary units) across diverse product categories, in order to uncover internal structure. One goal of this project is to best describe the variation in the different types of customers that a wholesale distributor interacts with. Doing so would equip the distributor with insight into how to best structure their delivery service to meet the needs of each customer.

The dataset for this project can be found on the UCI Machine Learning Repository. For the purposes of this project, the features 'Channel' and 'Region' will be excluded from the analysis, with the focus instead on the six product categories recorded for customers.

Run the code block below to load the wholesale customers dataset, along with a few of the necessary Python libraries required for this project. You will know the dataset loaded successfully if the size of the dataset is reported.

In [1]:
# Import libraries necessary for this project
import numpy as np
import pandas as pd
import renders as rs
from IPython.display import display # Allows the use of display() for DataFrames

# Show matplotlib plots inline (nicely formatted in the notebook)
%matplotlib inline

# Load the wholesale customers dataset
try:
    data = pd.read_csv("customers.csv")
    data.drop(['Region', 'Channel'], axis = 1, inplace = True)
    print "Wholesale customers dataset has {} samples with {} features each.".format(*data.shape)
except:
    print "Dataset could not be loaded. Is the dataset missing?"
Wholesale customers dataset has 440 samples with 6 features each.

Data Exploration

In this section, you will begin exploring the data through visualizations and code to understand how each feature is related to the others. You will observe a statistical description of the dataset, consider the relevance of each feature, and select a few sample data points from the dataset which you will track through the course of this project.

Run the code block below to observe a statistical description of the dataset. Note that the dataset is composed of six important product categories: 'Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', and 'Delicatessen'. Consider what each category represents in terms of products you could purchase.

Description of Categories

  • FRESH: annual spending (m.u.) on fresh products (Continuous)
  • MILK: annual spending (m.u.) on milk products (Continuous)
  • GROCERY: annual spending (m.u.) on grocery products (Continuous)
  • FROZEN: annual spending (m.u.) on frozen products (Continuous)
  • DETERGENTS_PAPER: annual spending (m.u.) on detergents and paper products (Continuous)
  • DELICATESSEN: annual spending (m.u.) on delicatessen products (Continuous)
    • "A store selling cold cuts, cheeses, and a variety of salads, as well as a selection of unusual or foreign prepared foods."
In [3]:
# Display a description of the dataset
stats = data.describe()
stats
Out[3]:
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
count 440.000000 440.000000 440.000000 440.000000 440.000000 440.000000
mean 12000.297727 5796.265909 7951.277273 3071.931818 2881.493182 1524.870455
std 12647.328865 7380.377175 9503.162829 4854.673333 4767.854448 2820.105937
min 3.000000 55.000000 3.000000 25.000000 3.000000 3.000000
25% 3127.750000 1533.000000 2153.000000 742.250000 256.750000 408.250000
50% 8504.000000 3627.000000 4755.500000 1526.000000 816.500000 965.500000
75% 16933.750000 7190.250000 10655.750000 3554.250000 3922.000000 1820.250000
max 112151.000000 73498.000000 92780.000000 60869.000000 40827.000000 47943.000000

Implementation: Selecting Samples

To get a better understanding of the customers and how their data will transform through the analysis, it would be best to select a few sample data points and explore them in more detail. In the code block below, add three indices of your choice to the indices list which will represent the customers to track. It is suggested to try different sets of samples until you obtain customers that vary significantly from one another.

In [4]:
# Using data.loc to filter a pandas DataFrame
data.loc[[100, 200, 300],:]
Out[4]:
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
100 11594 7779 12144 3252 8035 3029
200 3067 13240 23127 3941 9959 731
300 16448 6243 6360 824 2662 2005
In [5]:
# Retrieve column names
# Alternative code:
# data.keys()
data.columns
Out[5]:
Index([u'Fresh', u'Milk', u'Grocery', u'Frozen', u'Detergents_Paper',
       u'Delicatessen'],
      dtype='object')

Logic in selecting the 3 samples: Quartiles

  • As seen previously (in the object "stats"), we have the first and third quartiles for each feature.
  • We can filter out samples that are starkly different from one another based on these quartiles; the quartile values can also be read directly from "stats", as sketched below.
    • This way we can pick, for example, one establishment below the first quartile and one above the third quartile of the Frozen category.
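The quartile cut-offs in the cells below are typed in by hand from the table above; they could equally be read straight from the stats DataFrame computed earlier. A minimal sketch of that alternative (variable names are illustrative):

# Look up the first and third quartiles directly from the describe() output
fresh_q1 = stats.loc['25%', 'Fresh']
frozen_q1 = stats.loc['25%', 'Frozen']
frozen_q3 = stats.loc['75%', 'Frozen']

# Example: establishments with unusually low spending on Fresh products
display(data.loc[data.Fresh < fresh_q1, :].head())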
In [6]:
# Fresh filter
fresh_q1 = 3127.750000
display(data.loc[data.Fresh < fresh_q1, :].head())
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
16 1020 8816 12121 134 4508 1080
31 2612 4339 3133 2088 820 985
34 1502 1979 2262 425 483 395
35 688 5491 11091 833 4239 436
43 630 11095 23998 787 9529 72
In [7]:
# Frozen filter
frozen_q1 = 742.250000
display(data.loc[data.Frozen < frozen_q1, :].head())
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
0 12669 9656 7561 214 2674 1338
5 9413 8259 5126 666 1795 1451
6 12126 3199 6975 480 3140 545
8 5963 3648 6192 425 1716 750
12 31714 12319 11757 287 3881 2931
In [8]:
# Frozen
frozen_q3 = 3554.250000
display(data.loc[data.Frozen > frozen_q3, :].head(7))
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
3 13265 1196 4221 6404 507 1788
4 22615 5410 7198 3915 1777 5185
10 3366 5403 12974 4400 5977 1744
22 31276 1917 4469 9408 2381 4334
23 26373 36423 22019 5154 4337 16523
33 29729 4786 7326 6130 361 1083
39 56159 555 902 10002 212 2916

Hence we'll be choosing:

  • 43: Very low "Fresh" and very high "Grocery"
  • 12: Very low "Frozen" and very high "Fresh"
  • 39: Very high "Frozen" and very low "Detergents_Paper"
In [9]:
# TODO: Select three indices of your choice you wish to sample from the dataset
indices = [43, 12, 39]

# Create a DataFrame of the chosen samples
# .reset_index(drop = True) resets the index to 0, 1 and 2 instead of 43, 12 and 39
samples = pd.DataFrame(data.loc[indices], columns = data.columns).reset_index(drop = True)
print "Chosen samples of wholesale customers dataset:"
display(samples)
Chosen samples of wholesale customers dataset:
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
0 630 11095 23998 787 9529 72
1 31714 12319 11757 287 3881 2931
2 56159 555 902 10002 212 2916

Comparison of Samples and Means

In [58]:
# Import Seaborn, a very powerful library for Data Visualisation
import seaborn as sns

# Get the means 
mean_data = data.describe().loc['mean', :]

# Append means to the samples' data
samples_bar = samples.append(mean_data)

# Construct indices
samples_bar.index = indices + ['mean']

# Plot bar plot
samples_bar.plot(kind='bar', figsize=(14,8))
Out[58]:
<matplotlib.axes._subplots.AxesSubplot at 0x1100010d0>

Comparing Samples' Percentiles

In [61]:
# First, calculate the percentile ranks of the whole dataset.
percentiles = data.rank(pct=True)

# Then round to three decimals and multiply by 100 to express the ranks as percentages
percentiles = 100*percentiles.round(decimals=3)

# Select the indices you chose from the percentiles dataframe
percentiles = percentiles.iloc[indices]

# Now, create the heat map using the seaborn library
sns.heatmap(percentiles, vmin=1, vmax=99, annot=True)
Out[61]:
<matplotlib.axes._subplots.AxesSubplot at 0x114e9cf50>

Question 1

Consider the total purchase cost of each product category and the statistical description of the dataset above for your sample customers.
What kind of establishment (customer) could each of the three samples you've chosen represent?
Hint: Examples of establishments include places like markets, cafes, and retailers, among many others. Avoid using names for establishments, such as saying "McDonalds" when describing a sample customer as a restaurant.

Answer:

  • Index 0: Coffee Cafe
    • Low spending on "Fresh", "Frozen" and "Delicatessen".
    • Majority of spending on "Grocery", "Milk" and "Detergents_Paper".
      • With some minor spending on "Delicatessen", it may be a cafe establishment serving drinks, coffee perhaps, with some ready-made food as a complementary product.
  • Index 1: Upscale Restaurant
    • Low spending on "Frozen".
    • Majority of spending is a mix of "Fresh", "Milk", and "Grocery".
      • This may be an upscale restaurant with almost no spending on frozen foods.
      • Upscale restaurants tend to rely on fresh rather than frozen ingredients.
  • Index 2: Fresh Food Retailer
    • Majority of spending is on "Fresh" goods with little spending on everything else except on "Frozen".
      • This may be a grocery store specializing in fresh foods with some frozen goods.

Implementation: Feature Relevance

One interesting thought to consider is if one (or more) of the six product categories is actually relevant for understanding customer purchasing. That is to say, is it possible to determine whether customers purchasing some amount of one category of products will necessarily purchase some proportional amount of another category of products? We can make this determination quite easily by training a supervised regression learner on a subset of the data with one feature removed, and then score how well that model can predict the removed feature.

In the code block below, you will need to implement the following:

  • Assign new_data a copy of the data by removing a feature of your choice using the DataFrame.drop function.
  • Use sklearn.cross_validation.train_test_split to split the dataset into training and testing sets.
    • Use the removed feature as your target label. Set a test_size of 0.25 and set a random_state.
  • Import a decision tree regressor, set a random_state, and fit the learner to the training data.
  • Report the prediction score of the testing set using the regressor's score function.
In [10]:
# Existing features
data.columns
Out[10]:
Index([u'Fresh', u'Milk', u'Grocery', u'Frozen', u'Detergents_Paper',
       u'Delicatessen'],
      dtype='object')
In [11]:
# Imports
from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeRegressor
In [12]:
# Create list to loop through
dep_vars = list(data.columns)


# Create loop to test each feature as a dependent variable
for var in dep_vars:

    # TODO: Make a copy of the DataFrame, using the 'drop' function to drop the given feature
    new_data = data.drop([var], axis = 1)
    # Confirm drop
    # display(new_data.head(2))

    # Create the target variable as a single-column DataFrame
    new_feature = pd.DataFrame(data.loc[:, var])
    # Confirm creation of new feature
    # display(new_feature.head(2))

    # TODO: Split the data into training and testing sets using the given feature as the target
    X_train, X_test, y_train, y_test = train_test_split(new_data, new_feature, test_size=0.25, random_state=42)

    # TODO: Create a decision tree regressor and fit it to the training set
    # Instantiate
    dtr = DecisionTreeRegressor(random_state=42)
    # Fit
    dtr.fit(X_train, y_train)

    # TODO: Report the score of the prediction using the testing set
    # Returns R^2
    score = dtr.score(X_test, y_test)
    print('R2 score for {} as dependent variable: {}'.format(var, score))
R2 score for Fresh as dependent variable: -0.385749710204
R2 score for Milk as dependent variable: 0.156275395017
R2 score for Grocery as dependent variable: 0.681884008544
R2 score for Frozen as dependent variable: -0.210135890125
R2 score for Detergents_Paper as dependent variable: 0.271666980627
R2 score for Delicatessen as dependent variable: -2.2547115372

Question 2

Which feature did you attempt to predict? What was the reported prediction score? Is this feature necessary for identifying customers' spending habits?
Hint: The coefficient of determination, R^2, is scored between 0 and 1, with 1 being a perfect fit. A negative R^2 implies the model fails to fit the data.
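For reference, the score reported by the regressor's score function is the coefficient of determination, R^2 = 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals of the predictions and SS_tot is the sum of squared deviations of the target from its mean. A score of 0 means the model does no better than always predicting the mean of the target, so a negative score means it does strictly worse than that baseline.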

Answer:

  • I used a loop and predicted every single feature as a dependent variable with the results shown above.
  • As you can see, "Fresh", "Frozen" and "Delicatessen" as dependent variables have negative R2 scores.
    • Their negative scores imply that they are necessary for identifying customers' spending habits because the remaining features cannot explain the variation in them.
  • Similarly, "Milk" and "Detergents_Paper" have very low R2 scores.
    • Their low scores also imply that they are necessary for identifying customers' spending habits.
  • However, "Grocery" has a R2 score of 0.68.
    • Now this is, in my opinion, a low score. But relative to the others it is much higher.
    • It may be not as necessary, compared to the other features, for identifying customers' spending habits.
    • We will explore this further.

Visualize Feature Distributions

To get a better understanding of the dataset, we can construct a scatter matrix of each of the six product features present in the data. If you found that the feature you attempted to predict above is relevant for identifying a specific customer, then the scatter matrix below may not show any correlation between that feature and the others. Conversely, if you believe that feature is not relevant for identifying a specific customer, the scatter matrix might show a correlation between that feature and another feature in the data. Run the code block below to produce a scatter matrix.

In [13]:
# Produce a scatter matrix for each pair of features in the data
pd.scatter_matrix(data, alpha = 0.3, figsize = (14,8), diagonal = 'kde');

Correlation Matrix

  • This is to cross-reference with the scatter matrix above to draw more accurate insights from the data.
  • The higher a cell's color sits on the color bar, the stronger the correlation between that pair of features.
In [14]:
import matplotlib.pyplot as plt
%matplotlib inline

def plot_corr(df,size=10):
    '''Function plots a graphical correlation matrix for each pair of columns in the dataframe.

    Input:
        df: pandas DataFrame
        size: vertical and horizontal size of the plot'''

    corr = df.corr()
    fig, ax = plt.subplots(figsize=(size, size))
    cax = ax.matshow(corr, interpolation='nearest')
    fig.colorbar(cax)
    plt.xticks(range(len(corr.columns)), corr.columns);
    plt.yticks(range(len(corr.columns)), corr.columns);


plot_corr(data)
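As a cross-check, the same correlation matrix can also be drawn with seaborn (imported earlier as sns), which handles the color bar and cell annotations in a single call. A minimal sketch, not part of the required project code:

# Sketch: annotated correlation heatmap using seaborn
sns.heatmap(data.corr(), annot=True)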

Question 3

Are there any pairs of features which exhibit some degree of correlation? Does this confirm or deny your suspicions about the relevance of the feature you attempted to predict? How is the data for those features distributed?
Hint: Is the data normally distributed? Where do most of the data points lie?

Answer:
I have plotted a correlation matrix to compare with the scatter matrix to ensure this answer is as accurate as possible.

  • The following pairs of features appear to be correlated, based on the scatter matrix showing a roughly linear trend and the correlation plot showing a high correlation between the two features. I have ranked them in order of correlation from strongest to weakest.
    • Grocery and Detergents_Paper.
    • Grocery and Milk.
    • Detergents_Paper and Milk (not too strong).
  • The strong correlation among these features lends credence to our earlier claim that Grocery may not be necessary for identifying customers' spending habits.
    • Grocery has a high correlation with Detergents_Paper and Milk, which corresponds to the relatively high R2 score obtained when regressing Grocery on all other features.
  • The data are not normally distributed, and there are many large outliers.
    • The distributions are heavily right-skewed (positively skewed), with most of the data points concentrated at the lower end of each feature's range.
    • This indicates that a scaling or transformation is needed to reduce the skew, since the clustering techniques applied later work much better when the features are closer to normally distributed.

Data Preprocessing

In this section, you will preprocess the data to create a better representation of customers by performing a scaling on the data and detecting (and optionally removing) outliers. Preprocessing data is often a critical step in assuring that the results you obtain from your analysis are significant and meaningful.

Implementation: Feature Scaling

If data is not normally distributed, especially if the mean and median vary significantly (indicating a large skew), it is most often appropriate to apply a non-linear scaling, particularly for financial data. One way to achieve this scaling is with a Box-Cox transformation, which calculates the power transformation of the data that best reduces skewness (a scipy-based sketch appears after the code cell below). A simpler approach which works in most cases is applying the natural logarithm.

In the code block below, you will need to implement the following:

  • Assign a copy of the data to log_data after applying a logarithm scaling. Use the np.log function for this.
  • Assign a copy of the sample data to log_samples after applying a logarithm scaling. Again, use np.log.
In [15]:
# TODO: Scale the data using the natural logarithm
log_data = np.log(data)

# TODO: Scale the sample data using the natural logarithm
log_samples = np.log(samples)

# Produce a scatter matrix for each pair of newly-transformed features
pd.scatter_matrix(log_data, alpha = 0.3, figsize = (14,8), diagonal = 'kde');
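For comparison with the natural logarithm used above, the Box-Cox transformation mentioned earlier can be fitted per feature with scipy.stats.boxcox, which also returns the fitted power parameter lambda. A minimal sketch, assuming scipy is available (variable names are illustrative, and this is not required for the rest of the project):

from scipy.stats import boxcox

# Sketch: Box-Cox power transform applied to each feature independently
# (requires strictly positive values, which holds for this dataset).
boxcox_data = data.copy()
for col in data.columns:
    boxcox_data[col], lam = boxcox(data[col])
    print('Box-Cox lambda for {}: {:.3f}'.format(col, lam))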