Clustering with KMeans in scikit-learn

Unsupervised Learning: Clustering

Algorithms

  • K-means (hard clustering)
  • Single Link Clustering (hard clustering)
  • Expectation Maximization (soft clustering)

A. K-means Algorithm

  1. Assign
    • We will choose cluster centers
      • They are called centroids
      • The initial locations of the centroids affect the final clustering results
  2. Optimize
    • We will move the cluster centers to minimize the total distance between each centroid and the observations assigned to it
      • Picture a web connecting each cluster center to its observations, like a spider web
        • We want to minimize the total length of that web for each centroid (see the sketch after this list)
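A minimal NumPy sketch of one assign/optimize loop (the data, k, and variable names here are made up for illustration; this is not scikit-learn's implementation):

import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(100, 2)                                  # toy data: 100 points in 2-D
k = 3
centroids = X[rng.choice(len(X), k, replace=False)]   # random initial centroids

for _ in range(10):
    # 1. Assign: label each point with its nearest centroid
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # 2. Optimize: move each centroid to the mean of its assigned points
    # (an empty cluster would need special handling; ignored in this sketch)
    centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])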

Limitations of K-means

  • Even with a fixed number of centroids and the same training set, you are not guaranteed the same results
    • The outcome is highly dependent on where you put your centroids initially
      • You might reach a local minimum rather than the global one
    • The more centroids you have, the more local minima there are (see the example after this list)
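A small illustration of this sensitivity, assuming scikit-learn's KMeans on toy blob data (the data and parameters are made up; with init='random' and a single initialization per run, the final inertia can differ by seed when a run lands in a poor local minimum):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_blobs, _ = make_blobs(n_samples=300, centers=6, random_state=0)

for seed in range(3):
    km = KMeans(n_clusters=6, init='random', n_init=1, random_state=seed)
    km.fit(X_blobs)
    # inertia_ = total squared distance of points to their assigned centroids
    print(seed, km.inertia_)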

B. Single Linkage Clustering (SLC)

  • Consider each object a cluster (n objects)
  • Define intercluster distance as the distance between the closest two points in the two clusters
  • Merge two closest clusters
  • Repeat n-k times to make k clusters
    • In sum, it's just linking up the nearest points
      • Connect the dots to their nearest dots, merging step by step (see the SciPy sketch below)
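A minimal sketch of SLC using SciPy's hierarchical clustering (this assumes SciPy is available; method='single' is its single-linkage criterion):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.RandomState(0)
points = rng.rand(20, 2)                  # toy 2-D points

# Single linkage: intercluster distance = distance between the two closest points
Z = linkage(points, method='single')

# Cut the merge tree into k = 3 clusters (i.e., stop after n - k merges)
labels = fcluster(Z, t=3, criterion='maxclust')
print(labels)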

Running Time of SLC

  • O(n^3) or slightly lower

Soft Clustering
The idea of 'soft' clustering is that, instead of assigning each data point to a single definite cluster, you assign each data point a probability of belonging to each of the K clusters.

  1. Select one of K Gaussians uniformly
  2. Sample x_i from that Gaussian
  3. Repeat n times
  4. Find a hypothesis that maximizes the probability of the data (maximum likelihood); a sketch of steps 1-3 follows this list
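A minimal NumPy sketch of this generative process (the number of components, means, variance, and sample size are made-up values for illustration):

import numpy as np

rng = np.random.RandomState(0)
K = 3
means = np.array([-4.0, 0.0, 4.0])     # one mean per Gaussian (1-D, unit variance assumed)
n = 1000

# 1. Select one of the K Gaussians uniformly, once per draw
components = rng.randint(K, size=n)

# 2./3. Sample x_i from the selected Gaussian, repeated n times (vectorized)
x = rng.normal(loc=means[components], scale=1.0)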

C. Expectation Maximization

  • The general overview is similar to K-means, but instead of judging solely by distance you judge by probability, and you never fully assign one data point to one cluster (or one cluster to one data point)
  • Instead, you assign each point partially to each cluster based on the probability that it would belong to that cluster if you knew the clusters exactly; you then re-estimate each cluster's mean assuming those soft assignments are correct, and repeat until the changes become insignificant (see the GaussianMixture sketch below)
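A short sketch of these soft assignments using scikit-learn's GaussianMixture, which is fit with EM (available in sklearn.mixture in recent versions; the toy data below is an assumption, not from the original notebook):

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X_blobs, _ = make_blobs(n_samples=300, centers=3, random_state=0)

gmm = GaussianMixture(n_components=3, random_state=0)
gmm.fit(X_blobs)

# Each row is a probability distribution over the 3 clusters (soft assignment)
probs = gmm.predict_proba(X_blobs)
print(probs[:5].round(3))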

Properties of Expectation Maximization

  • Monotonically non-decreasing likelihood
  • Does not converge in theory (practically it does)
  • Will not diverge
  • Can get stuck in local optima
    • Solution: randomly restart
  • Works with any distribution (if EM solvable)

Clustering Properties
No clustering scheme (algorithm) can achieve all three of the following properties (Kleinberg's impossibility theorem)

  • Richness
    • For any assignment of objects to clusters, there is some distance matrix D such that the clustering scheme P_D returns that clustering
      • In other words, the scheme is able to produce every possible partition of the objects
  • Scale-invariance
    • Scaling all distances by a positive value does not change the clustering
    • e.g., changing units (km to m) should not matter (a quick check with KMeans follows this list)
  • Consistency
    • Shrinking intracluster distances and expanding intercluster distances does not change the clustering
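A quick, informal check of scale-invariance with KMeans (an illustration, not a proof; an adjusted Rand index of 1.0 means the two clusterings agree up to relabeling):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X_blobs, _ = make_blobs(n_samples=300, centers=3, random_state=0)

labels_original = KMeans(n_clusters=3, random_state=0).fit_predict(X_blobs)
labels_scaled = KMeans(n_clusters=3, random_state=0).fit_predict(X_blobs * 1000.0)  # e.g. km -> m

print(adjusted_rand_score(labels_original, labels_scaled))  # should be 1.0 for a uniform rescaling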

K-means clustering using scikit-learn

  • Hyperparameters available for tuning
    • n_clusters=8
      • This is what you can and should change
    • max_iter=300
      • This determines the number of iterations of:
        1. Assign
        2. Optimize (moving the centroids)
    • n_init=10
      • Number of times to initialize the algorithm
      • If you think your data might need more initializations, you can increase this
In [49]:
# Imports
import numpy as np
from sklearn.cluster import KMeans
from sklearn import datasets
from sklearn.model_selection import train_test_split
In [50]:
# Set up data
iris = datasets.load_iris()
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
In [51]:
# Explore target names
iris.target_names
Out[51]:
array(['setosa', 'versicolor', 'virginica'],
      dtype='<U10')
In [52]:
# Explore feature names
iris.feature_names
Out[52]:
['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']
In [53]:
# Instantiate
kmc = KMeans(n_clusters=3, max_iter=1000, n_init=20)

# Fit (K-means is unsupervised, so labels are not needed)
kmc.fit(X_train)
Out[53]:
KMeans(copy_x=True, init='k-means++', max_iter=1000, n_clusters=3, n_init=20,
    n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
    verbose=0)
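As a follow-up (not in the original cells above), one way to see how well the discovered clusters line up with the true species is to predict cluster labels for the test set and compare them with y_test using the adjusted Rand index, which ignores how clusters are numbered:

from sklearn.metrics import adjusted_rand_score

# Cluster assignments for the held-out data
test_clusters = kmc.predict(X_test)

# Agreement between clusters and true labels (1.0 = perfect match up to relabeling)
print(adjusted_rand_score(y_test, test_clusters))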