Machine Learning theory and applications using Octave or Python.

## 1. Large Margin Classification

I would like to give full credit to the respective authors, as these are my personal Python notebooks taken from deep learning courses by Andrew Ng, Data School, and Udemy :) This is a simple Python notebook hosted generously through GitHub Pages on my main personal notes repository at https://github.com/ritchieng/ritchieng.github.io. The notes are meant for my personal review, but I have open-sourced the repository because many people have found it useful.

### 1a. Optimization Objective

- So far we have seen mainly two supervised learning algorithms: logistic regression and neural networks. Often, other aspects of machine learning matter more than the choice of algorithm:
- The amount of training data
- Skill in applying the algorithms

- The SVM sometimes gives a cleaner and more powerful way to learn parameters
- This is the last supervised learning algorithm in this introduction to machine learning

- Alternative view of logistic regression
- If we want hθ(x) = 1, we need z = θTx ≫ 0
- If we want hθ(x) = 0, we need z = θTx ≪ 0
- If y = 1, only the first term would matter
- Graph on the left
- When z is large, cost function would be small
- Magenta curve is a close approximation of the log cost function

- If y = 0, only the second term would matter
- Magenta curve is a close approximation of the log cost function

- Diagram: cost contribution (y-axis) against z = θTx
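
Below is a minimal NumPy sketch (my own illustration, not from the course) of these piecewise-linear replacements, using the course’s cost1/cost0 naming; it shows how they track the log cost while flattening to zero past a margin of 1:

```python
import numpy as np

def logistic_cost_y1(z):
    # Logistic regression's cost for y = 1: -log(sigmoid(z))
    return -np.log(1.0 / (1.0 + np.exp(-z)))

def cost1(z):
    # SVM's piecewise-linear stand-in (magenta curve): 0 once z >= 1
    return np.maximum(0.0, 1.0 - z)

def cost0(z):
    # Counterpart for y = 0: 0 once z <= -1
    return np.maximum(0.0, 1.0 + z)

z = np.linspace(-3.0, 3.0, 7)
print(np.round(logistic_cost_y1(z), 2))  # smoothly decreasing
print(cost1(z))                          # [4. 3. 2. 1. 0. 0. 0.]
print(cost0(z))                          # [0. 0. 0. 1. 2. 3. 4.]
```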

- Support Vector Machine
- Changes to the logistic regression equation
- We replace the first and second terms of logistic regression with the piecewise-linear cost functions cost1(z) and cost0(z) sketched above
- We remove (1 / m) because it is a constant and does not change the minimizing θ
- Instead of A + λB, we use CA + B
- Parameter C plays a role similar to (1 / λ)
- When C = (1 / λ), the two optimization problems give the same optimal parameters θ

- Changes to the hypothesis
- Compared to logistic regression, the SVM does not output a probability
- We get a direct prediction of 1 or 0 instead, as in the sketch below
- If θTx >= 0, then hθ(x) = 1
- If θTx < 0, then hθ(x) = 0

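A sketch of the resulting objective and hypothesis in NumPy (my own illustration, assuming X carries a leading column of ones so theta[0] is the intercept):

```python
import numpy as np

def svm_cost(theta, X, y, C):
    # C*A + B form from above: A is the sum of per-example hinge costs,
    # B the regularization term (theta_0 excluded by convention)
    z = X @ theta
    A = np.sum(y * np.maximum(0, 1 - z) + (1 - y) * np.maximum(0, 1 + z))
    B = 0.5 * np.sum(theta[1:] ** 2)
    return C * A + B

def svm_predict(theta, X):
    # Direct 0/1 prediction -- no probability, just the sign of theta^T x
    return (X @ theta >= 0).astype(int)

theta = np.array([-1.0, 2.0])
X = np.array([[1.0, 0.2], [1.0, 0.9]])   # leading 1s = intercept feature
print(svm_predict(theta, X))             # [0 1]
```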

### 1b. Large Margin Intuition

- Sometimes people call Support Vector Machines “Large Margin Classifiers”
- SVM decision boundary
- If C is huge, we would want A = 0 to minimize the cost function
- How do we make A = 0?
- If y = 1: A = 0 requires θTx >= 1
- If y = 0: A = 0 requires θTx <= -1
- Since we want to ensure A = 0, the optimization problem boils down to minimizing only the latter, regularization term

- SVM decision boundary: linearly separable case
- Black decision boundary
- Has a larger minimum distance (margin) from the training examples
- Chosen by the SVM because of the large margin between the line and the examples

- Magenta and green boundaries
- Pass close to the examples

- Distance between the blue lines and the black line: the margin
- If C is very large
- The decision boundary would change from the black line to the magenta line to accommodate a single outlier
- If C is not very large
- The decision boundary stays the black line
- So the SVM behaves as a strict large margin classifier only when C is very large; a smaller C makes it robust to outliers (see the scikit-learn sketch below)

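To see the C trade-off concretely, here is a small scikit-learn sketch on made-up toy data with a single outlier; the fitted weight vector shifts noticeably when C is huge:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data (made up): two linearly separable clusters plus one outlier
X = np.array([[1.0, 1.0], [2.0, 2.0], [2.0, 0.5],        # y = 1 cluster
              [-1.0, -1.0], [-2.0, -2.0], [-1.5, -0.5],  # y = 0 cluster
              [-1.0, 0.8]])                               # outlier labeled y = 1
y = np.array([1, 1, 1, 0, 0, 0, 1])

for C in (0.1, 1000.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w, b = clf.coef_[0], clf.intercept_[0]
    print(f"C={C}: w={np.round(w, 2)}, b={b:.2f}")
# With small C the boundary stays near the original separator (the black
# line), ignoring the outlier; with huge C it swings toward the outlier
# (the magenta line).
```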

### 1c. Mathematics of Large Margin Classification

- Vector inner product
- Brief details
- u_transpose * v is also called the inner product of u and v
- Length of u: ll u ll = sqrt(u1^2 + u2^2), i.e. the hypotenuse given by Pythagoras’ Theorem

- If we project vector v onto vector u (green line)
- p = signed length of the projection of v onto u
- p can be positive or negative
- p is negative when the angle between v and u is more than 90 degrees
- p is positive when the angle between v and u is less than 90 degrees

- u_transpose * v = p . ll u ll = u1 * v1 + u2 * v2 = v_transpose * u (a quick numeric check follows below)

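A quick NumPy check of these identities with arbitrary vectors:

```python
import numpy as np

u = np.array([4.0, 2.0])
v = np.array([1.0, 3.0])

norm_u = np.linalg.norm(u)   # ll u ll, the hypotenuse via Pythagoras
p = (v @ u) / norm_u         # signed length of v projected onto u

# All three expressions give the same inner product (10.0 here)
print(u @ v)                     # u_transpose * v
print(p * norm_u)                # p . ll u ll
print(u[0]*v[0] + u[1]*v[1])     # u1*v1 + u2*v2
```
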
- SVM decision boundary: introduction
- We set the number of features, n, to 2
- Notice that the SVM minimization amounts to minimizing the squared norm (the squared length) of the parameter vector θ, ll θ ll^2

- SVM decision boundary: projections and hypothesis
- When θ0 = 0, the decision boundary passes through the origin
- The vector θ is always at 90 degrees (perpendicular) to the decision boundary
- Decision boundary choice 1: graph on the left
- p1 is the projection of example x1 onto θ (red)
- We need p1 . ll θ ll >= 1
- Since p1 is small, ll θ ll has to be large for this to hold
- p2 is the projection of example x2 onto θ (magenta)
- We need p2 . ll θ ll <= -1
- Since the magnitude of p2 is small, ll θ ll again has to be large
- But our purpose is to minimise ll θ ll^2
- So this decision boundary choice does not appear to be suitable

- Decision boundary choice 2: graph on the right
- p1 is the projection of example x1 onto θ (red)
- p1 is much bigger, so the norm of θ, ll θ ll, can be smaller
- p2 is the projection of example x2 onto θ (magenta)
- p2 is much bigger in magnitude, so ll θ ll can be smaller
- Hence ll θ ll^2 will be smaller
- And this is why the SVM chooses this decision boundary
- The magnitude of the margin is given by the values p1, p2, p3 and so on
- The SVM ends up with a large margin because maximizing these projections is what lets it minimize the squared norm of θ, ll θ ll^2 (see the numeric sketch below)

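A numeric sketch (made-up examples and hand-picked θ candidates) of the trade-off between the projections p and ll θ ll:

```python
import numpy as np

# Two positive (y = 1) examples; theta_0 = 0, so the boundary passes
# through the origin and p_i = theta^T x_i / ll theta ll
X = np.array([[1.0, 2.0], [2.0, 1.5]])

for label, theta in (("boundary close to examples", np.array([3.0, -1.0])),
                     ("boundary far from examples", np.array([0.4, 0.4]))):
    norm = np.linalg.norm(theta)
    p = X @ theta / norm            # signed projections of each x_i onto theta
    print(f"{label}: p = {np.round(p, 2)}, ll theta ll = {norm:.2f}")
# Both choices satisfy theta^T x_i = p_i * ll theta ll >= 1, but small
# projections force a large ll theta ll, while large projections (a large
# margin) let the SVM keep ll theta ll^2 small.
```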

## 2. Kernels

### 2a. Kernels I

- Non-linear decision boundary
- Given the data, is there a different or better choice of the features f1, f2, f3, …, fn?
- We also saw that using high-order polynomial features is computationally expensive

- Gaussian kernel
- We will manually pick 3 landmarks (points) l(1), l(2), l(3)
- Given an example x, we will define each feature as a measure of similarity between x and a landmark
- f1 = similarity(x, l(1))
- f2 = similarity(x, l(2))
- f3 = similarity(x, l(3))

- This particular similarity function is called a Gaussian kernel
- The kernel is often denoted k(x, l(i)); a minimal implementation follows below
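
A minimal implementation of that similarity function, with a loop showing the σ^2 effect described next:

```python
import numpy as np

def gaussian_kernel(x, l, sigma):
    # Similarity between example x and landmark l:
    # exp(-ll x - l ll^2 / (2 sigma^2)); 1 when x == l, -> 0 far away
    return np.exp(-np.sum((x - l) ** 2) / (2.0 * sigma ** 2))

x = np.array([2.0, 2.0])
l1 = np.array([3.0, 5.0])
for sigma_sq in (0.5, 1.0, 3.0):
    print(sigma_sq, gaussian_kernel(x, l1, np.sqrt(sigma_sq)))
# Larger sigma^2 -> the feature value falls off more slowly with distance
```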

- Kernels and similarity
- Kernel Example
- As you increase sigma squared, σ^2, the value of the feature falls away much more slowly as you move away from l(1)

- Kernel Example 2
- For the first point (magenta), we predict 1 because θ0 + θ1 f1 + θ2 f2 + θ3 f3 >= 0
- For the second point (cyan), we predict 0 because θ0 + θ1 f1 + θ2 f2 + θ3 f3 < 0
- In this way we can learn complex non-linear decision boundaries, as in the sketch below
- We predict positive when we’re close to the landmarks
- We predict negative when we’re far away from the landmarks
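
Here is an illustrative sketch of such a prediction; the landmark locations are made up, and the parameter values θ0 = -0.5, θ1 = 1, θ2 = 1, θ3 = 0 follow the lecture’s example:

```python
import numpy as np

def gaussian(x, l, sigma=1.0):
    return np.exp(-np.sum((x - l) ** 2) / (2.0 * sigma ** 2))

# Hypothetical landmark locations; theta values follow the lecture example
landmarks = [np.array([3.0, 5.0]), np.array([5.0, 1.0]), np.array([1.0, 1.0])]
theta = np.array([-0.5, 1.0, 1.0, 0.0])

def predict(x):
    f = np.array([1.0] + [gaussian(x, l) for l in landmarks])  # f0 = 1
    return int(theta @ f >= 0)   # direct 0/1 prediction

print(predict(np.array([3.0, 4.5])))  # near l(1): score ~ 0.38 -> predicts 1
print(predict(np.array([7.0, 7.0])))  # far from all landmarks -> predicts 0
```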

- Questions we have yet to answer
- How do we get or choose these landmarks?
- What other similarity functions can we use besides the Gaussian kernel?

### 2b. Kernels II

- Choosing the landmarks
- We place one landmark at the exact location of each training example, so m training examples give m landmarks l(1), …, l(m)

- SVM with kernels
- When we solve the following optimization problem, we get the parameters θ (a sketch of the feature mapping follows below)
- We do not regularize θ0, so the regularization sum starts from j = 1

- SVM parameters
- C (similar to 1/λ): a large C gives lower bias but higher variance; a small C gives higher bias but lower variance
- σ^2: a large σ^2 makes the features f(i) vary more smoothly (higher bias, lower variance); a small σ^2 does the opposite
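
A sketch of the feature mapping described above (my own vectorized illustration): every training example becomes a landmark, and each example x maps to f in R^(m+1):

```python
import numpy as np

def kernel_features(X, landmarks, sigma=1.0):
    # Map each example to f in R^(m+1): f0 = 1 and
    # f_i = exp(-ll x - l(i) ll^2 / (2 sigma^2)) for each landmark l(i)
    diffs = X[:, None, :] - landmarks[None, :, :]        # (examples, m, n)
    F = np.exp(-np.sum(diffs ** 2, axis=2) / (2.0 * sigma ** 2))
    return np.hstack([np.ones((X.shape[0], 1)), F])      # prepend f0 = 1

X_train = np.array([[1.0, 2.0], [2.0, 0.5], [0.0, 1.0]])
landmarks = X_train.copy()            # one landmark per training example
F = kernel_features(X_train, landmarks)
print(F.shape)                        # (3, 4): m + 1 features per example
print(np.round(F, 3))                 # kernel part has 1s on the diagonal
```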

## 3. SVMs in Practice

- We would normally use an SVM software package (liblinear, libsvm etc.) to solve for the parameters θ
- You need to specify the following
- Choice of parameter C
- Choice of kernel (similarity function)
- No kernel is essentially a “linear kernel”
- Predict “y = 1” if θ_transpose * x >= 0
- Use this when n (the number of features) is large and m (the number of training examples) is small

- Gaussian kernel
- For this kernel, we have to choose σ^2
- Use this when n (the number of features) is small and/or m (the number of training examples) is large

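In scikit-learn, for example, these two choices look like this; note the Gaussian kernel is called “rbf” there, with gamma = 1 / (2σ^2) (toy data made up here):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(100, 2)
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)   # circular boundary

# "No kernel" / linear kernel: predict y = 1 when theta^T x >= 0
linear_clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Gaussian kernel is called "rbf" here; scikit-learn's gamma = 1 / (2 sigma^2)
sigma = 1.0
rbf_clf = SVC(kernel="rbf", C=1.0, gamma=1.0 / (2.0 * sigma ** 2)).fit(X, y)

# The rbf kernel should fit the circular boundary far better
print(linear_clf.score(X, y), rbf_clf.score(X, y))
```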

- If you choose a Gaussian kernel
- Octave implementation
- We have to do feature scaling before using the Gaussian kernel (see the sketch below)
- This is because, if we don’t, ll x - l ll^2 would be dominated by the features that are large in scale, such as a house-size feature measured in thousands of square feet

- The Gaussian kernel is also parameterized by a bandwidth parameter, σ, which determines how fast the similarity metric decreases (to 0) as the examples move further apart

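The notes reference the Octave implementation; the same precaution in Python might look like this, with made-up housing-style features on very different scales:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Made-up features on very different scales: size in sqft vs. bedrooms
X = np.array([[2104.0, 3.0], [1600.0, 3.0], [2400.0, 4.0], [852.0, 2.0]])
y = np.array([1, 0, 1, 0])

# Scaling first keeps ll x - l ll^2 from being dominated by the sqft feature
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(X, y)
print(model.predict([[1500.0, 3.0]]))
```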

- Other choices of kernel
- Valid similarity functions must satisfy Mercer’s Theorem so the SVM packages’ optimizations run correctly; examples include the polynomial kernel, string kernel, and chi-square kernel
- Multi-class classification
- Typically most packages have this functionality built in; otherwise, use the one-vs-all method

- Logistic Regression vs SVMs
- When do we use logistic regression and when do we use SVMs?
- The key thing to note is that if there is a huge number of training examples, an SVM with a Gaussian kernel takes a long time to train
- The optimization problem of an SVM is convex, so a good package will always find the global minimum
- Neural Network: the objective is non-convex, so training may find only a local optimum
