Machine Learning theory and applications using Octave or Python.

## 1. Large Margin Classification

I would like to give full credit to the respective authors, as these are my personal Python notebooks taken from deep learning courses by Andrew Ng, Data School, and Udemy :) This is a simple Python notebook hosted generously through GitHub Pages on my main personal notes repository at https://github.com/ritchieng/ritchieng.github.io. The notes are meant for my personal review, but I have open-sourced the repository because many people have found it useful.

### 1a. Optimization Objective

- So far we have seen mainly two supervised learning algorithms: logistic regression and neural networks. Often, other aspects of machine learning matter more than the choice of algorithm:
- The amount of training data
- Skill in applying the algorithms

- The SVM sometimes gives a cleaner and more powerful way to learn parameters
- This is the last supervised learning algorithm in this introduction to machine learning

- Alternative view of logistic regression
- If we want hθ(x) = 1, we need z = θTx ≫ 0
- If we want hθ(x) = 0, we need z = θTx ≪ 0
- If y = 1, only the first term would matter
- Graph on the left
- When z is large, cost function would be small
- Magenta curve is a close approximation of the log cost function

- If y = 0, only the second term would matter
- Magenta curve is a close approximation of the log cost function

- Diagram: cost contribution (y-axis) against z = θTx
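
Below is a minimal NumPy sketch (my own illustration, not from the course) of these piecewise-linear replacements, using the course’s cost1/cost0 naming; it shows how they track the log cost while flattening to zero past a margin of 1:

```python
import numpy as np

def logistic_cost_y1(z):
    # Logistic regression's cost for y = 1: -log(sigmoid(z))
    return -np.log(1.0 / (1.0 + np.exp(-z)))

def cost1(z):
    # SVM's piecewise-linear stand-in (magenta curve): 0 once z >= 1
    return np.maximum(0.0, 1.0 - z)

def cost0(z):
    # Counterpart for y = 0: 0 once z <= -1
    return np.maximum(0.0, 1.0 + z)

z = np.linspace(-3.0, 3.0, 7)
print(np.round(logistic_cost_y1(z), 2))  # smoothly decreasing
print(cost1(z))                          # [4. 3. 2. 1. 0. 0. 0.]
print(cost0(z))                          # [0. 0. 0. 1. 2. 3. 4.]
```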

- Support Vector Machine
- Changes to the logistic regression equation
- We replace the first and second terms of logistic regression with the piecewise-linear cost functions cost1(z) and cost0(z) sketched above
- We remove (1 / m) because it is a constant and does not change the minimizing θ
- Instead of A + λB, we use CA + B
- Parameter C plays a role similar to (1 / λ)
- When C = (1 / λ), the two optimization problems give the same optimal parameters θ

- Changes to the hypothesis
- Compared to logistic regression, the SVM does not output a probability
- We get a direct prediction of 1 or 0 instead, as in the sketch below
- If θTx >= 0, then hθ(x) = 1
- If θTx < 0, then hθ(x) = 0

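A sketch of the resulting objective and hypothesis in NumPy (my own illustration, assuming X carries a leading column of ones so theta[0] is the intercept):

```python
import numpy as np

def svm_cost(theta, X, y, C):
    # C*A + B form from above: A is the sum of per-example hinge costs,
    # B the regularization term (theta_0 excluded by convention)
    z = X @ theta
    A = np.sum(y * np.maximum(0, 1 - z) + (1 - y) * np.maximum(0, 1 + z))
    B = 0.5 * np.sum(theta[1:] ** 2)
    return C * A + B

def svm_predict(theta, X):
    # Direct 0/1 prediction -- no probability, just the sign of theta^T x
    return (X @ theta >= 0).astype(int)

theta = np.array([-1.0, 2.0])
X = np.array([[1.0, 0.2], [1.0, 0.9]])   # leading 1s = intercept feature
print(svm_predict(theta, X))             # [0 1]
```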

### 1b. Large Margin Intuition

- Sometimes people call Support Vector Machines “Large Margin Classifiers”
- SVM decision boundary
- If C is huge, we would want A = 0 to minimize the cost function
- How do we make A = 0?
- If y = 1: A = 0 requires θTx >= 1
- If y = 0: A = 0 requires θTx <= -1
- Since we want to ensure A = 0, the optimization problem boils down to minimizing only the latter, regularization term

- SVM decision boundary: linearly separable case
- Black decision boundary
- Has a larger minimum distance (margin) from the training examples
- Chosen by the SVM because of the large margin between the line and the examples

- Magenta and green boundaries
- Pass close to the examples

- Distance between the blue lines and the black line: the margin
- If C is very large
- The decision boundary would change from the black line to the magenta line to accommodate a single outlier
- If C is not very large
- The decision boundary stays the black line
- So the SVM behaves as a strict large margin classifier only when C is very large; a smaller C makes it robust to outliers (see the scikit-learn sketch below)

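To see the C trade-off concretely, here is a small scikit-learn sketch on made-up toy data with a single outlier; the fitted weight vector shifts noticeably when C is huge:

```python
import numpy as np
from sklearn.svm import SVC

# Toy data (made up): two linearly separable clusters plus one outlier
X = np.array([[1.0, 1.0], [2.0, 2.0], [2.0, 0.5],        # y = 1 cluster
              [-1.0, -1.0], [-2.0, -2.0], [-1.5, -0.5],  # y = 0 cluster
              [-1.0, 0.8]])                               # outlier labeled y = 1
y = np.array([1, 1, 1, 0, 0, 0, 1])

for C in (0.1, 1000.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w, b = clf.coef_[0], clf.intercept_[0]
    print(f"C={C}: w={np.round(w, 2)}, b={b:.2f}")
# With small C the boundary stays near the original separator (the black
# line), ignoring the outlier; with huge C it swings toward the outlier
# (the magenta line).
```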

### 1c. Mathematics of Large Margin Classification

- Vector inner product
- Brief details
- u_transpose * v is also called the inner product of u and v
- Length of u: ll u ll = sqrt(u1^2 + u2^2), i.e. the hypotenuse given by Pythagoras’ Theorem

- If we project vector v onto vector u (green line)
- p = signed length of the projection of v onto u
- p can be positive or negative
- p is negative when the angle between v and u is more than 90 degrees
- p is positive when the angle between v and u is less than 90 degrees

- u_transpose * v = p . ll u ll = u1 * v1 + u2 * v2 = v_transpose * u (a quick numeric check follows below)

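A quick NumPy check of these identities with arbitrary vectors:

```python
import numpy as np

u = np.array([4.0, 2.0])
v = np.array([1.0, 3.0])

norm_u = np.linalg.norm(u)   # ll u ll, the hypotenuse via Pythagoras
p = (v @ u) / norm_u         # signed length of v projected onto u

# All three expressions give the same inner product (10.0 here)
print(u @ v)                     # u_transpose * v
print(p * norm_u)                # p . ll u ll
print(u[0]*v[0] + u[1]*v[1])     # u1*v1 + u2*v2
```
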
- SVM decision boundary: introduction
- We set the number of features, n, to 2
- Notice that the SVM minimization amounts to minimizing the squared norm (the squared length) of the parameter vector θ, ll θ ll^2

- SVM decision boundary: projections and hypothesis
- When θ0 = 0, the decision boundary passes through the origin
- The vector θ is always at 90 degrees (perpendicular) to the decision boundary
- Decision boundary choice 1: graph on the left
- p1 is the projection of example x1 onto θ (red)
- We need p1 . ll θ ll >= 1
- Since p1 is small, ll θ ll has to be large for this to hold
- p2 is the projection of example x2 onto θ (magenta)
- We need p2 . ll θ ll <= -1
- Since the magnitude of p2 is small, ll θ ll again has to be large
- But our purpose is to minimise ll θ ll^2
- So this decision boundary choice does not appear to be suitable

- Decision boundary choice 2: graph on the right
- p1 is the projection of example x1 onto θ (red)
- p1 is much bigger, so the norm of θ, ll θ ll, can be smaller
- p2 is the projection of example x2 onto θ (magenta)
- p2 is much bigger in magnitude, so ll θ ll can be smaller
- Hence ll θ ll^2 will be smaller
- And this is why the SVM chooses this decision boundary
- The magnitude of the margin is given by the values p1, p2, p3 and so on
- The SVM ends up with a large margin because maximizing these projections is what lets it minimize the squared norm of θ, ll θ ll^2 (see the numeric sketch below)

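A numeric sketch (made-up examples and hand-picked θ candidates) of the trade-off between the projections p and ll θ ll:

```python
import numpy as np

# Two positive (y = 1) examples; theta_0 = 0, so the boundary passes
# through the origin and p_i = theta^T x_i / ll theta ll
X = np.array([[1.0, 2.0], [2.0, 1.5]])

for label, theta in (("boundary close to examples", np.array([3.0, -1.0])),
                     ("boundary far from examples", np.array([0.4, 0.4]))):
    norm = np.linalg.norm(theta)
    p = X @ theta / norm            # signed projections of each x_i onto theta
    print(f"{label}: p = {np.round(p, 2)}, ll theta ll = {norm:.2f}")
# Both choices satisfy theta^T x_i = p_i * ll theta ll >= 1, but small
# projections force a large ll theta ll, while large projections (a large
# margin) let the SVM keep ll theta ll^2 small.
```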

## 2. Kernels

### 2a. Kernels I

- Non-linear decision boundary
- Given the data, is there a different or better choice of the features f1, f2, f3, …, fn?
- We also saw that using high-order polynomial features is computationally expensive

- Gaussian kernel
- We will manually pick 3 landmarks (points) l(1), l(2), l(3)
- Given an example x, we will define each feature as a measure of similarity between x and a landmark
- f1 = similarity(x, l(1))
- f2 = similarity(x, l(2))
- f3 = similarity(x, l(3))

- This particular similarity function is called a Gaussian kernel
- The kernel is often denoted k(x, l(i)); a minimal implementation follows below
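
A minimal implementation of that similarity function, with a loop showing the σ^2 effect described next:

```python
import numpy as np

def gaussian_kernel(x, l, sigma):
    # Similarity between example x and landmark l:
    # exp(-ll x - l ll^2 / (2 sigma^2)); 1 when x == l, -> 0 far away
    return np.exp(-np.sum((x - l) ** 2) / (2.0 * sigma ** 2))

x = np.array([2.0, 2.0])
l1 = np.array([3.0, 5.0])
for sigma_sq in (0.5, 1.0, 3.0):
    print(sigma_sq, gaussian_kernel(x, l1, np.sqrt(sigma_sq)))
# Larger sigma^2 -> the feature value falls off more slowly with distance
```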

- Kernels and similarity
- Kernel Example
- As you increase sigma squared, σ^2, the value of the feature falls away much more slowly as you move away from l(1)

- Kernel Example 2
- For the first point (magenta), we predict 1 because θ0 + θ1 f1 + θ2 f2 + θ3 f3 >= 0
- For the second point (cyan), we predict 0 because θ0 + θ1 f1 + θ2 f2 + θ3 f3 < 0
- In this way we can learn complex non-linear decision boundaries, as in the sketch below
- We predict positive when we’re close to the landmarks
- We predict negative when we’re far away from the landmarks
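
Here is an illustrative sketch of such a prediction; the landmark locations are made up, and the parameter values θ0 = -0.5, θ1 = 1, θ2 = 1, θ3 = 0 follow the lecture’s example:

```python
import numpy as np

def gaussian(x, l, sigma=1.0):
    return np.exp(-np.sum((x - l) ** 2) / (2.0 * sigma ** 2))

# Hypothetical landmark locations; theta values follow the lecture example
landmarks = [np.array([3.0, 5.0]), np.array([5.0, 1.0]), np.array([1.0, 1.0])]
theta = np.array([-0.5, 1.0, 1.0, 0.0])

def predict(x):
    f = np.array([1.0] + [gaussian(x, l) for l in landmarks])  # f0 = 1
    return int(theta @ f >= 0)   # direct 0/1 prediction

print(predict(np.array([3.0, 4.5])))  # near l(1): score ~ 0.38 -> predicts 1
print(predict(np.array([7.0, 7.0])))  # far from all landmarks -> predicts 0
```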

- Questions we have yet to answer
- How do we get or choose these landmarks?
- What other similarity functions can we use besides the Gaussian kernel?

### 2b. Kernels II

- Choosing the landmarks
- We place one landmark at the exact location of each training example, so m training examples give m landmarks l(1), …, l(m)

- SVM with kernels
- When we solve the following optimization problem, we get the parameters θ (a sketch of the feature mapping follows below)
- We do not regularize θ0, so the regularization sum starts from j = 1

- SVM parameters
- C (similar to 1/λ): a large C gives lower bias but higher variance; a small C gives higher bias but lower variance
- σ^2: a large σ^2 makes the features f(i) vary more smoothly (higher bias, lower variance); a small σ^2 does the opposite
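
A sketch of the feature mapping described above (my own vectorized illustration): every training example becomes a landmark, and each example x maps to f in R^(m+1):

```python
import numpy as np

def kernel_features(X, landmarks, sigma=1.0):
    # Map each example to f in R^(m+1): f0 = 1 and
    # f_i = exp(-ll x - l(i) ll^2 / (2 sigma^2)) for each landmark l(i)
    diffs = X[:, None, :] - landmarks[None, :, :]        # (examples, m, n)
    F = np.exp(-np.sum(diffs ** 2, axis=2) / (2.0 * sigma ** 2))
    return np.hstack([np.ones((X.shape[0], 1)), F])      # prepend f0 = 1

X_train = np.array([[1.0, 2.0], [2.0, 0.5], [0.0, 1.0]])
landmarks = X_train.copy()            # one landmark per training example
F = kernel_features(X_train, landmarks)
print(F.shape)                        # (3, 4): m + 1 features per example
print(np.round(F, 3))                 # kernel part has 1s on the diagonal
```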

## 3. SVMs in Practice

- We would normally use an SVM software package (liblinear, libsvm etc.) to solve for the parameters θ
- You need to specify the following
- Choice of parameter C
- Choice of kernel (similarity function)
- No kernel is essentially a “linear kernel”
- Predict “y = 1” if θ_transpose * x >= 0
- Use this when n (the number of features) is large and m (the number of training examples) is small

- Gaussian kernel
- For this kernel, we have to choose σ^2
- Use this when n (the number of features) is small and/or m (the number of training examples) is large

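In scikit-learn, for example, these two choices look like this; note the Gaussian kernel is called “rbf” there, with gamma = 1 / (2σ^2) (toy data made up here):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(100, 2)
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(int)   # circular boundary

# "No kernel" / linear kernel: predict y = 1 when theta^T x >= 0
linear_clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Gaussian kernel is called "rbf" here; scikit-learn's gamma = 1 / (2 sigma^2)
sigma = 1.0
rbf_clf = SVC(kernel="rbf", C=1.0, gamma=1.0 / (2.0 * sigma ** 2)).fit(X, y)

# The rbf kernel should fit the circular boundary far better
print(linear_clf.score(X, y), rbf_clf.score(X, y))
```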

- If you choose a Gaussian kernel
- Octave implementation
- We have to do feature scaling before using the Gaussian kernel (see the sketch below)
- This is because, if we don’t, ll x - l ll^2 would be dominated by the features that are large in scale, such as a house-size feature measured in thousands of square feet

- The Gaussian kernel is also parameterized by a bandwidth parameter, σ, which determines how fast the similarity metric decreases (to 0) as the examples move further apart

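The notes reference the Octave implementation; the same precaution in Python might look like this, with made-up housing-style features on very different scales:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Made-up features on very different scales: size in sqft vs. bedrooms
X = np.array([[2104.0, 3.0], [1600.0, 3.0], [2400.0, 4.0], [852.0, 2.0]])
y = np.array([1, 0, 1, 0])

# Scaling first keeps ll x - l ll^2 from being dominated by the sqft feature
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(X, y)
print(model.predict([[1500.0, 3.0]]))
```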

- Other choices of kernel
- Valid similarity functions must satisfy Mercer’s Theorem so the SVM packages’ optimizations run correctly; examples include the polynomial kernel, string kernel, and chi-square kernel
- Multi-class classification
- Typically most packages have this functionality built in; otherwise, use the one-vs-all method

- Logistic Regression vs SVMs
- When do we use logistic regression and when do we use SVMs?
- The key thing to note is that if there is a huge number of training examples, an SVM with a Gaussian kernel takes a long time to train
- The optimization problem of an SVM is convex, so a good package will always find the global minimum
- Neural Network: the objective is non-convex, so training may find only a local optimum
