Gradient descent with large data, stochastic gradient descent, mini-batch gradient descent, map reduce, data parallelism, and online learning.

1. Gradient Descent with Large Data Sets

I would like to give full credit to the respective authors, as these are my personal Python notebooks taken from deep learning courses by Andrew Ng, Data School, and Udemy :) This is a simple Python notebook hosted generously through GitHub Pages from my main personal notes repository at https://github.com/ritchieng/ritchieng.github.io. These notes are meant for my personal review, but I have open-sourced the repository because many people have found it useful.

1a. Learning with Large Data Sets

  • Why do we want a large data set?
    • This is evident when we take a low-bias learning algorithm and train it on a lot of data
      • The classic example of choosing between confusable words (e.g., “For breakfast I ate two/to/too eggs”) shows how most algorithms perform well when we feed them a lot of data
  • Learning with large data sets has computational problems
    • If m = 100,000,000, we have to sum over 100,000,000 entries to compute one step of gradient descent
    • Suppose you are facing a supervised learning problem and have a very large data set (m = 100,000,000). How can you tell whether using all of the data is likely to perform much better than using a small subset (say, m = 100)?
      • Plot a learning curve for a range of values of m and verify that the algorithm has high variance when m is small (a sketch of such a plot follows this list)
        • We would then be confident that adding more training examples would increase the accuracy
        • But if it’s high bias, we do not need to plot to a large value of m
          • In this case, you can add extra features or units (in neural networks)
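A minimal sketch (not from the original course) of plotting such a learning curve with NumPy and matplotlib; the synthetic data and ordinary-least-squares model here are placeholders for your own problem:

```python
# Sketch: plot training and validation error against training set size m to
# diagnose high variance vs. high bias before committing to a huge data set.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))                     # synthetic placeholder data
y = X @ rng.normal(size=5) + 0.5 * rng.normal(size=2000)

X_train, y_train = X[:1500], y[:1500]
X_val, y_val = X[1500:], y[1500:]

def mse(theta, X, y):
    return np.mean((X @ theta - y) ** 2) / 2

sizes = [20, 50, 100, 200, 500, 1000, 1500]
train_err, val_err = [], []
for m in sizes:
    # Fit ordinary least squares on only the first m examples
    theta, *_ = np.linalg.lstsq(X_train[:m], y_train[:m], rcond=None)
    train_err.append(mse(theta, X_train[:m], y_train[:m]))
    val_err.append(mse(theta, X_val, y_val))

plt.plot(sizes, train_err, label="train error")
plt.plot(sizes, val_err, label="validation error")
plt.xlabel("m (training set size)")
plt.ylabel("error")
plt.legend()
plt.show()
# A large, persistent gap (high variance) suggests more data will help;
# two high curves that have already converged (high bias) suggest it will not.
```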

1b. Stochastic Gradient Descent

  • Suppose we are training a linear regression model with gradient descent
    • If m is really large, we have to sum across all the examples
    • This is actually called batch gradient descent when you look at all the training examples
  • We can use a stochastic gradient descent instead of a batch gradient descent
    • Stochastic gradient descent (more efficient)
      • Define the cost of parameter θ on a single example: cost(θ, (x_i, y_i)) = ½ (h_θ(x_i) − y_i)²
        • This measures how well the hypothesis is doing on a single training example (x_i, y_i)
      • The algorithm randomly shuffles the training set, then takes the first training example and takes a step (modifying the parameters) to fit that one example better, and repeats this one example at a time until the end of the training set
      • Difference
        • Rather than waiting to sum the gradient terms over all the training examples (batch gradient descent), we look at a single training example at a time and make progress towards the global minimum with each one
        • Batch gradient descent (red path in the lecture's contour plot) moves fairly directly towards the global minimum
        • Stochastic gradient descent (magenta path) takes a more random-looking path that wanders around near the global minimum
        • In practice, as long as the parameters end up near the global minimum, that's sufficient
        • We can repeat the loop maybe 1 to 10 times
        • When m is large, it is possible to get good parameters even with a single pass over the data
        • Your J_train (cost function) may not decrease with every iteration for stochastic gradient descent
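A minimal sketch of stochastic gradient descent for linear regression in NumPy, assuming a fixed learning rate alpha and the single-example squared-error cost defined above; the constants and variable names are illustrative:

```python
import numpy as np

def sgd_linear_regression(X, y, alpha=0.01, num_epochs=3):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_epochs):           # repeat the outer loop ~1 to 10 times
        order = np.random.permutation(m)  # randomly shuffle the training set
        for i in order:
            # Gradient of cost(theta, (x_i, y_i)) = 1/2 (h(x_i) - y_i)^2
            error = X[i] @ theta - y[i]
            theta -= alpha * error * X[i]  # step using this single example only
    return theta

# Usage: X should already include a column of ones for the intercept term.
X = np.c_[np.ones(1000), np.random.randn(1000, 2)]
y = X @ np.array([1.0, 2.0, -3.0]) + 0.1 * np.random.randn(1000)
print(sgd_linear_regression(X, y))
```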

1c. Mini-Batch Gradient Descent

  • This may sometimes be even faster than stochastic gradient descent
  • Difference
    • Batch gradient descent
      • Use all m examples in each iteration
    • Stochastic gradient descent
      • Use 1 example in each iteration
    • Mini-batch gradient descent
      • Use b examples in each iteration
      • b = mini-batch size
        • e.g., b = 10 examples (a typical range is 2 to 100)
  • Mini-batch gradient descent algorithm
    • After looking at just the b = 10 training examples in the first mini-batch (rather than all m examples), we already start making progress in modifying the parameters
    • Mini-batch gradient descent can outperform stochastic gradient descent if we use a vectorized implementation
    • This algorithm is the same as batch gradient descent if b = m, where we iterate across all training examples
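A sketch of mini-batch gradient descent under the same linear-regression setup as the SGD sketch above; the vectorized update over each mini-batch of b examples is where the speed-up over one-example-at-a-time SGD can come from:

```python
import numpy as np

def minibatch_gd(X, y, alpha=0.01, b=10, num_epochs=3):
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_epochs):
        order = np.random.permutation(m)
        for start in range(0, m, b):
            idx = order[start:start + b]          # the next b examples
            errors = X[idx] @ theta - y[idx]      # vectorized over the mini-batch
            theta -= alpha * (X[idx].T @ errors) / len(idx)
        # With b = m this reduces to batch gradient descent (one update per pass).
    return theta
```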

1d. Stochastic Gradient Descent Convergence

  • Checking for convergence
    • During learning, compute cost(θ, (x_i, y_i)) on each example just before updating θ with it
    • Every 1,000 iterations (say), plot the cost averaged over the last 1,000 examples processed
    • When you average over a smaller number of examples, the plotted curve will be noisier
  • Stochastic gradient descent learning rate issue
    • The learning rate α is typically held constant, so the parameters end up oscillating around the global minimum
    • Slowly decreasing α over time (e.g., α = const1 / (iterationNumber + const2)) allows the parameters to converge to a slightly better hypothesis, at the cost of tuning the extra constants
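A sketch combining both ideas: record the single-example cost just before each update, plot its average every 1,000 iterations, and use a slowly decaying learning rate. The constants const1 and const2 are illustrative and would need tuning:

```python
import numpy as np
import matplotlib.pyplot as plt

def sgd_with_monitoring(X, y, const1=1.0, const2=100.0, window=1000):
    m, n = X.shape
    theta = np.zeros(n)
    recent_costs, averaged = [], []
    for t, i in enumerate(np.random.permutation(m)):
        error = X[i] @ theta - y[i]
        recent_costs.append(0.5 * error ** 2)      # cost before the update
        alpha = const1 / (t + const2)              # slowly decreasing learning rate
        theta -= alpha * error * X[i]
        if (t + 1) % window == 0:
            averaged.append(np.mean(recent_costs))  # average over the last `window` costs
            recent_costs = []
    plt.plot(averaged)
    plt.xlabel("iterations (x1000)")
    plt.ylabel("average cost over last 1000 examples")
    plt.show()
    return theta
# A smaller `window` gives a noisier curve; a larger one gives a smoother but
# more delayed picture of convergence.
```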

2. Advanced Topics

2a. Online Learning

  • What is online learning?
    • When we have a continuous stream of information and we want to learn from that
    • We can use an online learning algorithm to optimize some decisions
  • Example of online learning: learning whether to buy or not buy
    • Shipping service website where user comes, specifies origin and destination, you offer to ship their package for some asking price, and users sometimes choose to use your shipping service (y = 1), sometimes not (y = 0)
    • Features x captures properties of user, of origin/destination and asking price
    • We want to learn p(y = 1 | x; θ) to optimize the price
      • Probability of y = 1 given the features x (which include the asking price), parameterized by θ
      • We can use logistic regression or neural network
    • Online learning algorithm (see the sketch after this list)
      • Repeat forever: get the (x, y) pair corresponding to the current user, update θ using just that example, then discard the example
      • If you have a large number of users (a continuous stream of data), you can do this
      • If you have a small number of users, you should save the data and train your parameters on that fixed training set, not on a continuous stream of data
      • This online algorithm can adapt to changing user preferences over time
  • Example 2 of online learning: learning to search
    • User searches for “Android phone 1080p camera”
    • Have 100 phones in store; the search will return 10 results
    • x = features of the phone, e.g., how many words in the user query match the name of the phone, how many words match the description of the phone
    • y = 1 if user clicks on link
    • y = 0 if otherwise
    • We want to learn p(y = 1 | x; θ), the predicted click-through rate (CTR)
    • Use this to show the user the 10 phones they’re most likely to click on
  • Other examples
    • Choosing special offers to show user
    • Customized selection of news articles
    • Product recommendation
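A sketch of the online learning loop for the shipping example, using a logistic-regression update for p(y = 1 | x; θ). Here get_next_user_example is a hypothetical stand-in for the continuous stream of users:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def online_logistic_regression(get_next_user_example, n_features, alpha=0.05):
    theta = np.zeros(n_features)
    while True:                              # repeat forever on the stream
        x, y = get_next_user_example()       # features include the asking price
        h = sigmoid(x @ theta)               # current estimate of p(y = 1 | x; theta)
        theta -= alpha * (h - y) * x         # update using this one example...
        # ...then discard the example; we never reuse it, so the model can
        # adapt as user preferences drift over time.
```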

2b. Map Reduce and Data Parallelism

  • Map reduce and linear regression
    • This is an alternative to stochastic gradient descent and mini-batch gradient descent
    • It gives exactly the same result as batch gradient descent, but (see the sketch at the end of this section):
      1. Split the training set into subsets
      2. Send each subset to one of (say) 4 computers, each of which computes the partial sum of gradient terms over its subset
      3. Combine the partial sums on a central server and perform the parameter update
    • Many learning algorithms can be expressed as computing sums of functions over the training set
  • Map reduce and logistic regression
    • This can be done too, as the cost function and its gradient can again be expressed as sums of functions computed over the training set
  • Map reduce and neural network
    • Suppose you want to apply map-reduce to train a neural network on 10 machines
      • In each iteration, each of the 10 machines computes forward propagation and back propagation on its 1/10 of the data to get the derivatives for that 1/10 of the data; the partial derivatives are then combined
  • Multi-core machines
    • We can split training sets to different cores and then combine the results
      • “Parallelizing” over multiple cores in the same machine makes network latency less of an issue
      • There are some libraries to automatically “parallelize” by just implementing the usual vectorized implementation
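A sketch of the map-reduce idea for one step of batch gradient descent, using Python's multiprocessing pool as a stand-in for 4 separate machines (or cores): each worker computes the partial sum of gradient terms over its shard ("map"), and the partial sums are combined before the parameter update ("reduce"):

```python
import numpy as np
from multiprocessing import Pool

def partial_gradient(args):
    X_shard, y_shard, theta = args
    # Sum of (h(x_i) - y_i) * x_i over this shard only
    return X_shard.T @ (X_shard @ theta - y_shard)

def mapreduce_gradient_step(X, y, theta, alpha=0.01, num_workers=4):
    m = X.shape[0]
    X_shards = np.array_split(X, num_workers)
    y_shards = np.array_split(y, num_workers)
    with Pool(num_workers) as pool:                       # "map" across workers
        partials = pool.map(partial_gradient,
                            [(Xs, ys, theta) for Xs, ys in zip(X_shards, y_shards)])
    grad = sum(partials) / m                              # "reduce": combine the sums
    return theta - alpha * grad                           # same step as batch gradient descent

if __name__ == "__main__":
    X = np.c_[np.ones(400), np.random.randn(400, 2)]
    y = X @ np.array([1.0, 2.0, -3.0])
    theta = np.zeros(3)
    for _ in range(100):
        theta = mapreduce_gradient_step(X, y, theta)
    print(theta)
```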