Non-linear hypothesis, neurons and the brain, model representation, and multi-class classification.

1. Motivations

I would like to give full credit to the respective authors, as these are my personal Python notebooks based on deep learning courses from Andrew Ng, Data School and Udemy :) This is a simple Python notebook hosted generously through GitHub Pages, part of my main personal notes repository at https://github.com/ritchieng/ritchieng.github.io. The notes are meant for my personal review, but I have open-sourced the repository because many people have found it useful.

1a. Non-linear Hypothesis

  • You can build a non-linear hypothesis by adding more (e.g. polynomial) features
    • But with many raw features this quickly becomes too expensive to compute
  • If you have an image with 50 x 50 pixels (greyscale, not RGB)
    • n = 50 x 50 = 2500 raw features
    • quadratic features ≈ (2500 x 2500) / 2 ≈ 3.1 million (see the quick check after this list)
  • Neural networks are much better for a complex nonlinear hypothesis
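
A quick sanity check (a minimal Python sketch, not from the course) of the feature count quoted above: with n = 2500 raw pixels, the number of distinct quadratic terms x_i * x_j is n(n+1)/2, which the lecture approximates as n^2/2, roughly 3 million features.

    # Number of quadratic (pairwise) features for a 50 x 50 greyscale image
    n = 50 * 50                          # 2,500 raw pixel intensities
    approx_quadratic = n * n // 2        # lecture's approximation: n^2 / 2
    exact_quadratic = n * (n + 1) // 2   # exact count of x_i * x_j terms with i <= j

    print(approx_quadratic)              # 3125000
    print(exact_quadratic)               # 3126250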

1b. Neurons and the Brain

  • Origins
    • Algorithms that try to mimic the brain
  • Was very widely used in the '80s and early '90s
    • Popularity diminished in the late '90s
  • Recent resurgence
    • State-of-the-art techniques for many applications
  • The “one learning algorithm” hypothesis
    • The auditory cortex normally handles hearing
      • If re-wired to receive visual input, it learns to see
    • The somatosensory cortex normally handles touch
      • If re-wired to receive visual input, it also learns to see
    • Plug in data and the brain will learn accordingly
  • Examples of learning

2. Neural Networks

2a. Model Representation I

  • Neuron in the brain
    • Many neurons in our brain
    • Dendrites: receive input
    • Axon: produces output
      • When a neuron sends a message, it travels along the axon
      • The message arrives at another neuron’s dendrites
  • Neuron model: logistic unit
    • Yellow circle: body of neuron
    • Input wires: dendrites
    • Output wire: axon
  • Neural Network
    • 3 Layers
      • Layer 1: input layer
      • Layer 2: hidden layer
        • Its values are not observed in the training set
        • Any layer that is neither the input nor the output layer
      • Layer 3: output layer
      • We calculate each of the layer-2 activations from the input values together with the bias term x0 (which is equal to 1)
        • i.e. x0 to x3
        • We then calculate the final hypothesis (i.e. the single node in layer 3) using exactly the same logic, except the inputs are not the x values but the activation values from the preceding layer
      • The activation value of each hidden unit (e.g. a_1^(2)) is the sigmoid function applied to a linear combination of the inputs
        • e.g. a_1^(2) = g(Θ_10^(1) x_0 + Θ_11^(1) x_1 + Θ_12^(1) x_2 + Θ_13^(1) x_3)
        • Three input units
        • Θ^(1) is the matrix of parameters governing the mapping of the input units to the hidden units
          • Θ^(1) here is a [3 x 4] dimensional matrix
      • Three hidden units
        • Then Θ^(2) is the matrix of parameters governing the mapping of the hidden layer to the output layer
          • Θ^(2) here is a [1 x 4] dimensional matrix (i.e. a row vector)
      • Every input/activation goes to every node in the following layer
        • Which means each “layer transition” uses a matrix of parameters Θ_ji^(l) with the following significance
          • j (first of the two subscripts) ranges from 1 to the number of units in layer l+1
          • i (second of the two subscripts) ranges from 0 to the number of units in layer l
          • l is the layer you’re moving FROM
  • Notation
    • a_i^(j): activation of unit i in layer j
    • Θ^(j): matrix of weights controlling the mapping from layer j to layer j+1
      • If the network has s_j units in layer j and s_(j+1) units in layer j+1, then Θ^(j) has dimension [s_(j+1) x (s_j + 1)]
      • This matches the [3 x 4] and [1 x 4] shapes above (illustrated in the numpy sketch below)
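
To make the shapes concrete, here is a minimal numpy sketch (my own, not from the course) of the 3-input / 3-hidden-unit / 1-output network described above; the Θ values are random placeholders, and only the [3 x 4] and [1 x 4] shapes follow the notes.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Placeholder parameters with the shapes from the notes
    Theta1 = np.random.randn(3, 4)       # layer 1 (3 inputs + bias) -> layer 2 (3 hidden units)
    Theta2 = np.random.randn(1, 4)       # layer 2 (3 hidden units + bias) -> layer 3 (1 output)

    x = np.array([1.0, 0.5, -1.2, 2.0])  # [x0 = 1 (bias), x1, x2, x3]

    a2 = sigmoid(Theta1 @ x)             # layer-2 activations a_1^(2)..a_3^(2), shape (3,)
    a2 = np.insert(a2, 0, 1.0)           # add the bias unit a_0^(2) = 1
    h = sigmoid(Theta2 @ a2)             # hypothesis h_Theta(x), shape (1,)
    print(a2, h)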

2b. Model Representation II

  • Here we’ll look at how to carry out the computation efficiently through a vectorized implementation. We’ll also consider why neural networks are good and how we can use them to learn complex non-linear hypotheses
  • Forward propagation: vectorized implementation (see the sketch after this list)
    • g applies the sigmoid function element-wise to the vector z
    • This process of calculating h_Θ(x) is called forward propagation
      • Worked out from the first layer
      • Starts off with activations of input unit
      • Propagate forward and calculate the activation of each layer sequentially
  • Similar to logistic regression if you leave out the first layer
    • Only second and third layer
    • Third layer resembles a logistic regression node
    • The features in layer 2 are calculated/learned, not original features
    • Neural network, learns its own features
      • The features a’s are learned from x’s
      • It learns its own features to feed into logistic regression
      • Better hypothesis than if we were constrained with just x1, x2, x3
      • We can have whatever features we want to feed to the final logistic regression function
      • Implementation in Octave for a2
        • a2 = sigmoid(Theta1 * x);  % x includes the bias unit x0 = 1
  • Other network architectures
    • e.g. a deeper network where layers 2 and 3 are both hidden layers
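
A minimal sketch of vectorized forward propagation in numpy (the function name forward_propagate is mine, not course notation); each step prepends the bias unit, multiplies by Θ^(l), and applies the sigmoid element-wise, exactly the sequence described above.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward_propagate(x, thetas):
        """Return the final-layer activations for an input vector x (without bias)."""
        a = x
        for theta in thetas:
            a = np.insert(a, 0, 1.0)   # prepend the bias unit (= 1) for this layer
            z = theta @ a              # z^(l+1) = Theta^(l) * a^(l)
            a = sigmoid(z)             # g applied element-wise
        return a

    # Example using the [3 x 4] and [1 x 4] shapes from Model Representation I
    Theta1 = np.random.randn(3, 4)
    Theta2 = np.random.randn(1, 4)
    h = forward_propagate(np.array([0.5, -1.2, 2.0]), [Theta1, Theta2])
    print(h)   # h_Theta(x): a single sigmoid output between 0 and 1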

3. Neural Network Application

3a. Examples and Intuitions I

  • XOR/XNOR
    • XOR: exclusive or (outputs 1 when exactly one of x1, x2 is 1)
    • XNOR: not (exclusive or), i.e. outputs 1 when x1 and x2 are equal
  • AND function
    • Outputs 1 only if x1 and x2 are both 1
    • Draw a truth table to check whether the unit computes OR or AND (see the sketch after this list)
  • NAND function
    • NOT AND
  • OR function
    • Outputs 1 if x1 or x2 (or both) is 1
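
The single-unit gates above can be checked with a small truth-table script; the weights below are one common choice (any sufficiently large values behave the same), so treat this as an illustrative sketch rather than the only possible parameters.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def unit(theta, x1, x2):
        # Single logistic unit: g(theta0 + theta1 * x1 + theta2 * x2), rounded to 0/1
        return round(sigmoid(theta[0] + theta[1] * x1 + theta[2] * x2))

    AND  = [-30, 20, 20]     # 1 only when x1 = x2 = 1
    OR   = [-10, 20, 20]     # 1 when x1 = 1 or x2 = 1
    NAND = [ 30, -20, -20]   # NOT (x1 AND x2)

    print("x1 x2 AND OR NAND")
    for x1 in (0, 1):
        for x2 in (0, 1):
            print(x1, x2, unit(AND, x1, x2), unit(OR, x1, x2), unit(NAND, x1, x2))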

3b. Examples and Intuitions II

  • NOT function
    • Outputs 1 when the input is 0, and 0 when the input is 1
  • XNOR function
    • NOT XOR
    • Outputs 1 when x1 and x2 are both 1 (“both”) or both 0 (“neither”)
      • Hence in the hidden layer we would want two units
        • x1 AND x2 (“both”)
        • (NOT x1) AND (NOT x2) (“neither”)
      • The output unit then ORs the two hidden units (see the sketch after this list)
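
Putting the pieces together, here is a hedged numpy sketch of the two-layer XNOR network: the hidden layer computes “both” (x1 AND x2) and “neither” ((NOT x1) AND (NOT x2)), and the output unit ORs them; the weight values are the usual course-style choices, but any sufficiently large weights work.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Hidden layer: row 0 computes x1 AND x2 ("both"),
    # row 1 computes (NOT x1) AND (NOT x2) ("neither")
    Theta1 = np.array([[-30.0,  20.0,  20.0],
                       [ 10.0, -20.0, -20.0]])
    # Output layer: ORs the two hidden units
    Theta2 = np.array([[-10.0, 20.0, 20.0]])

    def xnor(x1, x2):
        a1 = np.array([1.0, x1, x2])                   # input with bias
        a2 = np.insert(sigmoid(Theta1 @ a1), 0, 1.0)   # hidden activations with bias
        return round(float(sigmoid(Theta2 @ a2)[0]))   # output unit, rounded to 0/1

    for x1 in (0, 1):
        for x2 in (0, 1):
            print(x1, x2, xnor(x1, x2))   # 1 when x1 == x2, else 0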

3c. Multi-class Classification

  • Example: identify 4 classes
    • You would want h_Θ(x) to be a 4 x 1 vector
    • 4 logistic regression classifiers (units) in the output layer
    • There will be 4 outputs, one per class
    • y would be a 4 x 1 one-hot vector instead of an integer label (see the sketch after this list)
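
A small sketch of the output encoding described above (class names and activation values are made up for illustration): y is a 4 x 1 one-hot vector rather than an integer, and the predicted class is the output unit with the largest activation.

    import numpy as np

    classes = ["pedestrian", "car", "motorcycle", "truck"]   # illustrative labels

    def to_one_hot(label_index, num_classes=4):
        # Encode an integer class label as a 4 x 1 target vector y
        y = np.zeros((num_classes, 1))
        y[label_index] = 1.0
        return y

    print(to_one_hot(2).ravel())          # [0. 0. 1. 0.] -> "motorcycle"

    # If h is the 4 x 1 vector of output-layer activations h_Theta(x),
    # the predicted class is simply the unit with the largest activation
    h = np.array([[0.1], [0.7], [0.15], [0.05]])
    print(classes[int(np.argmax(h))])     # "car"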