Cost function, back propagation, forward propagation, unrolling parameters, gradient checking, and random initialization.

## 1. Cost Function and Back Propagation

I would like to give full credit to the respective authors, as these are my personal Python notebooks taken from deep learning courses by Andrew Ng, Data School, and Udemy :) This is a simple Python notebook hosted generously through GitHub Pages from my main personal notes repository at https://github.com/ritchieng/ritchieng.github.io. These notes are meant for my personal review, but I have open-sourced the repository as many people found it useful.

### 1a. Cost Function

- Neural Network Introduction
    - One of the most powerful learning algorithms
    - Learning algorithm for fitting the derived parameters given a training set
- Neural Network Classification
- Cost Function for Neural Network
    - Two parts in the NN’s cost function
        - First half (-1 / m part)
            - For each training example (1 to m), sum over each position in the output vector (1 to K)
        - Second half (lambda / 2m part)
            - Weight decay (regularization) term
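The two parts above can be sketched in NumPy. This is a minimal illustration, assuming one-hot label vectors `y` and sigmoid outputs `h`; the function name `nn_cost` and its argument layout are my own, not from the course:

```python
import numpy as np

def nn_cost(h, y, thetas, lam):
    """Regularized neural network cost J(Ɵ) (illustrative sketch).

    h:      (m, K) matrix of hypothesis outputs for m training examples
    y:      (m, K) one-hot matrix of actual labels
    thetas: list of Ɵ weight matrices, one per layer
    lam:    regularization strength lambda
    """
    m = y.shape[0]
    # First half (-1 / m part): sum over all m examples and all K output units
    cost = -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) / m
    # Second half (lambda / 2m part): weight decay over all non-bias weights
    reg = sum(np.sum(theta[:, 1:] ** 2) for theta in thetas)
    return cost + lam * reg / (2 * m)
```

Note that the bias column (`theta[:, 0]`) is excluded from the weight decay term, as in the course convention.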

### 1b. Overview

- Forward propagation
    - Algorithm that takes your neural network and the initial input (x) and pushes the input through the network
- Back propagation
    - Takes the output from your neural network, H(Ɵ)
    - Compares it to the actual output, y
    - Calculates H(Ɵ)’s deviation from the actual output
    - Takes the error H(Ɵ) - y from layer L
    - Back-calculates the error associated with each unit in the preceding layer, L - 1
    - The error calculated for each unit is used to compute the partial derivatives
    - Uses the partial derivatives with gradient descent to minimise the cost function J(Ɵ)
- Basic things to note
    - Ɵ matrix for each layer in the network
        - Has each node in layer l as one dimension and each node in layer l + 1 as the other dimension
    - Δ matrix for each layer
        - Has each node as one dimension and each training example as the other dimension
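Forward propagation as described above can be sketched as follows; this is a minimal illustration assuming sigmoid activations, with `forward_propagate` as an illustrative name:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagate(x, thetas):
    """Push the initial input x through the network layer by layer.

    thetas: list of Ɵ matrices; each has shape
            (units in layer l + 1, units in layer l plus 1 bias column).
    Returns the list of activations for every layer.
    """
    activations = [x]
    a = x
    for theta in thetas:
        a = np.concatenate(([1.0], a))  # prepend the bias unit
        a = sigmoid(theta @ a)          # a^(l+1) = g(Ɵ^(l) a^(l))
        activations.append(a)
    return activations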

### 1c. Backpropagation Algorithm

- Gradient Computation
    - Purpose is to find the parameters Ɵ that minimize J(Ɵ)
- Forward Propagation
- Backpropagation Equation
    - For each node we can calculate δj^l
        - This is the error of node j in layer l
    - aj^l is the activation of node j in layer l
        - The activation has some error compared to the “real” value
        - Delta captures this error
    - But the “real” value is an issue
        - The neural network is an artificial construct we made
        - We have to use the actual output, y, and work backwards from there
    - Work from the last/output layer, L
        - If L = 4, delta^4 = a^4 - y
    - Once we have the error from the output layer, we can determine the error in the preceding layers
        - delta^3
        - delta^2
        - There is no error term for the first layer, as it is simply the actual input x
    - Once we get the deltas, we can calculate the gradient, i.e. the partial derivatives of the cost function
        - Terms in the equation
        - Dimensions
- Backpropagation Algorithm
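The delta computation above, for a single training example, can be sketched as follows. This is a minimal illustration assuming sigmoid activations (so g'(z) = a(1 - a)); the function name `backprop` is my own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, thetas):
    """One backpropagation pass for a single training example (sketch).

    Returns the list of unregularized gradients dJ/dƟ, one per layer.
    """
    # Forward pass, keeping the activations (with bias units) of every layer
    a = np.concatenate(([1.0], x))
    activations = [a]
    for theta in thetas:
        a = np.concatenate(([1.0], sigmoid(theta @ a)))
        activations.append(a)
    # Output layer error: delta^L = a^L - y
    delta = activations[-1][1:] - y
    grads = [None] * len(thetas)
    # Back-calculate the error of each preceding layer from delta^L
    for l in range(len(thetas) - 1, -1, -1):
        grads[l] = np.outer(delta, activations[l])
        if l > 0:
            a_prev = activations[l]
            # delta^l = (Ɵ^l)' delta^(l+1) .* g'(z^l), dropping the bias row;
            # no delta for the first layer, which is just the input x
            delta = (thetas[l].T @ delta)[1:] * a_prev[1:] * (1 - a_prev[1:])
    return grads
```

In practice these per-example gradients are accumulated into the Δ matrices over all m examples before updating Ɵ.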

### 1d. Backpropagation Intuition

- Forward Propagation
- Backpropagation’s cost function with 1 output unit

## 2. Backpropagation in Practice

### 2a. Unrolling Parameters

- Advanced optimization
    - Issue here is that we have to unroll the matrices into vectors for algorithms like fminunc
- Example
    - s1 (layer 1 units) = 10
    - s2 (layer 2 units) = 10
    - s3 (layer 3 units) = 1
    - Ɵ1(:) unrolls into a vector
    - You can go back with reshape by pulling out each bunch of elements and reshaping accordingly
        - thetaVec(1:110) pulls out Ɵ1, the first 10 x 11 = 110 elements
- Learning Algorithm
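The notes use Octave's `(:)` and `reshape`; the NumPy equivalent of the example above (s1 = 10, s2 = 10, s3 = 1) is a sketch along these lines:

```python
import numpy as np

# Layer sizes from the example: s1 = 10, s2 = 10, s3 = 1
theta1 = np.random.randn(10, 11)  # maps layer 1 (+ bias) to layer 2
theta2 = np.random.randn(1, 11)   # maps layer 2 (+ bias) to layer 3

# Unroll both matrices into one long parameter vector for the optimizer
theta_vec = np.concatenate([theta1.ravel(), theta2.ravel()])

# Go back with reshape: the first 10 x 11 = 110 elements are Ɵ1
theta1_back = theta_vec[:110].reshape(10, 11)
theta2_back = theta_vec[110:].reshape(1, 11)
```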

### 2b. Gradient Checking

- It might look like J(Ɵ) is decreasing
    - But you might not know that there is a bug
    - You can do gradient checking to gain confidence that your implementation is correct
- We can numerically estimate gradients
    - Octave implementation
    - n is the dimension of Ɵ
- Implementation Note
    - Always turn off your numerical checking code before training, as it is very slow to execute
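The numerical estimate above uses a two-sided difference for each of the n components of Ɵ. A minimal Python sketch (the course shows this in Octave; `numerical_gradient` is an illustrative name):

```python
import numpy as np

def numerical_gradient(J, theta, eps=1e-4):
    """Two-sided numerical estimate of dJ/dƟ for each component of Ɵ."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):  # n = dimension of Ɵ
        perturb = np.zeros_like(theta)
        perturb[i] = eps
        # (J(Ɵ + eps e_i) - J(Ɵ - eps e_i)) / (2 eps)
        grad[i] = (J(theta + perturb) - J(theta - perturb)) / (2 * eps)
    return grad
```

Compare this estimate against the backpropagation gradients once, then turn the check off, as the note above says, since it requires two full cost evaluations per parameter.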

### 2c. Random Initialization

- Initial value of Ɵ
    - Initialising Ɵ with zeros does not work for neural networks
        - After each update, the parameters corresponding to the inputs going into each of two hidden units are identical
        - Your NN would not learn anything interesting
    - Solution is to use random initialisation
- Random initialization: symmetry breaking
    - You will then be able to use gradient descent or any advanced optimisation method to find good values for Ɵ
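Symmetry breaking can be sketched by drawing each weight uniformly from a small interval [-ε, ε]; the function name `random_init` and the default ε = 0.12 (a common choice in the course exercises) are assumptions for illustration:

```python
import numpy as np

def random_init(l_in, l_out, epsilon=0.12):
    """Random Ɵ for a layer with l_in inputs (plus bias) and l_out outputs.

    Each weight is drawn uniformly from [-epsilon, epsilon] so that no two
    hidden units start with identical parameters (symmetry breaking).
    """
    return np.random.uniform(-epsilon, epsilon, size=(l_out, l_in + 1))
```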

### 2d. Putting everything together

- NN’s layers
- Six steps for training a neural network
- J(Ɵ)’s closeness to the actual values
- Gradient descent: taking little steps downhill to find the lowest J(Ɵ)
- Backpropagation: computing the direction of the gradient
- Able to fit non-linear functions
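The "little steps downhill" of gradient descent amount to a repeated update of every Ɵ along the negative gradient computed by backpropagation. A minimal sketch, with `gradient_descent_step` and the learning rate `alpha` as illustrative names:

```python
import numpy as np

def gradient_descent_step(thetas, grads, alpha=0.01):
    """One gradient descent step: move each Ɵ a little downhill along -dJ/dƟ."""
    return [theta - alpha * grad for theta, grad in zip(thetas, grads)]
```

Repeating this step until J(Ɵ) stops decreasing is the core of the training loop described above.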