Cost function, back propagation, forward propagation, unrolling parameters, gradient checking, and random initialization.

1. Cost Function and Back Propagation

I would like to give full credit to the respective authors, as these are my personal Python notebooks taken from deep learning courses by Andrew Ng, Data School and Udemy :) This is a simple Python notebook, hosted generously through GitHub Pages, from my main personal notes repository. The notes are meant for my personal review, but I have open-sourced the repository as many people have found it useful.

1a. Cost Function

  • Neural Network Introduction
    • One of the most powerful learning algorithms
    • A learning algorithm for fitting the parameters of the network given a training set
  • Neural Network Classification
  • Cost Function for Neural Network
    • Two parts in the NN’s cost function (see the sketch after this list)
      • First half (the -1 / m part)
        • For each training example (i = 1 to m)
          • Sum over each position in the output vector (k = 1 to K)
      • Second half (the lambda / 2m part)
        • Regularization (weight decay) term over all Ɵ values, excluding the bias terms
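
A minimal sketch of the regularized cost described above, assuming a sigmoid-output network whose weights are given as a list of Ɵ matrices with the bias column first (the function and variable names here are mine, not the course's):

```python
import numpy as np

def nn_cost(h, Y, thetas, lam):
    """Regularized neural network cost J(theta).

    h:      (m, K) matrix of network outputs h_theta(x), one row per example
    Y:      (m, K) matrix of one-hot labels
    thetas: list of weight matrices, one per layer, bias column first
    lam:    regularization strength lambda
    """
    m = Y.shape[0]
    # First half: -1/m times the sum over all m examples and all K output units
    cost = -np.sum(Y * np.log(h) + (1 - Y) * np.log(1 - h)) / m
    # Second half: lambda/2m weight-decay term (bias columns excluded)
    reg = lam / (2 * m) * sum(np.sum(theta[:, 1:] ** 2) for theta in thetas)
    return cost + reg
```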

1b. Overview

  • Forward propagation
    • Algorithm that takes your neural network and the initial input (x) and pushes the input through the network (see the sketch after this list)
  • Back propagation
    • Takes output from your neural network H(Ɵ)
      • Compares it to actual output y
      • Calculates H(Ɵ)’s deviation from actual output
    • Takes the error H(Ɵ) - y from layer L
      • Back calculates error associated with each unit from the preceding layer L - 1
      • Error calculated from each unit used to calculate partial derivatives
    • Use partial derivatives with gradient descent to minimise cost function J(Ɵ)
  • Basic things to note
    • Ɵ matrix for each layer in the network
      • This has each node in layer l as one dimension and each node in l+1 as the other dimension
    • Δ matrix for each layer
      • This has each node as one dimension and each training data example as the other
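
A minimal forward-propagation sketch matching the description above, assuming sigmoid activations and one Ɵ matrix per layer with the bias column first (function names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagate(x, thetas):
    """Push the input x through the network, returning the activations of every layer."""
    activations = [x]
    for theta in thetas:
        a = np.concatenate(([1.0], activations[-1]))  # add the bias unit
        activations.append(sigmoid(theta @ a))        # a^(l+1) = g(Theta^l * a^l)
    return activations                                # activations[-1] is h_theta(x)
```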

1c. Backpropagation Algorithm

  • Gradient Computation
    • Purpose is to find parameters Ɵ that minimize J(Ɵ)
  • Forward Propagation
  • Backpropagation Equation
    • For each node we can calculate δ_j^l
      • This is the error of node j in layer l
    • a_j^l is the activation of node j in layer l
      • The activation will have some error compared to the “real” value
      • δ_j^l captures this error
    • But the “real” value is an issue
      • The neural network is an artificial construct we made
      • We have to use the actual output y and work backwards from there
    • Work from the last/output layer, L
      • If L = 4
      • delta^4 = a^4 - y
    • We can determine the error in the preceding layers once we have the error from the output layer
      • delta^3 = (Ɵ^3)^T * delta^4 .* g'(z^3)
      • delta^2 = (Ɵ^2)^T * delta^3 .* g'(z^2)
      • There is no delta^1, as the first layer is simply the actual input x
    • Once we get the deltas, we can calculate the gradients, i.e. the partial derivatives of the cost function
      • Ignoring regularisation: ∂J(Ɵ)/∂Ɵ_ij^l = a_j^l * delta_i^(l+1)
    • Terms in equation
    • Dimensions
  • Backpropagation Algorithm (a sketch follows this list)
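
A minimal sketch of the backpropagation algorithm described above, assuming sigmoid activations, one-hot labels, and the same Ɵ layout as earlier (bias column first); the function and variable names are mine, not the course's:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backpropagate(X, Y, thetas, lam):
    """One pass over all m examples, returning the gradients D^l = dJ/dTheta^l."""
    m = X.shape[0]
    Delta = [np.zeros_like(theta) for theta in thetas]  # capital-Delta accumulators

    for x, y in zip(X, Y):
        # Forward propagation, keeping the activation of every layer
        activations = [x]
        for theta in thetas:
            a = np.concatenate(([1.0], activations[-1]))
            activations.append(sigmoid(theta @ a))

        # Backpropagation: start with the output-layer error delta^L = a^L - y
        delta = activations[-1] - y
        for l in range(len(thetas) - 1, -1, -1):
            a = np.concatenate(([1.0], activations[l]))
            Delta[l] += np.outer(delta, a)  # Delta^l += delta^(l+1) * (a^l)^T
            if l > 0:
                # delta^l = (Theta^l)^T delta^(l+1) .* g'(z^l), dropping the bias row
                delta = (thetas[l].T @ delta)[1:] * activations[l] * (1 - activations[l])

    # D^l = (1/m) Delta^l + (lambda/m) Theta^l, with the bias column left unregularized
    grads = []
    for theta, D in zip(thetas, Delta):
        reg = (lam / m) * theta
        reg[:, 0] = 0.0
        grads.append(D / m + reg)
    return grads
```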

1d. Backpropagation Intuition

  • Forward Propagation
  • Backpropagation’s cost function with 1 output unit (see the sketch below)
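
For the single-output-unit case used in the intuition lecture, each training example contributes essentially the logistic-regression cost; a minimal sketch, with δ_j^l read (informally, ignoring regularisation) as how sensitive this cost is to the weighted input z_j^l:

```python
import numpy as np

def example_cost(h, y):
    """Cost contributed by one training example when there is a single output unit.
    h is the network output h_theta(x); y is the 0/1 label."""
    return -y * np.log(h) - (1 - y) * np.log(1 - h)
```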

2. Backpropagation in Practice

2a. Unrolling Parameters

  • Advanced optimization
    • The issue here is that we have to unroll the Ɵ matrices into a single vector for algorithms such as fminunc
  • Example
    • s1 (layer 1 units) = 10
    • s2 (layer 2 units) = 10
    • s3 (layer 3 units) = 1
    • Ɵ1(:) unrolls into a vector
    • You can go back with reshape by pulling out each block of elements and reshaping it accordingly (see the numpy sketch after this list)
      • thetaVec(1:110) pulls out the first 10 x 11 = 110 elements, which reshape(thetaVec(1:110), 10, 11) turns back into Ɵ1
  • Learning Algorithm
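
A minimal numpy equivalent of the unroll/reshape trick with the example dimensions above (the course shows this in Octave with Theta1(:) and reshape; the variable names here are mine):

```python
import numpy as np

# Example dimensions from above: s1 = 10, s2 = 10, s3 = 1
theta1 = np.random.rand(10, 11)  # maps layer 1 -> layer 2
theta2 = np.random.rand(10, 11)  # maps layer 2 -> layer 3
theta3 = np.random.rand(1, 11)   # maps layer 3 -> output

# Unroll all matrices into one long vector for the optimiser
theta_vec = np.concatenate([theta1.ravel(), theta2.ravel(), theta3.ravel()])

# Go back: pull out each block of elements and reshape it
theta1_back = theta_vec[:110].reshape(10, 11)     # first 10 x 11 = 110 elements
theta2_back = theta_vec[110:220].reshape(10, 11)
theta3_back = theta_vec[220:231].reshape(1, 11)
```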

2b. Gradient Checking

  • It might look like J(Ɵ) is decreasing
    • But there may still be a bug in your backpropagation implementation
    • You can do gradient checking to verify that your implementation is correct
  • We can numerically estimate the gradients (see the sketch after this list)
  • Octave implementation
    • n is the dimension of Ɵ
  • Implementation Note
    • Always turn off your gradient checking code before training, as it is very slow to execute
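
A minimal numpy version of the two-sided numerical estimate the course shows in Octave (the function and parameter names here are mine):

```python
import numpy as np

def numerical_gradient(cost_fn, theta_vec, eps=1e-4):
    """Two-sided numerical estimate of dJ/dtheta for each element of the unrolled theta_vec."""
    grad_approx = np.zeros_like(theta_vec)
    for i in range(theta_vec.size):  # n = dimension of theta
        theta_plus, theta_minus = theta_vec.copy(), theta_vec.copy()
        theta_plus[i] += eps
        theta_minus[i] -= eps
        grad_approx[i] = (cost_fn(theta_plus) - cost_fn(theta_minus)) / (2 * eps)
    return grad_approx

# Compare grad_approx against the gradient from backpropagation;
# the two should agree to several decimal places.
```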

2c. Random Initialization

  • Initial value of Ɵ
    • Initialising Ɵ with all zeros does not work for neural networks
    • After each update, the parameters corresponding to the inputs going into each of the two hidden units stay identical (the symmetry is never broken)
    • Your NN would not learn anything interesting
    • The solution is random initialisation
  • Random initialization: symmetry breaking
    • Initialise each Ɵ_ij^l to a random value in [-ε, ε] (see the sketch after this list)
    • You will then be able to use gradient descent or any advanced optimisation method to find good values for Ɵ
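
A minimal sketch of symmetry-breaking initialisation, drawing each weight uniformly from [-ε, ε]; the ε value of 0.12 is just an illustrative choice, and the function name is mine:

```python
import numpy as np

def random_initialize(l_in, l_out, epsilon_init=0.12):
    """Weights for a layer with l_in incoming units (plus bias) and l_out outgoing units,
    each entry drawn uniformly from [-epsilon_init, epsilon_init] to break symmetry."""
    return np.random.rand(l_out, l_in + 1) * 2 * epsilon_init - epsilon_init

theta1 = random_initialize(10, 10)  # s1 = 10 -> s2 = 10
theta2 = random_initialize(10, 1)   # s2 = 10 -> s3 = 1
```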

2d. Putting everything together

  • NN’s layers
  • Six steps for training a neural network (an end-to-end sketch follows this list)
    • Randomly initialise the weights
    • Implement forward propagation to get hƟ(x) for any input x
    • Implement code to compute the cost function J(Ɵ)
    • Implement backpropagation to compute the partial derivatives
    • Use gradient checking to confirm backpropagation is correct, then disable it
    • Use gradient descent or an advanced optimisation method with backpropagation to minimise J(Ɵ)
  • J(Ɵ): how close the network’s outputs are to the actual values
    • Gradient descent: taking little steps downhill to find the lowest J(Ɵ)
    • Backpropagation: computing the direction of the gradient
      • Able to fit non-linear functions
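
An illustrative end-to-end assembly of the steps above, assuming a cost_and_grad(theta_vec, X, Y) helper that returns (J, unrolled gradient) via the forward/backpropagation sketches from earlier, and using scipy's general-purpose optimiser in place of fminunc (this is my own sketch, not the course's code):

```python
from scipy.optimize import minimize

def train(X, Y, cost_and_grad, initial_theta_vec):
    """Minimise J(theta) with an advanced optimiser, standing in for the course's fminunc.

    cost_and_grad(theta_vec, X, Y) is assumed to return (J, unrolled gradient),
    computed with forward propagation and backpropagation as sketched earlier.
    """
    result = minimize(cost_and_grad, initial_theta_vec, args=(X, Y),
                      jac=True, method="L-BFGS-B", options={"maxiter": 400})
    return result.x  # the trained, unrolled theta vector
```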