Build a deep neural network with ReLUs and Softmax.
Building a Deep Neural Network

Deep Neural Networks: Introduction

Linear Model Complexity

  • If we have N inputs and K outputs, we would have:
    • (N+1)K parameters
  • Limitation
    • $y = x_1 + x_2$ can be represented well
    • $y = x_1 x_2$ (a product of inputs) cannot be represented well
  • Benefits
    • Derivatives are constants
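The (N+1)K count can be checked with a quick sketch; the MNIST-like sizes below are an assumed example, not from the notes:

```python
# Parameter count of a linear model y = Wx + b mapping N inputs to K outputs.
N, K = 784, 10            # e.g. 28x28 images, 10 classes (illustrative sizes)
num_params = (N + 1) * K  # N weights plus 1 bias per output
print(num_params)         # 7850
```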

Rectified Linear Units (ReLUs)

  • This is a non-linear function: $\mathrm{ReLU}(x) = \max(0, x)$.
    • Its derivative is nicely represented too: 0 for $x < 0$ and 1 for $x > 0$.
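A minimal NumPy sketch of the ReLU and its derivative:

```python
import numpy as np

def relu(x):
    # ReLU(x) = max(0, x), applied element-wise
    return np.maximum(0.0, x)

def relu_grad(x):
    # Derivative: 0 for x < 0, 1 for x > 0 (undefined at 0; 0 by convention here)
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))       # [0.  0.  0.  1.5]
print(relu_grad(x))  # [0. 0. 0. 1.]
```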

Network of ReLUs: Neural Network

  • We can take a logistic classifier and insert ReLUs to make the model non-linear.
    • H: the number of ReLU units in the hidden layer

2-Layer Neural Network

  1. The first layer effectively consists of the set of weights and biases applied to X and passed through ReLUs. The output of this layer is fed to the next one, but is not observable outside the network, hence it is known as a hidden layer.
  2. The second layer consists of the weights and biases applied to these intermediate outputs, followed by the softmax function to generate probabilities.
    • A softmax regression has two steps: first we add up the evidence of our input being in certain classes, and then we convert that evidence into probabilities.
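The two layers above can be sketched in NumPy; the layer sizes and random weights are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    # Shift by the row max for numerical stability before exponentiating.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def two_layer_forward(X, W1, b1, W2, b2):
    hidden = np.maximum(0.0, X @ W1 + b1)  # layer 1: weights + biases, then ReLU (hidden layer)
    logits = hidden @ W2 + b2              # layer 2: adds up the "evidence" per class
    return softmax(logits)                 # converts evidence into probabilities

rng = np.random.default_rng(0)
N, H, K = 4, 8, 3  # input size, number of ReLU units, number of classes (assumed)
X = rng.normal(size=(2, N))
W1, b1 = rng.normal(size=(N, H)), np.zeros(H)
W2, b2 = rng.normal(size=(H, K)), np.zeros(K)
probs = two_layer_forward(X, W1, b1, W2, b2)
print(probs.sum(axis=1))  # each row sums to 1
```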

Stacking Simple Operations

  • We can compute the derivative of a composed function by multiplying the derivatives of its components (the chain rule).
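For a stack of two operations, the chain-rule product looks like:

```latex
% If y = g(f(x)), then
\frac{dy}{dx} = \frac{dg}{df} \cdot \frac{df}{dx}
% Each stacked operation (linear layer, ReLU, softmax, loss)
% contributes one factor to the product.
```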


  • Forward-propagation
    • You will have data X flowing through your NN to produce Y.
  • Back-propagation
    • Your labelled data Y flows backward through the network to calculate the "errors" of our calculations.
    • You calculate the gradients (the "errors"), multiply them by a learning rate, and use the result to update the weights.
    • We will be doing this many times.
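The forward pass / backward pass / update loop can be sketched with a linear model and squared error; the data, learning rate, and iteration count are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w = np.zeros(3)
lr = 0.1                              # learning rate (assumed value)
for _ in range(200):                  # "we will be doing this many times"
    pred = X @ w                      # forward pass: X flows through to produce predictions
    grad = X.T @ (pred - y) / len(y)  # backward pass: gradient of the squared-error loss
    w -= lr * grad                    # update: gradient times learning rate
print(np.round(w, 3))
```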

Go Deeper

  • Increasing the size of the hidden layers (by adding more nodes) is not the best way to add capacity.
    • Wide, shallow networks get hard to train.
  • Instead, we should go deeper by adding more hidden layers.
    • Deeper models are much more parameter-efficient.
    • However, you need large datasets to train them.
    • Also, deep models capture hierarchical structure well (e.g. in images: edges, then shapes, then objects).


  • We normally train networks that are bigger than our data strictly requires.
    • Then we try to prevent overfitting with two methods.
      • Early termination
      • Regularization
        • Applying artificial constraints on the network.
        • Implicitly reduces the number of free parameters without making optimization harder.
        • L2 Regularization
          • We add another term to the loss that penalizes large weights.
          • This is simple because we just add to our loss.
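Adding the L2 term to the loss is one line; the `beta` strength below is an assumed small constant:

```python
import numpy as np

def loss_with_l2(data_loss, w, beta=1e-3):
    # Total loss = data loss + beta * (1/2) * ||w||^2 — penalizes large weights.
    return data_loss + beta * 0.5 * np.sum(w ** 2)

w = np.array([3.0, -4.0])
print(loss_with_l2(1.0, w, beta=0.01))  # 1.0 + 0.01 * 0.5 * 25 = 1.125
```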

L2 Regularization's Derivative

  • The squared norm of $w$ is the sum of squares of the elements in the vector.
    • The equation: $ \mathcal{L}' = \mathcal{L} + \beta \frac{1}{2} \|w\|_2^2 $
    • The derivative:
      • $ (\frac {1}{2} \|w\|_2^2)' = w $

Regularization: Dropout

  • Your input goes through an activation function on its way to the next layer.
    • During training, we randomly take half of those activations and set them to 0.
    • We do this with a fresh random selection on every pass.
  • The network is forced to learn redundant representations.
    • It's like a game of whack-a-mole.
    • There is always one or more activations left that represent the same thing.
  • Benefits
    • It prevents overfitting.
    • It makes network act like it's taking a consensus of an ensemble of networks.
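A minimal sketch of the training-time masking described above, with assumed shapes:

```python
import numpy as np

rng = np.random.default_rng(42)
activations = rng.normal(size=(4, 6))  # outputs of some hidden layer (illustrative)

keep_prob = 0.5                                    # drop half of the units, on average
mask = rng.random(activations.shape) < keep_prob   # fresh random mask each training pass
dropped = activations * mask                       # zero out the dropped activations
```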

Dropout during Evaluation

  • At evaluation time we no longer drop units; instead we take the expectation of the training activations (i.e. scale them so their expected value matches what the next layer saw during training).
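One common way to get that expectation is "inverted dropout": scale the kept units by `1/keep_prob` during training so evaluation needs no change at all. A sketch, with assumed shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=(3, 5))  # activations (illustrative)
keep_prob = 0.5

def dropout(a, keep_prob, train):
    if train:
        mask = rng.random(a.shape) < keep_prob
        return a * mask / keep_prob  # scale kept units so E[output] == a
    return a                         # evaluation: activations pass through unchanged

out_eval = dropout(a, keep_prob, train=False)
```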