Build a deep neural network with ReLUs and Softmax.

## Deep Neural Networks: Introduction

**Linear Model Complexity**

- If we have N inputs and K outputs, we would have:
  - (N+1)K parameters (N weights plus one bias per output)

- Limitation
  - $y = x_1 + x_2$ can be represented well
  - $y = x_1 * x_2$ cannot be represented well

- Benefits
  - Derivatives are constants
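The parameter count above can be sanity-checked with a quick sketch (the sizes N = 784 and K = 10 are just illustrative assumptions, e.g. 28×28 images and 10 classes):

```python
import numpy as np

# A linear model y = Wx + b mapping N inputs to K outputs.
N, K = 784, 10
W = np.zeros((K, N))  # K x N weight matrix
b = np.zeros(K)       # one bias per output

n_params = W.size + b.size
print(n_params)  # (N + 1) * K
```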

**Rectified Linear Units (ReLUs)**

- This is a non-linear function: $ReLU(x) = \max(0, x)$.
- Its derivative is nicely represented too: 1 for $x > 0$, 0 otherwise.
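As a concrete sketch, ReLU and its derivative in NumPy (the function names are my own):

```python
import numpy as np

def relu(x):
    """max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Derivative of ReLU: 1 where x > 0, else 0."""
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.5, 3.0])
print(relu(x))       # -> 0, 0, 0.5, 3
print(relu_grad(x))  # -> 0, 0, 1, 1
```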

**Network of ReLUs: Neural Network**

- We can take a logistic classifier and insert a ReLU to make the model non-linear.
- H: number of ReLU units

**2-Layer Neural Network**

- The first layer effectively consists of the set of weights and biases applied to X and passed through ReLUs. The output of this layer is fed to the next one, but is not observable outside the network, hence it is known as a hidden layer.
- The second layer consists of the weights and biases applied to these intermediate outputs, followed by the softmax function to generate probabilities.
- A softmax regression has two steps: first we add up the evidence of our input being in certain classes, and then we convert that evidence into probabilities.
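The two layers can be sketched as a single forward pass in NumPy (all sizes, weights, and variable names here are illustrative assumptions, not values from the notes):

```python
import numpy as np

def softmax(z):
    """Convert evidence into probabilities."""
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
N, H, K = 4, 8, 3  # inputs, hidden ReLU units, output classes

W1, b1 = rng.normal(size=(N, H)) * 0.1, np.zeros(H)  # first (hidden) layer
W2, b2 = rng.normal(size=(H, K)) * 0.1, np.zeros(K)  # second (output) layer

x = rng.normal(size=(1, N))
h = np.maximum(0.0, x @ W1 + b1)  # hidden layer: linear + ReLU (not observable outside)
p = softmax(h @ W2 + b2)          # output layer: linear + softmax
print(p, p.sum())                 # class probabilities, summing to 1
```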

**Stacking Simple Operations**

- We can compute the derivative of a composed function by taking the product of the derivatives of its components (the chain rule).
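A minimal numeric check of this idea, using an arbitrary composition $y = \sin(x^2)$ of my own choosing (not from the notes): the chain rule says the derivative is the product of the component derivatives, $\cos(x^2) \cdot 2x$.

```python
import numpy as np

x = 1.3

# Chain rule: y = f(g(x)) with g(x) = x**2, f(u) = sin(u)
# dy/dx = f'(g(x)) * g'(x) = cos(x**2) * 2x
analytic = np.cos(x**2) * 2 * x

# Central finite difference as an independent check
eps = 1e-6
numeric = (np.sin((x + eps)**2) - np.sin((x - eps)**2)) / (2 * eps)
print(analytic, numeric)  # the two agree closely
```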

**Backpropagation**

- Forward-propagation
  - Data X flows through the network to produce predictions Y.

- Back-propagation
  - The labelled data flows backward through the network to calculate the "errors" of our calculations.
  - We calculate the gradients ("errors"), multiply them by a learning rate, and use them to update the weights.
  - We repeat this many times.
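The forward/backward/update cycle above can be sketched end-to-end for the 2-layer ReLU + softmax network (toy sizes, random data, and variable names are all assumptions for illustration):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
N, H, K, lr = 4, 8, 3, 0.1             # toy sizes and learning rate

W1, b1 = rng.normal(size=(N, H)) * 0.1, np.zeros(H)
W2, b2 = rng.normal(size=(H, K)) * 0.1, np.zeros(K)

X = rng.normal(size=(16, N))           # toy input batch
Y = np.eye(K)[rng.integers(0, K, 16)]  # toy one-hot labels

losses = []
for step in range(100):
    # Forward-propagation: X flows through the network to produce predictions.
    Z1 = X @ W1 + b1
    A1 = np.maximum(0.0, Z1)           # hidden ReLU layer
    P = softmax(A1 @ W2 + b2)
    losses.append(-np.mean(np.sum(Y * np.log(P + 1e-12), axis=1)))

    # Back-propagation: "errors" flow backward to give gradients.
    dZ2 = (P - Y) / len(X)             # gradient of cross-entropy through softmax
    dW2, db2 = A1.T @ dZ2, dZ2.sum(axis=0)
    dZ1 = (dZ2 @ W2.T) * (Z1 > 0)      # gate by the ReLU derivative
    dW1, db1 = X.T @ dZ1, dZ1.sum(axis=0)

    # Update: gradient times learning rate, subtracted from the weights.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(losses[0], losses[-1])           # loss decreases over the repeats
```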

**Go Deeper**

- It is better to go deeper than to increase the size of the hidden layers (by adding more nodes)
  - Wide, shallow networks get hard to train.

- We should go deeper by adding more hidden layers.
  - You reap parameter efficiencies.
  - However, you need large datasets.
  - Also, deep models can capture hierarchical structure well (e.g., for images: edges, then shapes, then objects).

**Regularization**

- We normally train networks that are bigger than our data.
- Then we try to prevent overfitting with two methods:
  - Early termination
    - Stop training as soon as validation performance stops improving.
  - Regularization
    - Applying artificial constraints.
    - Implicitly reduces the number of free parameters while still letting us optimize.
  - L2 Regularization
    - We add another term to the loss that penalizes large weights.
    - This is simple because we just add it to our loss.

**L2 Regularization's Derivative**

- The L2 norm of w is the square root of the sum of squares of the elements in the vector; the regularizer uses the squared norm.
- The equation: $\mathcal{L}' = \mathcal{L} + \beta \frac{1}{2} \lVert w \rVert_2^2$
- The derivative:
  - $(\frac{1}{2} w^2)' = w$, so the gradient of the added term is $\beta w$.
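A minimal sketch of the L2 term and its gradient (beta, w, and the data-loss value are placeholder assumptions):

```python
import numpy as np

beta = 1e-3                  # regularization strength (hyperparameter)
w = np.array([0.5, -2.0, 1.5])
data_loss = 0.42             # placeholder for the unregularized loss

# Total loss: L' = L + beta * 1/2 * ||w||^2
l2_term = beta * 0.5 * np.sum(w ** 2)
total_loss = data_loss + l2_term

# Gradient of the added term: beta * w  (since d/dw (1/2 w^2) = w)
l2_grad = beta * w
print(total_loss, l2_grad)
```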

**Regularization: Dropout**

- Your input goes through an activation function.
- At the activations, we randomly take half of the values and set them to 0.
- We do this with a fresh random selection on every training pass.

- The network is forced to learn redundant representations.
- It's like a game of whack-a-mole: knock one activation out, and there is always one or more others that represent the same thing.

- Benefits
- It prevents overfitting.
- It makes network act like it's taking a consensus of an ensemble of networks.
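A dropout training pass can be sketched as a random binary mask over the activations (keep probability 0.5, i.e. half set to zero, as in the notes; variable names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
keep_prob = 0.5
h = rng.normal(size=(1, 10))            # some layer of activations

mask = rng.random(h.shape) < keep_prob  # fresh random mask on every pass
h_dropped = h * mask                    # roughly half the activations become 0
print(h_dropped)
```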

**Dropout during Evaluation**

- At evaluation time we want deterministic predictions, so we take the expectation of the training activations: $y_{eval} = E[y_{train}]$. In practice we remove the random dropout and scale the activations accordingly (or scale during training, i.e. "inverted dropout").
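A sketch of the "inverted dropout" convention, one common way to realize $y_{eval} = E[y_{train}]$: surviving activations are scaled up by 1/keep_prob at training time, so evaluation uses the raw activations with no change.

```python
import numpy as np

rng = np.random.default_rng(0)
keep_prob = 0.5
h = rng.normal(size=(1000,))             # some layer of activations

# Training: drop and rescale, so E[h_train] equals h element-wise.
mask = rng.random(h.shape) < keep_prob
h_train = h * mask / keep_prob

# Evaluation: no dropout at all; h_eval is just h.
print(h_train.mean(), h.mean())          # close, by the scaling
```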