Convolutional Neural Nets: Introduction

Translation Invariance

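Image
- The same object can appear at different positions.
- It is still the same object, so the network should recognize it wherever it is.
Text
- The word "kitten" means the same thing wherever it appears in a long text.
- You can use weight sharing: apply the same weights to each of those inputs and train them jointly.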

Convnets

Neural networks that share their parameters across space.
We take a small patch of the image and run a small neural network on it.
We then slide that network across the image, keeping the weights the same at every position.
- The result is a new layer with greater depth but smaller spatial extent (width and height).
- We slide another network over this layer, which again increases the depth and reduces the spatial extent.
- We continue until we reach a final depth of k, where k is the number of outputs we want.
Instead of having stacks of matrix multiplications, we have stacks of convolutions.
- At each stage we reduce the spatial extent and increase the depth.
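
The sliding step above can be sketched in NumPy. This is a minimal illustration, not an efficient or reference implementation: the function name `conv2d`, the array layouts and the layer sizes are all assumptions made for this example, and no padding is applied.

```python
import numpy as np

def conv2d(image, filters, bias, stride=1):
    """Slide one shared set of weights over every position of the input.

    image:   array of shape (H, W, D_in)
    filters: array of shape (F, F, D_in, K); K filters give output depth K
    bias:    array of shape (K,)
    (These layouts are assumptions chosen for this sketch.)
    """
    H, W, _ = image.shape
    F, _, _, K = filters.shape
    out_h = (H - F) // stride + 1
    out_w = (W - F) // stride + 1
    out = np.zeros((out_h, out_w, K))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            patch = image[r:r + F, c:c + F, :]
            # The same `filters` and `bias` are reused at every (i, j):
            # this is the weight sharing across space.
            out[i, j, :] = np.tensordot(patch, filters, axes=([0, 1, 2], [0, 1, 2])) + bias
    return out

# Hypothetical sizes: an 8x8 RGB input and two stacked convolutions.
rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8, 3))
w1, b1 = rng.normal(size=(3, 3, 3, 16)), np.zeros(16)
w2, b2 = rng.normal(size=(3, 3, 16, 32)), np.zeros(32)

layer1 = conv2d(image, w1, b1)             # shape (6, 6, 16)
layer2 = conv2d(layer1, w2, b2, stride=2)  # shape (2, 2, 32)
```

Each layer is spatially smaller and deeper than the one before it, which is the pattern described in the bullets above.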

Convnets Terms

Strides, depth and padding

Calculating Output Size

Padding Size

In general it's common to see same (zero) padding, stride 1 and filters of size FxF.
Zero-padding = $\frac {F - 1}{2}$
- If you do not pad (valid padding), the width and height of your layers shrink with every convolution.
  - This might not be something you want.
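
As a worked check of the padding rule, the standard output-size relation for input size $W$, filter size $F$, zero-padding $P$ and stride $S$ is $\frac{W - F + 2P}{S} + 1$. The helper names below are hypothetical, and the sketch assumes a square filter and rounds down when the stride does not divide evenly (an assumption here; conventions vary).

```python
def conv_output_size(w, f, p=0, s=1):
    """Output width (or height) for input size w, filter size f,
    zero-padding p and stride s. Rounds down if s does not divide evenly."""
    return (w - f + 2 * p) // s + 1

def same_padding(f):
    """Zero-padding that keeps the size unchanged at stride 1.
    Only defined here for odd filter sizes."""
    if f % 2 == 0:
        raise ValueError("this sketch only handles odd filter sizes")
    return (f - 1) // 2

# 32-wide input, 5x5 filter, stride 1
padded = conv_output_size(32, 5, p=same_padding(5), s=1)  # 32: size preserved
unpadded = conv_output_size(32, 5, p=0, s=1)              # 28: size shrinks
```

With $P = \frac{F - 1}{2}$ and $S = 1$ the formula reduces to $W$, which is why same padding preserves width and height.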

Depth

Number of filters = depth.
We try to keep this in powers of two.
- 32, 64, 128, 512 etc.
- This is for computational reasons.

Number of Parameters
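
A minimal sketch of the usual count, assuming square FxF filters and one bias per filter: each filter has $F \cdot F \cdot D_{in}$ weights plus 1 bias, and a layer with $K$ filters has $(F \cdot F \cdot D_{in} + 1) \cdot K$ parameters. The function name and the example sizes are hypothetical.

```python
def conv_param_count(f, d_in, k, bias=True):
    """Parameters in one conv layer: k filters, each f x f x d_in,
    plus one bias per filter when bias=True."""
    per_filter = f * f * d_in + (1 if bias else 0)
    return per_filter * k

# 5x5 filters over an RGB input (depth 3), 32 filters
first_layer = conv_param_count(5, 3, 32)    # (75 + 1) * 32 = 2432
# 3x3 filters over a depth-32 input, 64 filters
second_layer = conv_param_count(3, 32, 64)  # (288 + 1) * 64 = 18496
```

The count does not depend on the width or height of the input, because the same filters are reused at every position.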

Convolution Networks

- Fully Connected (FC) Layer
  - Each neuron connects to the entire input volume, as in an ordinary neural network.
  - Final layer after we have done all our convolutions.
- ReLU Layers
  - Remember that a ReLU layer typically follows every Conv and FC layer, except the final output layer.

Advanced convnet-ology

Pooling

Striding
- We shift the filter by a few pixels each time.
- This is a very aggressive method that removes a lot of information.
Pooling
- We can take a smaller stride.
- Take all the convolution outputs in a neighborhood.
- Combine them somehow, and this is called pooling.
  - We will be preserving the depth.
  - But we will be reducing the width and height.
1. Max Pooling
  - At every point in a feature map, look at a small neighborhood around that point and compute the maximum of all the responses around it.
  - Typical architecture
2. Average Pooling
  - Instead of taking the max, we take the average.
  - It's similar to taking a blurred, low-resolution view of the feature map.
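
Both pooling variants can be sketched with one NumPy helper. This is an illustrative sketch under assumptions: the name `pool2d`, the `(H, W, D)` layout and the 2x2 window with stride 2 are choices made for this example.

```python
import numpy as np

def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Pool each depth slice independently over size x size neighborhoods.

    feature_map: array of shape (H, W, D). Depth D is preserved;
    width and height are reduced.
    """
    if mode not in ("max", "avg"):
        raise ValueError("mode must be 'max' or 'avg'")
    reduce_fn = np.max if mode == "max" else np.mean
    H, W, D = feature_map.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.empty((out_h, out_w, D))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            window = feature_map[r:r + size, c:c + size, :]
            # Combine the responses in the neighborhood, per depth slice.
            out[i, j, :] = reduce_fn(window, axis=(0, 1))
    return out

# A 4x4 feature map with depth 1, holding the values 0..15 row by row.
x = np.arange(16, dtype=float).reshape(4, 4, 1)
max_pooled = pool2d(x, mode="max")[:, :, 0]  # [[5, 7], [13, 15]]
avg_pooled = pool2d(x, mode="avg")[:, :, 0]  # [[2.5, 4.5], [10.5, 12.5]]
```

Max pooling keeps the strongest response in each neighborhood, while average pooling smooths the responses, which matches the blurred, low-resolution description above.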