Essential terms for understanding deep learning research papers, tutorials and textbooks.

Term | Description |
---|---|

Jacobian matrix | The matrix containing all partial derivatives of a function whose input and output are both vectors |

Hessian matrix | The matrix of all second-order partial derivatives of a scalar-valued function; equivalently, the Jacobian of the gradient |

First-order optimization algorithms | Optimization algorithms that use only the gradient, such as gradient descent |

Second-order optimization algorithms | Optimization algorithms that also use the Hessian matrix, such as Newton’s method |

Constrained Optimization | Finding the maximal or minimal value of f(x) for values of x restricted to some set S, rather than over all possible values of x |

Karush-Kuhn-Tucker (KKT) Approach | A very general solution to constrained optimization that makes use of the generalized Lagrange function (generalized Lagrangian) |

KKT Conditions | A simple set of properties that describe the optimal points of constrained optimization problems. They are necessary, but not always sufficient, conditions for a point to be optimal |

Hyperparameters | Settings of a machine learning algorithm that must be determined outside the learning algorithm itself |

Accuracy | Proportion of examples for which the model produces the correct output |

Error rate | Proportion of examples for which the model produces an incorrect output |

Design matrix | Matrix containing a different example in each row and a different feature in each column |

Underfitting | The model cannot obtain a sufficiently low error value on the training set |

Overfitting | The gap between the training error and the test error is too large |

Capacity | The model’s ability to fit a wide variety of functions |

Hypothesis space | Set of functions that the learning algorithm is allowed to select as the solution |

Representational capacity | The family of functions, specified by the model, that the learning algorithm can choose from when varying the parameters to reduce the training objective |

Occam’s razor | Among competing hypotheses that explain the known observations equally well, one should choose the simplest one |

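Vapnik-Chervonenkis (VC) dimension | Measures the capacity of a binary classifier. It is the largest m for which there exists a set of m different points that the classifier can label arbitrarily |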

Nearest neighbor Regression | Non-parametric model that predicts the target of the training point closest to the query point, with closeness typically measured by the L2 (Euclidean) distance |

Parametric Models | Models that learn a function described by a parameter vector whose size is finite and fixed before any data is observed, such as linear regression. A parametric model with less than optimal capacity will asymptote to an error value that exceeds the Bayes error |

Non-parametric Models | Models with no fixed limit on the number of parameters, such as nearest neighbour regression. More data yields better generalization until the best possible error is reached |

Nearest neighbour regression | It simply stores X and y from the training set; when given a test point x, it looks up the nearest entry in the training set and returns the associated target |

Bayes error | The error incurred by an oracle that knows the true data-generating distribution p(x, y) and makes its predictions from it |

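Generalization error | The expected error on a new, previously unobserved input (also called the test error). The expected generalization error can never increase as the number of training examples increases |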

No Free Lunch Theorem | Averaged over all possible data generating distributions, every classification algorithm has the same error rate when classifying previously unobserved points |

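Weight decay | A regularizer that penalizes the squared L2 norm of the weights, scaled by a coefficient λ. A large λ leads to underfitting, a medium λ is just right, and a small λ (approaching zero) leads to overfitting |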

Regularization | Any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error. A common way to regularize a model is to add a penalty, called a regularizer, to the cost function, but there are other ways too |

Hyperparameter | A setting that we can use to control the behaviour of the learning algorithm. A setting is sometimes made a hyperparameter because it is not appropriate to learn it on the training set. For example, a hyperparameter controlling model capacity, if learned on the training set, would always be chosen to maximize capacity, which results in overfitting |

Validation set | Examples, held out from the training data, that the training algorithm does not observe. This is not the test set. It is used to guide the selection of our hyperparameters. Since it is used to “train” the hyperparameters, the validation set error will underestimate the generalization error, though typically by a smaller amount than the training error does |

Test set | This is the set we use to estimate our generalization error after all our hyperparameter optimization is complete. If the test set is small, it can be problematic as this implies statistical uncertainty around the estimated test error |

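K-fold Cross Validation | We partition the data into k non-overlapping subsets. On trial i, the i-th subset is used as the test set and the rest of the data as the training set. The test error can then be estimated by taking the average test error across the k trials. This is computationally expensive |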

Point estimator or statistic | Point estimation is the attempt to provide the single “best” prediction of some quantity of interest. A point estimator or statistic is any function of the i.i.d. data. Since the data is drawn from a random process, any function of the data is random, and therefore the point estimator is a random variable |

Function estimator | The estimation of the relationship between input and output variables. A function estimator can also be seen as a point estimator in function space |

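Bias | \(bias(\hat\theta_m) = \mathbf{E}(\hat\theta_m) - \theta\), where the expectation is taken over the data and \(\theta\) is the true value of the parameter. The estimator \(\hat\theta_m\) is unbiased if \(bias(\hat\theta_m) = 0\) |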
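
The bias definition in the last row can be checked numerically. The sketch below is an illustration, not part of the original glossary: it assumes data drawn i.i.d. from a standard normal distribution (true variance 1) and compares two variance estimators, one dividing by m and one dividing by m - 1. The helper name `estimate_bias` and the choice of m = 5 are hypothetical. For the divide-by-m estimator the theoretical bias is -σ²/m, which is -0.2 here, while the divide-by-(m - 1) estimator is unbiased.

```python
import random
import statistics


def estimate_bias(estimator, true_value, m, trials=10000, seed=0):
    """Monte Carlo approximation of bias = E[estimator(data)] - true_value."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        # Draw m i.i.d. samples from a standard normal (true variance = 1).
        sample = [rng.gauss(0.0, 1.0) for _ in range(m)]
        estimates.append(estimator(sample))
    # The average over many datasets approximates the expectation E[theta_hat_m].
    return statistics.fmean(estimates) - true_value


m = 5
# statistics.pvariance divides by m (biased as an estimator of the variance).
bias_divide_by_m = estimate_bias(statistics.pvariance, 1.0, m)
# statistics.variance divides by m - 1 (unbiased).
bias_divide_by_m_minus_1 = estimate_bias(statistics.variance, 1.0, m)

print(f"divide by m:     estimated bias = {bias_divide_by_m:+.3f}")
print(f"divide by m - 1: estimated bias = {bias_divide_by_m_minus_1:+.3f}")
```

Because the expectation is approximated by averaging over a finite number of simulated datasets, the printed values only approach the theoretical biases; increasing `trials` tightens the approximation.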