Linear Regression with Multiple Variables

Linear Regression with Multiple Variables.

1. Multivariate Linear Regression

I would like to give full credits to the respective authors as these are my personal python notebooks taken from deep learning courses from Andrew Ng, Data School and Udemy :) This is a simple python notebook hosted generously through Github Pages that is on my main personal notes repository on https://github.com/ritchieng/ritchieng.github.io. They are meant for my personal review but I have open-source my repository of personal notes as a lot of people found it useful.

1a. Multiple Features (Variables)

X1, X2, X3, X4 and more
New hypothesis
Multivariate linear regression
- Can reduce hypothesis to single number with a transposed theta matrix multiplied by x matrix

1b. Gradient Descent for Multiple Variables

Summary
New Algorithm

1c. Gradient Descent: Feature Scaling

Ensure features are on similar scale
- Gradient descent will take longer to reach the global minimum when the features are not on a similar scale
- Feature scaling allows you to reach the global minimum faster
So long they’re close enough, need not be between 1 and -1
Mean normalization

1d. Gradient Descent: Checking

Can you a graph
- x-axis: number of iterations
- y-axis: min J(theta)
Or use automatic convergence test
- Tough to gauge epsilon
Gradient descent that is not working (large learning rate)

1e. Gradient Descent: Learning Rate

Alpha (Learning Rate) too small: slow convergence
Alpha (Learning Rate) too large:
- J(theta) may not decrease on every iteration
- May not converge (diverge)
Start with 0.001 and increase x3 each time until you reach an acceptable alpha
- Choose a slightly smaller number than that acceptable alpha value

1f. Features and Polynomial Regression

Ensure the features capture the pattern
- Doesn’t make sense to choose quadratic equation for house prices
- Use cubic or square root
There are automatic algorithms, and this will be discussed later

2. Computing Parameters Analytically

2a. Normal Equation

Method to solve for theta analytically
If theta is real number
- Minimise J(theta) is to take the derivative and equate to zero
- Solve for theta
If theta is not
- Take partial derivative and equate to zero
- Solve for all thetas
Minimise Cost Function: Specific Example
- X: m x (n + 1)
  - m: number of training examples
  - n: number of features
- X_transpose: (n + 1) x m
- X_transpose * X: (n + 1) x m * m x (n + 1) = (n + 1) x (n + 1)
- (X_transpose * X)^-1 * X_transpose: (n + 1) x (n + 1) * (n + 1) x m = (n + 1) x m
- theta = (n + 1) x m * m x 1 = (n + 1) x 1
Minimise Cost Function: General
Minimise Cost: Octave Code
- No need for feature scaling using normal equation
- pinv (X' * X) * X' * y
Gradient Descent vs Normal Equation

Gradient Descent	Normal Equation
Need to choose alpha	No need to choose alpha
Needs many iterations	Don’t need to iterate
Works with large n (10,000)	Slow if n is large (100, 1000 is fine)
Number of features > 1000	So long number features < 1000

2b. Normal Equation Non-invertibility

What happens if X_transpose * X is non-invertible (singular or degenerate)
- pinv (X' * X) * X' * y
- This works regardless if it is non-invertible
Intuition of non-invertibility
- Causes of non-invertibility
- Delete redundant features to solve non-invertibility problem
- Delete some features or use regularization

Tags: