Gaussian Naive Bayes, Bayesian Learning, and Bayesian Networks
Gaussian Naive Bayes

## Naive Bayes Methods

Bayes Rule: Intuitive Explanation

• (Prior probability) × (test evidence) → (posterior probability)
• Example
• P(C) = 0.01
• The test is positive 90% of the time if you have C (sensitivity)
• The test is negative 90% of the time if you don't have C (specificity)
• prior
• P(C) = 0.01
• P(C') = 0.99
• P(Pos|C) = 0.9
• P(Pos|C') = 0.1
• P(Neg|C') = 0.9
• P(Neg|C) = 0.1
• joint
• P(C and Pos) = P(C)P(Pos|C) = (0.01)(0.9) = 0.009
• P(C'and Pos) = P(C')P(Pos|C') = (0.99)(0.1) = 0.099
• normalizer
• P(Pos) = P(C and Pos) + P(C' and Pos) = 0.108
• posterior
• P(C|Pos) = 0.009 / 0.108 = 0.0833
• P(C'|Pos) = 0.099 / 0.108 = 0.9167
• Adding both gives 1.0 (the same arithmetic is sketched in code below)
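A minimal sketch of the same arithmetic in Python, using the numbers from the example above:

In [ ]:
p_c = 0.01                      # prior P(C)
p_pos_given_c = 0.9             # sensitivity, P(Pos|C)
p_pos_given_not_c = 0.1         # 1 - specificity, P(Pos|C')

# joint probabilities
p_c_and_pos = p_c * p_pos_given_c                  # 0.009
p_not_c_and_pos = (1 - p_c) * p_pos_given_not_c    # 0.099

# normalizer and posteriors
p_pos = p_c_and_pos + p_not_c_and_pos              # 0.108
print(p_c_and_pos / p_pos)      # P(C|Pos)  ≈ 0.0833
print(p_not_c_and_pos / p_pos)  # P(C'|Pos) ≈ 0.9167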

Bayes Rule: Example

• This is really good for text learning
• Example
• P(Chris) = 0.5
• P(Love|Chris) = 0.1
• P(Deal|Chris) = 0.8
• P(Life|Chris) = 0.1
• P(Sara) = 0.5
• P(Love|Sara) = 0.5
• P(Deal|Sara) = 0.2
• P(Life|Sara) = 0.3
In [40]:
p_chris_and_love_deal = 0.1*0.8*0.5

p_sara_and_love_deal = 0.5*0.2*0.5

normalizer = p_chris_and_love_deal + p_sara_and_love_deal

p_chris_given_love_deal = p_chris_and_love_deal / normalizer
p_sara_given_love_deal = p_sara_and_love_deal / normalizer

# P(Chris | "Love Deal")
print(p_chris_given_love_deal)

# P(Sara | "Love Deal")
print(p_sara_given_love_deal)

0.4444444444444445
0.5555555555555555


Bayes Rule: Theory

• Learn the best hypothesis given data and some domain knowledge
• Learn the most probable hypothesis given data and domain knowledge
• $$\underset{h∈H}{\mathrm{argmax}}P(h|D)$$
• h: some hypothesis
• D: some data
• argmax h∈H
• We want to maximize P(h|D)
• Bayes rule
• $$P(h|D) = \frac{P(D|h)P(h)}{P(D)}$$
• $$P(a,b) = P(a|b)P(b)$$
• $$P(a,b) = P(b|a)P(a)$$
• P(a,b) is the probability of a and b
• P(D)
• This is a normalizing term
• Prior on the data
• P(D|h)
• Data given the hypothesis
• $$D =\{x_i, d_i\}$$
• Training data, D, with inputs (x) and labels (d)
• What's the likelihood that, given all the x_i and assuming hypothesis h is true, we will observe the d_i's?
• P(h)
• Prior on h
• Domain knowledge
• For example, if you use KNN, you encode the prior belief that points close together are more likely to give similar outputs than points far from one another

Bayesian Learning Algorithm

• For each h∈H, calculate P(h|D) ∝ P(D|h)P(h)
• Output
• $$h_1 = \underset{h}{\mathrm{argmax}}P(h|D)$$
• h_1: Maximum a posteriori
• $$h_2 = \underset{h}{\mathrm{argmax}}P(D|h)$$
• h_2: Maximum likelihood
• P(D) is the same for every hypothesis, so it drops out of the argmax
• If we further assume the prior P(h) is uniform, it is a constant too, so the MAP hypothesis reduces to the maximum likelihood hypothesis (the two are contrasted in the sketch below)
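A tiny sketch contrasting MAP and ML on a discrete hypothesis set; the hypothesis names, priors, and likelihoods below are made-up illustrative values, not from the notes:

In [ ]:
# hypothetical hypothesis set with priors P(h) and likelihoods P(D|h)
hypotheses = {
    'h1': {'prior': 0.7, 'likelihood': 0.1},
    'h2': {'prior': 0.2, 'likelihood': 0.5},
    'h3': {'prior': 0.1, 'likelihood': 0.9},
}

# MAP maximizes P(D|h)P(h); ML maximizes P(D|h) alone
h_map = max(hypotheses, key=lambda h: hypotheses[h]['likelihood'] * hypotheses[h]['prior'])
h_ml = max(hypotheses, key=lambda h: hypotheses[h]['likelihood'])

print(h_map)  # h2: 0.5 * 0.2 = 0.10 is the largest unnormalized posterior
print(h_ml)   # h3: 0.9 is the largest likelihood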

Gaussian Naive Bayes

• Ultimately, if we assume Gaussian-distributed noise on the labels, maximizing the likelihood simplifies to minimizing the sum of squared errors!
• Based on Bayes rule we've ended up deriving the sum of squared errors, as the derivation below makes explicit
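To make the step explicit: assuming each label is generated as $$d_i = f(x_i) + \varepsilon_i$$ with Gaussian noise $$\varepsilon_i \sim N(0, \sigma^2)$$, the maximum likelihood hypothesis is

$$h_{ML} = \underset{h}{\mathrm{argmax}} \prod_i \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(d_i - h(x_i))^2}{2\sigma^2}} = \underset{h}{\mathrm{argmin}} \sum_i (d_i - h(x_i))^2$$

because taking logs turns the product into a sum and drops the constant terms, leaving only the squared errors.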

Bayesian Classification

• The algorithm changes slightly here
• Instead of committing to the single MAP hypothesis, we take a vote over all hypotheses weighted by their posteriors (the Bayes optimal classifier)
• $$\underset{v}{\mathrm{argmax}} \sum_{h∈H} P(v|h)P(h|D)$$

Version Space

• The version space is a subset of the hypothesis space, where the hypothesis space is the space of all possible hypotheses
• The version space consists of those hypotheses that correctly predict all of the training data you have (essentially a 100% fit to the training set; see the toy enumeration below)
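A toy enumeration of a version space, assuming a tiny hypothesis space of one-feature threshold classifiers; the data and thresholds are made up for illustration:

In [ ]:
# training data as (x, label) pairs -- illustrative only
data = [(1.0, 0), (2.0, 0), (3.0, 1), (4.0, 1)]

# hypothesis space: "predict 1 if x > t" for a handful of thresholds
hypothesis_space = [lambda x, t=t: int(x > t) for t in (0.5, 1.5, 2.5, 3.5)]

# version space: the hypotheses that fit the training data perfectly
version_space = [h for h in hypothesis_space
                 if all(h(x) == label for x, label in data)]
print(len(version_space))  # 1 -- only the t=2.5 threshold fits everything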

Bayesian Networks, Bayesian Nets, Belief Networks or Graphical Models

• Representing and dealing with probabilities
• Conditional independence
• X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z
• P(X=x | Y=y, Z=z) = P(X=x | Z=z)
• More compactly
• P(X|Y,Z) = P(X|Z)
• Sampling order must follow a topological ordering of the graph (parents before children)
• Graph must be acyclic
• No cycles
• Sampling
• Two things distributions are for
• Probability of value
• Generate values
• Reasons for sampling
• Simulation of a complex process
• Approximate inference
• For machines
• Exact inference can be intractable or slow, so we approximate it with samples (see the sampling sketch after this list)
• Visualization
• For humans to get a feel
• Inferencing Rules
• Marginalization
• $$P(x) = \sum_y P(x,y)$$
• Chain rule
• $$P(x,y) = P(y|x)P(x) = P(x|y)P(y)$$
• Bayes rule
• $$P(y|x) = \frac {P(x|y)P(y)}{P(x)}$$
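A minimal ancestral-sampling sketch for a two-node network A → B; the structure and probability tables are illustrative assumptions, not from the notes:

In [ ]:
import random

p_a = 0.3                              # P(A = True)
p_b_given_a = {True: 0.8, False: 0.1}  # P(B = True | A)

def sample():
    # sample in topological order: parent A before child B
    a = random.random() < p_a
    b = random.random() < p_b_given_a[a]
    return a, b

# approximate inference: estimate the marginal P(B) from samples,
# which should approach 0.3*0.8 + 0.7*0.1 = 0.31 (marginalization)
samples = [sample() for _ in range(100_000)]
print(sum(b for _, b in samples) / len(samples))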

Naive Bayes

• Say you have labels A and B (hidden)
• Label A
• Have multiple words with different probabilities
• Every word gives evidence for whether it's label A
• We multiply all the probabilities with the prior to find the joint probability of A
• Label B
• Have multiple words with different probabilities
• Every word gives evidence if it's label B
• We multiply all the probabilities with the prior to find the joint probability of B
• Now you can normalize to find the probability of it being A or B (sketched in code below)
• Reason why it's called Naive
• It ignores word order!
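A hand-rolled sketch of the scoring just described, reusing the Chris/Sara word probabilities from the earlier example:

In [ ]:
priors = {'Chris': 0.5, 'Sara': 0.5}
likelihoods = {
    'Chris': {'love': 0.1, 'deal': 0.8, 'life': 0.1},
    'Sara':  {'love': 0.5, 'deal': 0.2, 'life': 0.3},
}

def posteriors(words):
    # multiply each word's likelihood with the prior (word order is ignored)
    joints = {label: priors[label] for label in priors}
    for label in joints:
        for word in words:
            joints[label] *= likelihoods[label][word]
    normalizer = sum(joints.values())
    return {label: joint / normalizer for label, joint in joints.items()}

print(posteriors(['love', 'deal']))  # {'Chris': 0.444..., 'Sara': 0.555...}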

Naive Bayes Benefits

• Inference is cheap
• Linear
• Few parameters
• Estimate parameters with labeled data
• Connects inference and classification
• Empirically successful

Naive Bayes Training

• In the training process of a Bayes classification problem, we use the labeled sample data to do the following (sketched after the list):
1. Estimate likelihood distributions of X for each value of Y
2. Estimate prior probability P(Y=j)
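A minimal sketch of those two estimation steps for continuous features, assuming per-class Gaussian likelihoods as in Gaussian naive Bayes:

In [ ]:
import numpy as np

def fit_gaussian_nb(X, y):
    # for each class j: estimate the prior P(Y=j) and the Gaussian
    # likelihood parameters (mean, variance) of X given Y=j
    params = {}
    for label in np.unique(y):
        X_label = X[y == label]
        params[label] = {
            'prior': len(X_label) / len(X),  # step 2: P(Y = j)
            'mean': X_label.mean(axis=0),    # step 1: likelihood params
            'var': X_label.var(axis=0),
        }
    return params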

Gaussian Naive Bayes in Scikit-learn

In [26]:
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [27]:
# Load the iris feature matrix (X) and target vector (y)
iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=6)

In [28]:
# Instantiate: create object
gnb = GaussianNB()

# Fit
gnb.fit(X_train, y_train)

# Predict
y_pred = gnb.predict(X_test)

# Accuracy
acc = accuracy_score(y_test, y_pred)
acc

Out[28]:
0.92105263157894735