Spam classifier example, error analysis, skewed data, precision, recall and large data sets.

1. Building a Spam Classifier

I would like to give full credit to the respective authors, as these are my personal Python notebooks taken from deep learning courses by Andrew Ng, Data School and Udemy :) This is a simple Python notebook hosted generously through GitHub Pages, from my main personal notes repository at https://github.com/ritchieng/ritchieng.github.io. These notes are meant for my personal review, but I have open-sourced the repository because many people found it useful.

1a. Prioritizing

• Let’s say you want to build a spam classifier
• How do you implement supervised learning?
• We can create the following
• x = features of email
• Choose 100 words indicative of spam or not spam
• In practice, look through the training set and choose the n most frequently occurring words, typically 10,000 to 50,000 (see the sketch after this list)
• y = spam (1) or non-spam (0)
• Example
• How should you spend your time so as to achieve low error?
• Collect lots of data
• Develop sophisticated features based on email routing information (from email header)
• Develop sophisticated features for message body
• Should ‘discount’ and ‘discounts’ be treated as the same word?
• How about ‘deal’ and ‘Dealer’?
• Punctuation?
• Develop a sophisticated algorithm to detect deliberate misspellings, e.g.
• med1cine
• w4tches
• m0rtgage
• Don’t base these decisions on gut feeling!
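To make the feature representation concrete, here is a minimal sketch (the tokenizer, vocabulary size, and sample emails are made up for illustration) that chooses the most frequent words from a training set and turns an email into a binary feature vector x, as suggested above:

    from collections import Counter
    import re

    def tokenize(text):
        # Deliberately simple: lowercase, keep runs of letters only
        return re.findall(r"[a-z]+", text.lower())

    def build_vocabulary(train_emails, n=10000):
        # Choose the n most frequently occurring words in the training set
        counts = Counter(w for email in train_emails for w in tokenize(email))
        return [word for word, _ in counts.most_common(n)]

    def email_to_features(email, vocabulary):
        # x_j = 1 if word j of the vocabulary appears in the email, else 0
        present = set(tokenize(email))
        return [1 if word in present else 0 for word in vocabulary]

    # Toy usage; a real system would use a vocabulary of 10,000 to 50,000 words
    train_emails = ["buy cheap deals now", "meeting agenda for monday", "cheap cheap deals"]
    vocab = build_vocabulary(train_emails, n=5)
    print(vocab, email_to_features("huge cheap deals inside", vocab))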

1b. Error Analysis

• Recommended Approach
• Start with a simple algorithm that you can quickly implement and test it on your cross validation data
• Plot learning curves to decide if more data, more features, etc. are likely to help
• Error analysis:
• Manually examine the examples (in cross validation set) that your algorithm made errors on
• See if you spot any systematic trend in what type of examples it is making errors on
• Don’t base anything off your gut feeling!
• Error Analysis Example
• m_cv = 500
• number of cross-validation examples
• Algorithm misclassified 100 emails
• Manually examine the 100 errors and categorize them based on
• What type of email they are
• Pharma: 12
• Replica: 4
• Phishing: 53
• Other: 31
• What cues (features) you think would have helped the algorithm classify them correctly
• Deliberate misspellings: 5
• m0rtgage
• med1cine
• With only 5 such emails, it is probably not worth building features to detect misspellings
• Unusual email routing: 16
• Unusual punctuation: 32
• With 32 such emails, it might be worthwhile to spend time developing sophisticated features around punctuation
• This is why we should start with a quick-and-dirty implementation: it surfaces the errors and shows where to focus our effort
• Importance of numerical evaluation
• Should discount, discounts, discounted, discounting etc. be treated as the same word?
• You can use stemming software such as the Porter Stemmer (see the sketch after this list)
• This lets you treat all those variations as the same word
• Software may mistake universe and university as the same word
• Error analysis may not be helpful for deciding if this is likely to improve performance
• The only way is to try it
• We need a numerical evaluation (cross-validation error) of the algorithm’s performance with and without stemming
• Without stemming: 5%
• With stemming: 3%
• This implies that it may be useful to implement stemming
• Distinguishing between upper and lower case: 3.2% error
• Why is the recommended approach to perform error analysis using the cross validation data instead of the test data?
• If we develop new features by examining the test set, we may end up choosing features that work well specifically for the test set, so J_test(θ) is no longer a good estimate of how well we generalize to new examples
• Do error analysis on cross validation set, do not use it on the test set!
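To see stemming in action, here is a minimal sketch using NLTK’s PorterStemmer (assuming the nltk package is installed; any Porter stemmer implementation behaves similarly). It shows both the benefit and the universe/university failure mode described above:

    from nltk.stem import PorterStemmer   # pip install nltk

    stemmer = PorterStemmer()

    # The benefit: inflected variants collapse into a single feature
    for word in ["discount", "discounts", "discounted", "discounting"]:
        print(word, "->", stemmer.stem(word))   # all map to "discount"

    # The failure mode: unrelated words can collapse too
    for word in ["universe", "university"]:
        print(word, "->", stemmer.stem(word))   # both map to "univers"

Individual cases like these cannot tell you whether stemming helps overall; that is exactly why the notes above rely on a single cross-validation error number (5% without stemming vs. 3% with).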

2. Handling Skewed Data

2a. Error Metrics for Skewed Classes

• Consider a problem where you want to find out if someone has cancer
• y = 1, cancer
• y = 0, no cancer
• You train a logistic regression model, hθ(x), and you find that you have 1% error on a test set
• 99% correct diagnosis
• But only 0.50% of patients have cancer
• This is a problem of skewed classes
• The following code would achieve 0.5% error, lower than your logistic regression model, yet it simply predicts y = 0 for everyone, exploiting the fact that only 0.5% of patients have cancer
  function y = predictCancer(x)
    y = 0;   % ignore x entirely; always predict "no cancer"
  return

• Let’s say you have
• 99.2% accuracy
• 0.8% error
• If you improve your algorithm to become 99.5% accuracy
• 0.5% error
• The improvement might come simply from predicting y = 0 more often, not from a genuinely better classifier
• Precision/Recall
• By calculating precision and recall, we get a better sense of how the algorithm is actually doing
• Precision = true positives / (true positives + false positives)
• Of all patients for whom we predicted y = 1 (cancer), what fraction actually has cancer?
• Recall = true positives / (true positives + false negatives)
• Of all patients who actually have cancer, what fraction did we correctly detect?
• If the classifier always predicts y = 0 (like predictCancer above), it has no true positives, so Recall = 0
• This exposes the classifier as useless despite its high accuracy (see the sketch below)
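A minimal sketch of these two metrics in plain Python (the patient labels below are toy data for illustration):

    def precision_recall(y_true, y_pred):
        # Counts are taken over the rare positive class (y = 1, "has cancer")
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        return precision, recall

    # Toy data: 200 patients, only 1 has cancer (0.5%, matching the skew above)
    y_true = [1] + [0] * 199
    always_zero = [0] * 200                       # the predictCancer strategy
    print(precision_recall(y_true, always_zero))  # (0.0, 0.0) despite 99.5% accuracy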

2b. Trading off Precision and Recall

• If we want to avoid false positives
• We want to be more confident before predicting cancer (y = 1)
• We can raise the prediction threshold: predict y = 1 only if hθ(x) ≥ 0.7, or even 0.9, instead of 0.5
• Result
• False positives: decrease
• True positives: decrease
• Recall: decreases
• Precision: increases
• If we want to avoid false negatives
• We want to avoid missing too many cases of cancer
• We can lower the prediction threshold: predict y = 1 if hθ(x) ≥ 0.3 instead of 0.5
• Result
• False negatives: decrease
• True positives: increase
• Recall: increases
• Precision: decreases
• The precision-recall curve can take many different shapes, depending on the details of the classifier
• How do we compare precision/recall numbers? Which pair is best?
• We can use an average
• At the extremes, neither classifier is good
• If we predict y = 1 all the time, the classifier is useless even though it has very high recall (its precision is terrible)
• A simple average (P + R) / 2 is not a good metric: such an extreme classifier can score a higher average than a better, more balanced classifier
• We should use the F score (F1 score)
• F1 Score = (2 * P * R) / (P + R)
• Remember to measure P and R on the cross-validation set and choose the threshold which maximizes the F1 score (see the sketch below)
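A minimal sketch of that selection procedure (the predicted probabilities and labels below are toy stand-ins for hθ(x) evaluated on a real cross-validation set):

    def f1(precision, recall):
        # F1 Score = (2 * P * R) / (P + R)
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    def precision_recall_at(probs, y_true, threshold):
        # Predict y = 1 whenever the model's output clears the threshold
        y_pred = [1 if p >= threshold else 0 for p in probs]
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        return precision, recall

    # Toy cross-validation outputs
    probs  = [0.10, 0.40, 0.35, 0.80, 0.65, 0.20, 0.90, 0.75]
    y_true = [0,    0,    1,    1,    1,    0,    1,    0   ]

    # Pick the threshold with the highest F1 score on the CV set
    best_f1, best_t = max((f1(*precision_recall_at(probs, y_true, t)), t)
                          for t in [0.3, 0.5, 0.7, 0.9])
    print("best threshold %.1f with F1 = %.2f" % (best_t, best_f1))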

3. Using Large Data Sets

• Under certain conditions, getting a lot of data and training a learning algorithm on it results in very good performance
• Designing a high accuracy learning system
• Classify between confusable words
• to, two, too
• then, than
• For breakfast, I ate two eggs
• Algorithms
• Perceptron (logistic regression)
• Winnow (less popular)
• Memory-based (less popular)
• Naive Bayes (popular)
• Algorithms give roughly similar performance
• With a larger training set, all the algorithms’ performance improves
• Often, it is not who has the best algorithm, but who has the most data
• Large data rationale
• Use a learning algorithm with many parameters, so it is expressive enough to have low bias
• Then a very large training set makes the algorithm unlikely to overfit, giving low variance
• In sum (see the sketch after this list)
• Low bias: use a complex algorithm
• Low variance: use a large training set
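A minimal sketch of this rationale, assuming scikit-learn is installed (the dataset is synthetic, and logistic regression and naive Bayes merely stand in for the four algorithms above); it prints test accuracy for both models as the training set grows:

    # Sketch: "it's not who has the best algorithm, but who has the most data"
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import GaussianNB
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    # Synthetic stand-in for a confusable-words dataset
    X, y = make_classification(n_samples=20000, n_features=50,
                               n_informative=30, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    for m in [100, 1000, 15000]:   # growing training set sizes
        for name, clf in [("logistic regression", LogisticRegression(max_iter=1000)),
                          ("naive Bayes", GaussianNB())]:
            clf.fit(X_train[:m], y_train[:m])
            acc = accuracy_score(y_test, clf.predict(X_test))
            print("m=%5d  %-20s test accuracy=%.3f" % (m, name, acc))

In runs like this, both classifiers typically improve as m grows, and which one is ahead at small m matters less than simply having more data, echoing the observation above.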