Friday, November 1, 2013

Project 3: Bayesian Spam Classification

TA: Pat

Goal: Give students an understanding of how classifiers work in general and of the probabilistic methods used to tackle problems like classifying email as spam or ham.

Implementation: Hopefully relying on the code from last year to serve the data, the students would have to extract features from the emails, write code to train and test the model, and then evaluate the results. Perhaps start off by having the students write their own simple classifier (e.g. if an email includes "viagra" it's spam, otherwise ham) so they understand the classification problem in general and see that multiple approaches exist; a sketch of such a rule follows.
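A minimal sketch of that baseline rule (the function name and trigger word are just illustrative, not part of the stencil):

```python
def rule_classify(email_text):
    """One-rule baseline: 'spam' if a trigger word appears, else 'ham'."""
    return "spam" if "viagra" in email_text.lower() else "ham"
```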

Timeline:

Out: Tuesday 3/11

In: ???


Project Outline
Files they need:
- stencil code

Files on our end:
- Enron email text w/ labels

Outline:

Overview:
- explain the classification problem (a sketch of the data split follows this list)
   - X and y
   - training
   - validation
   - test (our grading)
- explain the Enron dataset
- overview the 3 parts
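For the handout, one way to sketch the split (the fractions, names, and use of Python's random module are placeholders; in the real project the grading set would be held out on our end):

```python
import random

def split_data(emails, labels, train_frac=0.7, valid_frac=0.15, seed=0):
    """Shuffle labeled emails (X, y) and cut them into train / validation / test."""
    pairs = list(zip(emails, labels))
    random.Random(seed).shuffle(pairs)
    n_train = int(len(pairs) * train_frac)
    n_valid = int(len(pairs) * valid_frac)
    return (pairs[:n_train],                    # training set
            pairs[n_train:n_train + n_valid],   # validation set
            pairs[n_train + n_valid:])          # test set ("our grading")
```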


Part 1)
Feature Extraction
- features are words (should we touch on other potential features, e.g. sender or word count?)
- given the emails, create a dictionary of per-class word counts for easy computation of naive Bayes (sketch below)
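A rough cut at that dictionary, assuming the data arrives as (email_text, label) pairs with labels "spam"/"ham" (the tokenizer here is deliberately naive):

```python
from collections import Counter

def tokenize(email_text):
    """Lowercase and split on whitespace; students could refine this
    (strip punctuation, handle headers, etc.)."""
    return email_text.lower().split()

def count_words(train_pairs):
    """One Counter of word occurrences per class -- the table naive Bayes needs."""
    counts = {"spam": Counter(), "ham": Counter()}
    for text, label in train_pairs:
        counts[label].update(tokenize(text))
    return counts
```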


Part 2)
Classification in general
- there is no one perfect classifier
- try making your own
- one option: pick a word you expect to appear only in spam and use that as the rule
- report the results (an evaluation sketch follows this list)
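Reporting results could be as simple as an accuracy number on the validation split (reusing rule_classify from the sketch above; evaluate is a hypothetical helper, not from the stencil):

```python
def evaluate(classify, pairs):
    """Fraction of (email_text, label) pairs the classifier labels correctly."""
    correct = sum(1 for text, label in pairs if classify(text) == label)
    return correct / len(pairs)

# e.g.: print("rule accuracy:", evaluate(rule_classify, validation_pairs))
```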

Part 3)
Naive Bayes
- show them the formula again on the handout
- explain why the marginal doesn't matter
- emphasize that since the scores are only proportional to the posterior (prior times likelihood), they will not sum to 1
- (tell them to take whichever class has the larger posterior) - how much do we want them to figure out on their own?
   - more about how we want to make decisions, not just minimize error (ROC curves)
- report the 5 words most associated with each class - we did this in 142 but the results seemed poor; maybe clean the data to only include words that occur at least a few times (a sketch of both follows this list)
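A sketch of the classifier itself, assuming the count dictionaries from Part 1. It works in log space with add-one smoothing (our choice for the sketch, not something the stencil dictates), and the comments spell out why the marginal cancels and why the scores don't sum to 1:

```python
import math

def train_naive_bayes(counts, num_spam, num_ham):
    """Return a classifier built from per-class word Counters (see Part 1).

    We compare unnormalized log posteriors:
        log P(c) + sum over words w of log P(w | c)
    Bayes' rule divides by P(words), but that marginal is the same for
    both classes, so it cancels in the argmax -- which is also why the
    two scores are only proportional to the posteriors and won't sum to 1.
    """
    vocab = set(counts["spam"]) | set(counts["ham"])
    total = {c: sum(counts[c].values()) for c in counts}
    prior = {"spam": num_spam / (num_spam + num_ham),
             "ham": num_ham / (num_spam + num_ham)}

    def log_likelihood(word, c):
        # Add-one (Laplace) smoothing so an unseen word can't zero out a class.
        return math.log((counts[c][word] + 1) / (total[c] + len(vocab)))

    def classify(email_text):
        words = [w for w in email_text.lower().split() if w in vocab]
        scores = {c: math.log(prior[c]) + sum(log_likelihood(w, c) for w in words)
                  for c in ("spam", "ham")}
        return max(scores, key=scores.get)  # take the larger posterior

    return classify

def top_words(counts, c, other, k=5, min_count=5):
    """The k words most associated with class c, by smoothed count ratio.
    Dropping words seen fewer than min_count times is the data-cleaning
    suggestion above."""
    candidates = [w for w in counts[c] if counts[c][w] >= min_count]
    ratio = lambda w: (counts[c][w] + 1) / (counts[other][w] + 1)
    return sorted(candidates, key=ratio, reverse=True)[:k]
```

Usage would look like classify = train_naive_bayes(counts, num_spam, num_ham), then top_words(counts, "spam", "ham") for the 5-word report.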



Should we include this? We'd need to test it with the Enron emails first:

Part 4)
Different features
- use a bigram model
- explain what this is in NLP
- explain the start/stop padding symbols
- otherwise the naive Bayes should work the same way (sketch below)
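If we do include it, only the feature extractor needs to change; a sketch with placeholder padding symbols ("<s>" and "</s>" are just illustrative):

```python
def bigram_features(email_text):
    """Adjacent word pairs, padded with start/stop markers. Feed these to
    the same counting and naive Bayes code in place of single words."""
    tokens = ["<s>"] + email_text.lower().split() + ["</s>"]
    return list(zip(tokens, tokens[1:]))

# e.g. bigram_features("buy viagra now")
#   -> [('<s>', 'buy'), ('buy', 'viagra'), ('viagra', 'now'), ('now', '</s>')]
```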

