Nearest Neighbor Spam Classification - due Feb 21

In this homework you will be building a nearest neighbor spam classification system for predicting whether an email is spam or ham, described in lecture on Jan 31 (slides).

Data - Pre-processed spam and ham email data: emails.tar.gz.
To open this file under mac or linux, use the command: "tar zxvf emails.tar.gz".

To perform classification, we split the data into 4 parts: spam training documents (listed in spamtraining.txt), ham training documents (listed in hamtraining.txt), spam testing documents (listed in spamtesting.txt), and ham testing documents (listed in hamtesting.txt). The spam/ham training documents will form your labeled training set (the documents whose label you know) and the spam/ham testing documents will form your testing set (the documents whose label you need to predict).

Computing Features (20 points)

As features for spam/ham classification we will use a lexicon of words. Compute a lexicon consisting of the set of unique words occurring more than k times in the entire document collection (k is a number you should set experimentally by looking at the lexicon produced for varying values of k=1,2,5,10,...).

Process Training Data (20 points)

Classifying Test Documents (20 points)

For each test document:

Measuring Performance (10 points)

Compute the overall accuracy -- how many documents in the testing set were classified correctly? Compute the spam accuracy -- for all of the documents in the spam testing set, how many were classified as spam? Compute ham accuracy -- for all of the documents in the ham testing set, how many were classified as ham?

Feature Design (20 points)

Come up with some additional features that might be useful for spam/ham classification (note you don't have to implement these, just describe them and why they might be useful for spam/ham classification).

Write-Up (10 points)

Provide a homework web page or pdf document, including: Email your webpage/document and commented code to cseise364@gmail.com.