Nearest Neighbor Spam Classification - due Feb 21
In this homework you will be building a nearest neighbor spam classification
system for predicting whether an email is spam or ham, described in lecture on
Jan 31 (slides).
Data - Pre-processed spam and ham email data: emails.tar.gz.
To open this file under mac or linux, use the command: "tar zxvf emails.tar.gz".
To perform classification, we split the data into 4 parts: spam training
documents (listed in spamtraining.txt),
ham training documents (listed in hamtraining.txt),
spam testing documents (listed in spamtesting.txt),
and ham testing documents (listed in hamtesting.txt).
The spam/ham training documents will form your labeled training set (the
documents whose label you know) and the spam/ham testing documents will form
your testing set (the documents whose label you need to predict).
Computing Features (20 points)
As features for spam/ham classification we will use a lexicon of words. Compute
a lexicon consisting of the set of unique words occurring more than k times in
the entire document collection (k is a number you should set experimentally by
looking at the lexicon produced for varying values of k=1,2,5,10,...).
Process Training Data (20 points)
- Compute a training document matrix, M, where each column is the word vector
representation for a document (M should have size LxD where L is the number of
words in your lexicon and D is the number of documents in the training set).
The word vector representation for a document denotes how many times each
lexicon word appears in that document.
- Create a training document label vector, L, where each element in
this vector is 1 (spam) or 0 (ham).
Classifying Test Documents (20 points)
For each test document:
- Compute its word vector representation, T (this should be a column vector)
- Compare T to each training document word vector in M using
the SSD (sum of squared distances) measure.
- Find the training document that's nearest to the test document (smallest
SSD) and transfer its label (from L) to the test document. This is your
predicted classification for the test document.
Measuring Performance (10 points)
Compute the overall accuracy -- how many documents in the testing set were
classified correctly? Compute the spam accuracy -- for all of the documents in
the spam testing set, how many were classified as spam? Compute ham accuracy -- for all of the
documents in the ham testing set, how many were classified as ham?
Feature Design (20 points)
Come up with some additional features that might be useful for spam/ham
classification (note you don't have to implement these, just describe them and why they might be useful
for spam/ham classification).
Write-Up (10 points)
Provide a homework web page or pdf document, including:
- Description of your nearest neighbor classifier implementation.
- Description of your results -- including overall accuracy,
accuracy at classifying spam emails, accuracy at classifying ham emails
and some example emails where your algorithm did well and where it
failed (and why).
- Descriptions of your proposed additional features and why you selected these features.
Email your webpage/document and commented code to cseise364@gmail.com.