Update -- parameters for the classify function have been updated (Tues, Feb 22). Please download the modified code for classify.m. Update 2 -- email your homework write-up to firstname.lastname@example.org
HW1 Spam Classification - due Feb 24 - deadline extended to Feb 27
In this homework you will design and build a nearest neighbor spam classification
system for predicting whether an email is spam or ham, as described in lecture on Feb 8/10.
Part 1 - Designing Good Features (30 pts)
Come up with and describe a set of features you will use to classify an email
as spam or ham. Features can include, for example: words, phrases, special
characters, and meta-data about the from/to/subject fields. Credit will be given
for the completeness, usefulness, and creativity of your feature set. Remember that
you will have to implement these in Part 2, so make sure you will be able to extract
these features from text.
Part 2 - Extracting Features from Text (30 pts)
Fill in the implementation for extract.m to extract the features
you designed in Part 1 from text. This function takes as input the name of a file
and returns a document vector containing the value of each feature.
For example, if your features are: [frequency of "the monkey", frequency of "frog",
frequency of "%"] and the input file contains 4 occurrences of "the monkey", 3
occurrences of "frog", and 0 occurrences of "%", then the function should return
the following document vector: [4 3 0].
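The counting logic can be sketched as follows (shown in Python for illustration only; your actual implementation goes in extract.m, and the phrase list here is just the hypothetical example above):

```python
def extract(filename, phrases):
    """Count occurrences of each feature phrase in a file,
    returning one count per feature (the document vector)."""
    with open(filename, "r", errors="ignore") as f:
        text = f.read().lower()
    # str.count counts non-overlapping occurrences of each phrase
    return [text.count(p.lower()) for p in phrases]
```

A richer feature set (special characters, header meta-data) would need more than plain substring counts, but the shape of the output stays the same: one number per feature.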
Part 3 - Spam Classification Using Nearest Neighbors (30 pts)
Fill in the implementation for classify.m, a nearest
neighbor classifier. This function takes as input a matrix containing document
vectors from the training set, a vector containing the labels for the training
set (1 for spam, 0 for ham), and a document vector for a test email. It returns
a label for the test email (1 for spam, 0 for ham).
Reminder: Nearest Neighbor Algorithm - For a test email, find its nearest
neighbor in the training email set (using the SSD measure of distance). Predict
the label for the test email (spam or ham) to be the label of that nearest
neighbor.
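The reminder above can be sketched in a few lines (Python for illustration only; the real implementation belongs in classify.m):

```python
def classify(train_vectors, train_labels, test_vector):
    """1-nearest-neighbor: return the label of the training vector
    with the smallest sum-of-squared-differences (SSD) distance."""
    best_label, best_dist = None, float("inf")
    for vec, label in zip(train_vectors, train_labels):
        dist = sum((a - b) ** 2 for a, b in zip(vec, test_vector))
        if dist < best_dist:
            best_dist, best_label = dist, label
    return best_label
```

Note that SSD and Euclidean distance rank neighbors identically, so there is no need to take a square root.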
Also fill in the implementation for nearestNeighbor.m,
which classifies all of the test emails as spam or ham and calculates overall
classification accuracy, accuracy at classifying spam emails, and accuracy at
classifying ham emails. This function takes as input:
spamtraining.txt (file containing names of training spam emails),
hamtraining.txt (file containing names of training ham emails),
spamtesting.txt (file containing names of testing spam emails),
hamtesting.txt (file containing names of testing ham emails),
It should return the overall accuracy, the accuracy of spam predictions, and the accuracy of ham predictions on the *testing*
emails. Note that this function should call your implementations of extract (on both training and
test emails) and classify (on test emails).
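The accuracy bookkeeping in nearestNeighbor.m could look like the following (a Python sketch, not the required MATLAB; it assumes the document vectors have already been extracted and that a classify function behaving as described above is passed in):

```python
def evaluate(train_vecs, train_labels, spam_test_vecs, ham_test_vecs, classify):
    """Classify each test vector and report
    (overall accuracy, spam accuracy, ham accuracy)."""
    # A spam test email is classified correctly when classify returns 1
    spam_correct = sum(classify(train_vecs, train_labels, v) == 1
                       for v in spam_test_vecs)
    # A ham test email is classified correctly when classify returns 0
    ham_correct = sum(classify(train_vecs, train_labels, v) == 0
                      for v in ham_test_vecs)
    total = len(spam_test_vecs) + len(ham_test_vecs)
    return ((spam_correct + ham_correct) / total,
            spam_correct / len(spam_test_vecs),
            ham_correct / len(ham_test_vecs))
```

Reporting spam and ham accuracy separately matters: a classifier that labels everything ham can still score well overall if the test set is mostly ham.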
Data - Pre-processed spam and ham email data: emails.tar.gz.
To open this file under mac or linux, use the command: "tar zxvf emails.tar.gz".
Part 4 - Write-Up (10 pts)
Provide a homework web page or pdf document, including:
- Descriptions of your features and why you selected them
- Description of your nearest neighbor classifier implementation and a link to your code
- Description of your results, including overall accuracy, accuracy at classifying spam emails, accuracy at classifying ham emails, and some example emails where your algorithm did well and where it failed (and why)
Email your webpage/document and commented code to email@example.com.
Extra Credit
1. Use raw spam/ham emails instead of pre-processed data (10 pts) - located here.
2. Divide your training set into 70 training emails and 30 held-out emails. Use the held-out set to predict a good value for k in your k-nearest neighbor classifier (10 pts).
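For extra credit item 2, the model-selection loop can be sketched as follows (Python for illustration only; it assumes a k-nearest-neighbor variant of classify that takes a majority vote among the k closest training vectors, which is one reasonable reading of the assignment):

```python
def knn_classify(train_vecs, train_labels, test_vec, k):
    """k-nearest-neighbor by SSD distance with majority vote."""
    dists = sorted((sum((a - b) ** 2 for a, b in zip(v, test_vec)), lbl)
                   for v, lbl in zip(train_vecs, train_labels))
    votes = [lbl for _, lbl in dists[:k]]
    # Predict spam (1) when more than half of the k neighbors are spam
    return 1 if sum(votes) * 2 > len(votes) else 0

def pick_k(train_vecs, train_labels, held_vecs, held_labels,
           candidates=(1, 3, 5)):
    """Return the candidate k with the highest held-out accuracy."""
    def acc(k):
        return sum(knn_classify(train_vecs, train_labels, v, k) == lbl
                   for v, lbl in zip(held_vecs, held_labels)) / len(held_vecs)
    return max(candidates, key=acc)
```

Odd values of k avoid ties in the vote, and using a held-out set (rather than the test set) to choose k keeps the reported test accuracy honest.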