Update -- parameters for the classify function have been updated (Tues, Feb 22). Please download the modified code for classify.m. Update 2 -- email your homework write-up to cseise364@gmail.com

HW1 Spam Classification - due Feb 24 - deadline extended to Feb 27

In this homework you will design and build a nearest neighbor classifier that predicts whether an email is spam or ham, as described in lecture on Feb 8/10 (slides).

Part 1 - Designing Good Features (30 pts)

Come up with and describe a set of features you will use to classify an email as spam or ham. Features can be, for example, words, phrases, special characters, or meta-data from the from/to/subject fields. Credit will be given for completeness, usefulness, and creativity of your feature set. Remember that you will have to implement these in Part 2, so make sure you will be able to extract these features from text.

Part 2 - Extracting Features from text (30 pts)

Fill in the implementation for extract.m to extract the features you designed in Part 1 from text. This function takes as input the name of a file and returns a document vector containing the value of each feature. For example, if your features are [frequency of "the monkey", frequency of "frog", frequency of "%"] and the input file contains 4 occurrences of "the monkey", 3 occurrences of "frog", and 0 occurrences of "%", then the function should return the document vector [4 3 0].
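The counting step in the example above can be sketched as follows. This is only an illustration in Python, not the required extract.m; the names extract_features and extract_from_file are hypothetical, and the choice to match case-insensitively is an assumption. Note that plain substring counting would also count "frog" inside "frogs".

```python
def extract_features(text, features):
    """Return a document vector: the count of each feature string in text."""
    lowered = text.lower()  # assumption: matching ignores case
    # str.count returns the number of non-overlapping occurrences
    return [lowered.count(feature.lower()) for feature in features]


def extract_from_file(filename, features):
    """Read an email file and return its document vector."""
    # errors="replace" keeps unusual bytes in an email from raising an error
    with open(filename, encoding="utf-8", errors="replace") as handle:
        return extract_features(handle.read(), features)


FEATURES = ["the monkey", "frog", "%"]  # the example features from the text
sample = "The monkey saw the monkey. Frog, frog, frog! the monkey and the monkey."
print(extract_features(sample, FEATURES))  # prints [4, 3, 0]
```

Each position in the returned list lines up with one feature, so every email yields a vector of the same length.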

Part 3 - Spam Classification Using Nearest Neighbors (30 pts)

Fill in the implementation for classify.m, a nearest neighbor classifier. This function takes as input a matrix containing document vectors from the training set, a vector containing the labels for the training set (1 for spam, 0 for ham), and a document vector for a test email. It returns a label for the test email (1 for spam, 0 for ham).

Reminder: Nearest Neighbor Algorithm - For a test email, find its nearest neighbor in the training email set, using the SSD (sum of squared differences) measure of distance. Predict the label for the test email (spam or ham) to be the label of that nearest neighbor.
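A minimal sketch of this rule, written in Python for illustration rather than as the required classify.m. The function names and the toy training data are hypothetical, and breaking ties in favor of the earliest training email is an assumption.

```python
def ssd(a, b):
    """Sum of squared differences between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def classify(train_vectors, train_labels, test_vector):
    """Return the label of the training vector closest to test_vector."""
    distances = [ssd(row, test_vector) for row in train_vectors]
    nearest = distances.index(min(distances))  # ties go to the earliest row
    return train_labels[nearest]


# Toy data: three training emails with three features each
train_vectors = [[4, 3, 0], [0, 0, 5], [1, 0, 4]]
train_labels = [0, 1, 1]  # 1 for spam, 0 for ham
print(classify(train_vectors, train_labels, [0, 1, 6]))  # prints 1
```

For the test vector [0, 1, 6] the distances are 56, 2, and 6, so the second training email is the nearest neighbor and its label (spam) is returned.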

Fill in the implementation for nearestNeighbor.m, which classifies all of the test emails as spam or ham, and calculates overall classification accuracy, accuracy at classifying spam emails, and accuracy at classifying ham emails. This function takes as input: spamtraining.txt (file containing names of training spam emails), hamtraining.txt (file containing names of training ham emails), spamtesting.txt (file containing names of testing spam emails), and hamtesting.txt (file containing names of testing ham emails). It should return the overall accuracy, accuracy of ham predictions, and accuracy of spam predictions on the *testing* emails. Note that this function should call your implementations of extract (on training and test emails) and classify (on test emails).
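The three accuracy figures can be computed from the true and predicted labels of the test emails, as in the Python sketch below. This is an illustration only, not the required nearestNeighbor.m; the function name accuracies and the toy labels are hypothetical, and reporting accuracy as a fraction between 0 and 1 is an assumption.

```python
def accuracies(true_labels, predicted_labels):
    """Return (overall, ham, spam) accuracy as fractions between 0 and 1."""
    def fraction_correct(label=None):
        # Keep every pair, or only the pairs whose true label matches
        pairs = [(t, p) for t, p in zip(true_labels, predicted_labels)
                 if label is None or t == label]
        if not pairs:
            return float("nan")  # no emails of this class to score
        return sum(t == p for t, p in pairs) / len(pairs)

    return fraction_correct(), fraction_correct(0), fraction_correct(1)


# Toy labels: 1 for spam, 0 for ham
true_labels = [1, 1, 1, 0, 0]
predicted_labels = [1, 0, 1, 0, 0]
overall, ham, spam = accuracies(true_labels, predicted_labels)
print(overall, ham, spam)
```

Here four of the five emails are labeled correctly, both ham emails are correct, and two of the three spam emails are correct.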

Data - Pre-processed spam and ham email data: emails.tar.gz.

To open this file under Mac or Linux, use the command: "tar zxvf emails.tar.gz".

Part 4 - Write-Up (10 pts)

Provide a homework web page or pdf document, including: Email your webpage/document and commented code to cseise364@gmail.com.

Extra Credit

1. Use raw spam/ham emails instead of pre-processed data (10 pts) - located here.

2. Divide your training set into 70 training emails and 30 held-out emails. Use the held-out set to choose a good value for k in your k-nearest neighbor classifier (10 pts).
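One way to use a held-out set is to try several values of k and keep the one with the highest held-out accuracy. The Python sketch below illustrates that idea under stated assumptions: the function names, the candidate values of k, and the toy one-feature data are all hypothetical, neighbors vote by simple majority, and ties in accuracy go to the smallest k.

```python
from collections import Counter


def ssd(a, b):
    """Sum of squared differences between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_classify(train_vectors, train_labels, test_vector, k):
    """Return the majority label among the k nearest training vectors."""
    order = sorted(range(len(train_vectors)),
                   key=lambda i: ssd(train_vectors[i], test_vector))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def choose_k(train_vectors, train_labels, held_vectors, held_labels, candidates):
    """Return (best_k, accuracy) measured on the held-out emails."""
    best_k, best_accuracy = None, -1.0
    for k in candidates:
        correct = sum(knn_classify(train_vectors, train_labels, v, k) == y
                      for v, y in zip(held_vectors, held_labels))
        accuracy = correct / len(held_labels)
        if accuracy > best_accuracy:  # strict, so the smallest k wins ties
            best_k, best_accuracy = k, accuracy
    return best_k, best_accuracy


# Toy data: 1 for spam, 0 for ham; [3] is a spam email that looks like ham
train_vectors = [[0], [2], [4], [3], [20], [22], [24]]
train_labels = [0, 0, 0, 1, 1, 1, 1]
held_vectors = [[3], [21]]
held_labels = [0, 1]
print(choose_k(train_vectors, train_labels, held_vectors, held_labels, [1, 3, 5]))
```

With k = 1 the held-out ham email is matched to the unusual spam email and misclassified; with k = 3 its other neighbors outvote that email. Odd values of k avoid tied votes between the two labels.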