Assignment 1 (100 pts)

Due Feb 19 - *Due date extended to March 3*

Helpful String Processing Hints

You can use cell arrays to store strings. For example, strarr{1} = 'hello'; strarr{2} = 'world'; strarr{3} = 'hello'; strarr{end+1} = 'goodbye'; will store strings in an array called strarr. You can then access the last element by str = strarr{end}; which will set the variable str to 'goodbye'


Here are some functions you might find useful and should look them up in matlab using 'help':

strfind - find one string within another - this would be useful for looking for occurences of strings like a href= or even characters like " or #
strtok - tokenize a string according to a delimiter
strcmp - compare two strings or to a string to a cell array of strings. For example if you have defined the variable strarr as above then the command inds = find(strcmp('hello',strcmp)); will return [1 3] the indices of the cell array that are equal to 'hello'
strrep - will replace all occurences of a string with another string
zeros - to create a matrix of size m by n use zeros(m,n)
find - find indices of nonzero elements.
sum - can act on matrices to add up over columns, a = sum(M) then a(1) will be the sum of column 1 of M, a(2) will be the sum of column 2 of M.
sort(w) - sort values in assending order. To do descending use sort(-w)
repmat - replicate and tile an array.
* - multiplication works on numbers, vectors, and matrices.

------------------------------------------------------------------------------------------------------------------

In this assignment you will be implementing the PageRank algorithm described in The Anatomy of a Search Engine" and explained in detail in Lecture 2.

There will be 2 parts to this assignment:

1. Extracting hyperlinks from the input data (70 pts)

The file allpages.txt contains a list of URLs, one per line. Each URL identifies an html page on the SBU Computer Science website. All html files can be found in this file: sunysb.tar.gz. To extract the web pages from this file use the command: tar zxvf sunysb.tar.gz and it will untar into a directory www.cs.sunysb.edu/ with all html files nested in this directory structure.

You will need to write code that reads the html pages. For each page, extract each hyperlink out from the page. Ignore links to pages that are not in the test data set. For simplicity you may assume that all links we care about are contained within double quotes, ie links that look like

<a href="whatever"...

*See the file URLhints.html for hints on extracting hyperlinks from these web pages*

Use the links to pages that are in the test data set to build the link matrix (T) as defined in Lecture 2. For dangling pages (those with no outgoing links) make all transition probabilities uniform (ie set all the transition probabilities from dangling pages to any other page in the test set to 1/n where n is the total number of pages in the test set).

2. PageRank (30 pts)

Implement the PageRank algorithm with damping to calculate a ranking of the pages. Use a damping factor of 0.85.

What to upload to blackboard:

A file called README describing all files you have uploaded and how to run your code. Your code - this should include (though not necessarily in separate files) code to extract links from web pages, code to build the transition matrix T, and code to compute page rank. Please use comments in your code to denote where these pieces are located. A file called results.mat containing the transition matrix (T) you computed and the final pagerank of each webpage (w). A webpage called rankedresults.html showing links to the test web pages as ranked by your pagerank algorithm (these links should point to the online versions of each of these pages).

Feel free to discuss the assignment with others in the class, but please write the code on your own.

Extra Credit

Download your own set of web pages using wget or any of the freely available web crawlers and run your PageRank algorithm on these pages. Turn in a webpage to blackboard called extraresults.html showing these pages in ranked order. (10 pts)

Implement an inverted index for the words in these pages as described in The Anatomy of a Search Engine". Show the ranked results for various queries using query word matching in your inverted index and pagerank. Turn in webpages called query1.html, query2.html ... etc. Make sure to include the query string at the top of each web page. (10 pts)

Implement some of the other relevance measures described in The Anatomy of a Search Engine" or measures of your own devising. Show the results of combining these relevance features with pagerank for various queries. Turn in webpages called morequery1.html, morequery2.html, ... etc. Make sure to include the query string at the top of each web page (10 pts).