Note my server hard drive crashed, so this is a reconstruction of the original Assignment1.html page to the best of my recollection.
In this assignment you will be implementing the PageRank algorithm described in The Anatomy of a Search Engine" and explained in detail in Lecture 2.
There will be 2 parts to this assignment:
1. PageRank (30 pts)
Implement the PageRank algorithm with damping as a function w =
PageRank(B,d) that takes as input a damping factor, d, and normalized link
matrix, B, and returns the PageRank of each page in a vector w. Your code for
this should be implemented in a file called PageRank.m. I suggest testing your
code on the examples from Lecture 2.
2. Building the link matrix and running on real data! (70 pts)
Implement a function B = BuildMatrix(basedir, filename), which processes a set of web pages and builds a normalized link matrix B from their links. Input to this function is a variable called 'basedir', which is the base directory for where the set of pages to be processed are stored, and 'filename' which is the name of a file listing the set of pages to be processed. Your function should return B, the normalized link matrix and should be implemented in a file called BuildMatrix.m.
You will need to write code that reads and processes a given set of html pages to build the normalized link matrix B. For each page in the set, extract each hyperlink from the page and increment the proper element of B. Ignore links to pages that are not in the set. For simplicity you may assume that all links we care about are contained within double quotes, ie links that look like
*See the file URLhints.html for hints on extracting hyperlinks from these web pages* For dangling pages -- those with no outgoing links, you should set the normalized transition probability to all pages to 1/n where n is the number of pages in the set.
I have provided a set of test web pages here: sunysb.tar.gz (to extract web pages from this file on a mac/linux machine use the command: tar zxvf sunysb.tar.gz to untar into a directory www.cs.sunysb.edu). I also provide a file allpages.txt which contains a list of pages/urls to be processed.
Run your code on the test dataset to first build B and then calculate the
PageRank of these pages, w. Given the PageRank of these pages, create an html
page called results.html which displays the urls of the pages in PageRank
order.
Download your own set of web pages using wget or any of the freely available web crawlers and run your PageRank algorithm on these pages. Turn in a webpage to blackboard called extraresults.html showing these pages in ranked order. (10 pts)
Implement an inverted index for all of the words in these pages as described in The Anatomy of a Search Engine". Show the ranked results for various queries using query word matching in your inverted index and pagerank. Turn in webpages called query1.html, query2.html ... etc. Make sure to include the query string at the top of the web page. (10 pts)
Implement some of the other relevance measures described in The Anatomy of a Search Engine" or measures of your own devising. Show the results of combining these relevance features with pagerank for various queries. Turn in webpages called morequery1.html, morequery2.html, ... etc. Make sure to include the query string at the top of the web page (10 pts).