You can use cell arrays to store strings. For example, strarr{1} = 'hello'; strarr{2} = 'world'; strarr{3} = 'hello'; strarr{end+1} = 'goodbye'; will store strings in an array called strarr. You can then access the last element by str = strarr{end}; which will set the variable str to 'goodbye'
Here are some functions you might find useful and should look them up in matlab using 'help':
strfind - find one string within another - this would be useful for looking for occurences of strings like a href= or even characters like " or #
strtok - tokenize a string according to a delimiter
strcmp - compare two strings or to a string to a cell array of strings. For example if you have defined the variable strarr as above then the command inds = find(strcmp('hello',strcmp)); will return [1 3] the indices of the cell array that are equal to 'hello'
strrep - will replace all occurences of a string with another string
zeros - to create a matrix of size m by n use zeros(m,n)
find - find indices of nonzero elements.
sum - can act on matrices to add up over columns, a = sum(M) then a(1) will be the sum of column 1 of M, a(2) will be the sum of column 2 of M.
sort(w) - sort values in assending order. To do descending use sort(-w)
repmat - replicate and tile an array.
* - multiplication works on numbers, vectors, and matrices.
------------------------------------------------------------------------------------------------------------------
In this assignment you will be implementing the PageRank algorithm described in The Anatomy of a Search Engine" and explained in detail in Lecture 2.
There will be 2 parts to this assignment:
1. Extracting hyperlinks from the input data (70 pts)
The file allpages.txt contains a list of URLs, one per line. Each URL identifies an html page on the SBU Computer Science website. All html files can be found in this file: sunysb.tar.gz. To extract the web pages from this file use the command: tar zxvf sunysb.tar.gz and it will untar into a directory www.cs.sunysb.edu/ with all html files nested in this directory structure.
You will need to write code that reads the html pages. For each page, extract each hyperlink out from the page. Ignore links to pages that are not in the test data set. For simplicity you may assume that all links we care about are contained within double quotes, ie links that look like
*See the file URLhints.html for hints on extracting hyperlinks from these web pages*
Use the links to pages that are in the test data set to build the link matrix (T) as defined in Lecture 2. For dangling pages (those with no outgoing links) make all transition probabilities uniform (ie set all the transition probabilities from dangling pages to any other page in the test set to 1/n where n is the total number of pages in the test set).
2. PageRank (30 pts)
Implement the PageRank algorithm with damping to calculate a ranking of the pages. Use a damping factor of 0.85.
Download your own set of web pages using wget or any of the freely available web crawlers and run your PageRank algorithm on these pages. Turn in a webpage to blackboard called extraresults.html showing these pages in ranked order. (10 pts)
Implement an inverted index for the words in these pages as described in The Anatomy of a Search Engine". Show the ranked results for various queries using query word matching in your inverted index and pagerank. Turn in webpages called query1.html, query2.html ... etc. Make sure to include the query string at the top of each web page. (10 pts)
Implement some of the other relevance measures described in The Anatomy of a Search Engine" or measures of your own devising. Show the results of combining these relevance features with pagerank for various queries. Turn in webpages called morequery1.html, morequery2.html, ... etc. Make sure to include the query string at the top of each web page (10 pts).