CS6200/IS4200: Information Retrieval

Homework 2

Return to basic course information.

Assigned: Friday, 27 September 2019
Due: Friday, 11 October 2019, 11:59 p.m.


In this assignment, you will compute PageRank on a collection of 469,235 web sites.

Consider the version of PageRank described in class. PageRank can be computed iteratively as shown in the following pseudocode:

// P is the set of all pages; |P| = N
// S is the set of sink nodes, i.e., pages that have no out links
// M(p) is the set (without duplicates) of pages that link to page p
// L(q) is the number of out-links (without duplicates) from page q
// d is the PageRank damping/teleportation factor; use d = 0.85 as a fairly typical value

foreach page p in P
  PR(p) = 1/N                          /* initial value */

while PageRank has not converged do
  sinkPR = 0
  foreach page p in S                  /* calculate total sink PR */
    sinkPR += PR(p)
  foreach page p in P
    newPR(p) = (1-d)/N                 /* teleportation */
    newPR(p) += d*sinkPR/N             /* spread remaining sink PR evenly */
    foreach page q in M(p)             /* pages pointing to p */
      newPR(p) += d*PR(q)/L(q)         /* add share of PageRank from in-links */
  foreach page p
    PR(p) = newPR(p)

return PR

In order to facilitate the computation of PageRank using the above pseudocode, we would normally preprocess a document collection into a link graph, i.e., a set of records recording when document p links to document q.

Consider the following directed graph:


We can represent this graph as a collection of nodes, here, ordered pairs of node index and node name:

  0 A
  1 B
  2 C
  3 D
  4 E
  5 F
and a collection of directed links, i.e., ordered pairs from source to target:
  0 1
  0 2
  0 5
  1 2
  1 3
  1 4
  1 5
  2 3
  2 4
  3 0
  3 2
  3 4
  3 5
  4 0
  5 0
  5 1
  5 4
We use integer identifiers for the nodes for efficiency. Note that, unlike this example, in a real web graph, not every page will have in-links, nor will every page have out-links.


Final requests

In addition to the written items mentioned above, you should hand in a copy of your source code, which should hopefully be relatively short, and instructions on (compiling and) running it. The only input to your code should be the files in the vertex and edge formats described above. The output should be a list of node IDs, website names, and their PageRank values.