Homework 5. Assigned: Thu Mar 22. Due: Mon Apr 2 at 10 p.m.

Write a Matlab function that does the following:
- The input is:
  - an adjacency matrix M, defining the links in a miniature web of n nodes, with M(i,j)=1 if there is a link from page j to page i, and 0 otherwise. This is a sparse matrix using Matlab's built-in sparse matrix system.
  - a scalar alpha whose purpose is explained on the top of p.11 of Austin's paper.
  - a scalar tol: this is a tolerance for terminating the power method
  - an integer maxiter: this is for terminating the power method in case the tolerance is never satisfied
  You can get n from length(M), so it doesn't need to be an input parameter.
- Compute the matrix H (see Austin p.3) , which is the same as M except the nonzero columns are scaled to add up to 1. If M has any columns that are all 0, indicating a page with no outgoing links, the corresponding column of H should also be 0 (see Austin p.2).
- Run the power method to approximately compute the eigenvector of the Google matrix G corresponding to the eigenvalue 1. The Google matrix is defined on Austin p.11; it involves S, which is the same as H except for the addition of A (which consists of columns of 1/n's replacing the 0 columns (see Austin p.6)), as well as E, which is my name for what Austin calls 1, namely the matrix with every entry equal to 1. Important: do not compute the matrices A, S, E or G! This would be very inefficient for large n since A, S have "dense columns" (the ones that are all 1/n's ), while E and G are completely dense. Instead, implement the power method for G using the simple iteration v^k+1 = G v^k, using the equation at the bottom of Austin p.11 to see how to implement this efficiently. Make sure you read and follow the instructions in the final paragraph on p.11. Note that v^k is my name for Austin's I^k, the "importance vector", which should converge to the page rank if all is well.
- Set v⁰ to the vector of all 1/n's. Terminate the power method when the eigenvector residual Gv^k - v^k has norm &le tol or maxiter iterations is reached, whichever is first.
- Avoid for loops except the one needed to control the power iteration. The built-in function sum should be useful. The main work at each step of the power method is the sparse matrix product H*v. Besides this, whose cost depends on the sparsity of H, how many other operations are required inside this for loop? Answer this, in terms of n, in the comments.
- Return the following output arguments:
  - the final v (the approximate eigenvector, that is the approximate page rank vector)
  - the final eigenvector residual Gv - v
  - the number of iterations required (the final value k)
Write another function whose input is the page rank vector and a vector of strings called url that specifies the corresponding URLs. Use the built-in function sort to sort the page rank vector and print the corresponding URLs in order from highest to lowest page rank. The second output of sort returns an index vector that can be used to order the URLs.
Run your program on the adjacency matrices M and URLs url defined in the following Matlab data files (access the data with "load filename"):
- nyu20.mat: 20 web pages accessible from www.nyu.edu
- nyu200.mat: 200 web pages accessible from www.nyu.edu
- rand2000.mat: a randomly generated sparse 2000 by 2000 adjacency matrix: there is no url vector for this
- your own web data, that you can acquire by running the web crawler surfer.m initialized at your favorite web page. However, don't try to set the second argument too large. When I tried setting this to 1000 for www.nyu.edu, the crawl broke down at step 229, and I had to terminate it.
Use tol = 1e-6 and maxit = 1000 in all cases, and run the code for values of alpha equal to 0.95, 0.85, 0.75 and 0.65. For the NYU data sets, print the URLs and their page ranks in sorted order from highest to lowest page rank. Do these vary much for different values of alpha? For all the datasets, plot the number of iterations required against alpha using plot. Do these vary much with alpha? If the number of iterations reaches 1000 iterations without the tolerance being satisfied, there is probably a bug in your program.
As we discussed in class, the rate of convergence depends on the second largest of the absolute values of the eigenvalues of the Google matrix: the closer this is to 1, the slower the convergence. Modifying the power method to compute this eigenvalue is a little beyond the scope of this class, so instead, compute the Google matrix G explicitly for the nyu20.mat data (in contrast to the previous question, where it is important not to compute G explicitly), and compute all the eigenvalues by calling the built-in function eig. Then look at the eigenvalue absolute values using abs (the absolute value of a complex number is the square root of the sum of the squares of the real and imaginary parts). Investigate the truth of the claim in the 4th paragraph of p.11 of the Austin paper: that the second largest of the absolute values of the eigenvalues is alpha. If this claim is true, print output to support it. If it's not, figure out what Austin intended to say instead, printing output to support your claim.
This question demonstrates image compression using low rank approximation of a matrix via the singular value decomposition (SVD). (The SVD is not used for image compression in practice, but it is used in many other applications, such as machine learning.) Write a Matlab function with two input arguments
- a string, which is the name of a JPEG file containing an image to be approximated
- an integer r, which is the rank to be used
and two output arguments
- a string, which is the name of the new JPEG file for the approximated image,
- a struct, which has three fields, e.g. compressed.S, compressed.U, compressed.V, namely the singular values and vectors needed to reconstruct the low-rank approximate picture
The function must have comments explaining the inputs and outputs and what the function does, which is the following:
- import the file into Matlab using imread, where it is stored as a uint8 (8-bit unsigned integer) array
- convert this to the ordinary double array format using double
- compute the singular value decomposition of this array using svd
- compute the rank r approximation matrix from the retained singular values and corresponding singular vectors (see p.160 of the text)
- convert this back to the uint8 format using uint8
- save the low rank image as a new JPEG file using imwrite
- construct the second output argument, which would be useful for saving the approximated image efficiently - do not return all the singular values and vectors, just the first r
Furthermore, if the approximate rank input argument is 0, the function should display the singular values in a semilogy plot and prompt the user to enter the desired approximate rank using input. Test your function on one or more gray-scale images from the web or from your personal collection of pictures and
- check out how good your low rank images are by viewing them outside Matlab
- for one image, print the original and several low rank approximations along with a semilogy plot of the singular values.
The eigenvector decomposition finds scalars &lambda_k and vectors w_k such that A w_k = &lambda_kw_k. The singular value decomposition finds scalars &sigma_k and vectors u_k, v_k such that A v_k = &sigma_ku_k. If you compare the eigenvalue decomposition and the SVD for random A you will not see any relationship between them. However, there is a strong relationship if A is symmetric. What is it? Play with eig and svd until you see the relationship and explain it as well as you can.

Submit the homework, including function listings and everything else requested above, either by hard copy left under my office door or as a single pdf file which should be emailed directly to me.

Reading:

How Google Finds Your Needle in the Web's Haystack, by David Austin
Chapter 7 of Ascher-Greif