Compressing the Web Graph

Representing Web Graphs. Sriram Raghavan and Hector Garcia-Molina.

The Link Database: Fast Access to Graphs of the Web. Keith Randall et al., 2001.

Compressing the Graph Structure of the Web. Torsten Suel and Jun Yuan, 2001.

Efficient and Simple Encodings for the Web Graph. Jean-Loup Guillaume et al., 2002.

Towards Compressing Web Graphs. Micah Adler and Michael Mitzenmacher, 2002.

Recommended: Mining the Web: Discovering Knowledge from Hypertext Data by Soumen Chakrabarti (Morgan Kaufmann).

From last time: Email from Kumar.

Significance of Compression

What to compress

1. URLs
2. Links
(Ignoring anchors, text, etc.)

Compressing URLs

Uncompressed: on average, more than 70 bytes per URL.

Guillaume et al:

Gzip unsorted: 7.27 bytes per URL.
Sort alphabetically, then gzip: 5.55 bytes per URL.
The Kth URL in the list has identifier K.
No locality: converting between IDs and URLs requires decompressing the whole list.

Sort, group into blocks, compress each block independently; keep an index of the first URL and first ID of each block.
For block length = 1000 URLs: 5.62 bytes per URL.
For block length = 100 URLs: 6.43 bytes per URL.

ID to URL or URL to ID:
1. Use index to find block,
2. Decompress block,
3. Linear search in block.
Improve step 3 by padding URLs with blanks to get fixed-length URLs (so an entry within a block can be located by offset rather than a scan).
Good value: 6.54 bytes per URL, 2 ms to convert.
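
A minimal sketch of this block scheme (my reconstruction, not the authors' code), using zlib as the compressor; the class name UrlTable and the default block size are mine:

import bisect
import zlib

class UrlTable:
    """Block-compressed table of sorted URLs; the Kth URL has identifier K."""

    def __init__(self, urls, block_size=1000):
        urls = sorted(urls)
        self.block_size = block_size
        self.first_urls = []        # first URL of each block (the index)
        self.blocks = []            # independently compressed blocks
        for start in range(0, len(urls), block_size):
            chunk = urls[start:start + block_size]
            self.first_urls.append(chunk[0])
            self.blocks.append(zlib.compress("\n".join(chunk).encode()))

    def id_to_url(self, k):
        b, offset = divmod(k, self.block_size)                        # 1. index to find block
        chunk = zlib.decompress(self.blocks[b]).decode().split("\n")  # 2. decompress block
        return chunk[offset]                                          # 3. position within block

    def url_to_id(self, url):
        b = max(bisect.bisect_right(self.first_urls, url) - 1, 0)     # 1. index to find block
        chunk = zlib.decompress(self.blocks[b]).decode().split("\n")  # 2. decompress block
        return b * self.block_size + chunk.index(url)                 # 3. linear search in block

Larger blocks compress better (5.62 vs. 6.43 bytes per URL above) but force each lookup to decompress more.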

Suel and Yuan

Step 1: Sort by URL.

Compression of path names
lcp(i) = length of common prefix of URL[i] with URL[i-1].
dcp(i) = lcp(i) - lcp(i-1).
The page table consists of pairs: dcp(i) and the suffix of URL[i] following the common prefix.
Full paths are stored at the entries pointed to by the index table (every kth entry); other entries are reconstructed by scanning forward from there.

E.g. Suppose
path[100] = shakespeare/comedies,
path[101] = shakespeare/comedies/midsummer_nights_dream,
path[102] = shakespeare/comedies/midsummer_nights_dream/act1,
path[103] = shakespeare/comedies/twelfth_night,
path[104] = shakespeare/histories.
Suppose entry 100 is indexed directly in the index table (so path[100] is stored in full).

Then
lcp(101) = 20, lcp(102) = 43, lcp(103) = 21, lcp(104) = 12;

dcp(101) = 20,  suffix(101) = "/midsummer_nights_dream";
dcp(102) = 23,  suffix(102) = "/act1";
dcp(103) = -22, suffix(103) = "twelfth_night";
dcp(104) = -9,  suffix(104) = "histories".
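
A small sketch (mine, following the definitions above, not Suel and Yuan's code) that produces the dcp/suffix pairs from a run of sorted paths; it reproduces the numbers in the example:

def common_prefix_len(a, b):
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

def front_code(paths):
    """paths: a run of sorted paths; the first one is indexed directly (lcp = 0)."""
    entries = []
    prev_lcp = 0
    for i, path in enumerate(paths):
        lcp = common_prefix_len(path, paths[i - 1]) if i > 0 else 0
        entries.append((lcp - prev_lcp, path[lcp:]))     # (dcp(i), suffix(i))
        prev_lcp = lcp
    return entries

print(front_code(["shakespeare/comedies",
                  "shakespeare/comedies/midsummer_nights_dream",
                  "shakespeare/comedies/midsummer_nights_dream/act1",
                  "shakespeare/comedies/twelfth_night",
                  "shakespeare/histories"]))
# [(0, 'shakespeare/comedies'), (20, '/midsummer_nights_dream'), (23, '/act1'),
#  (-22, 'twelfth_night'), (-9, 'histories')]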

Compress dcp values using Huffman coding.
Compress common words (e.g. "index" "html" etc.) by Huffman codes.
Compress all two-letter sequences by Huffman codes.

Host names are handled similarly. (It would be better if the order of the fields in host names were reversed; in fact, this is a minor misdesign of the URL system.)

Result: 6.49 bytes per URL.
For k=20, getURL = 0.2 msec, getIndex = 1.7 msec.
For k=40, getURL = 0.4 msec, getIndex = 1.7 msec.

Randall et al.

They don't describe their compression scheme, but state a value of 9 bytes per URL. No timings are given.

Compressing links

Huffman-encode links according to the target page's indegree, so that pages of large indegree get short codes.

About 14-15 bits per link (both Adler & Mitzenmacher and Raghavan & Garcia-Molina report figures in this range).
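
As a sketch of the idea (not the code of either paper), here is a standard Huffman construction over page IDs weighted by indegree, so that frequently cited pages get short codewords; the indegree counts in the usage line are made up:

import heapq
from itertools import count

def huffman_codes(indegree):
    """indegree: dict page -> indegree; returns dict page -> code (bit string)."""
    tiebreak = count()                       # keeps the heap from comparing dicts on ties
    heap = [(deg, next(tiebreak), {page: ""}) for page, deg in indegree.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)      # merge the two least frequent subtrees
        w2, _, c2 = heapq.heappop(heap)
        merged = {p: "0" + bits for p, bits in c1.items()}
        merged.update({p: "1" + bits for p, bits in c2.items()})
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]

print(huffman_codes({"A": 1000, "B": 300, "C": 50, "D": 10}))
# "A" (largest indegree) gets the shortest code, e.g. {'A': '1', 'B': '01', ...}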

Guillaume et al.

Represent the graph by adjacency lists of indices.
Compress using bzip: 0.8 bytes per link.
Compress using gzip: 0.83 bytes per link.
No locality.

Break adjacency list into blocks. Compress blocks separately.
1.24 bytes per link, 0.45 msec lookup time.

Locality
Distance = difference of positions in the alphabetically sorted list.
If U links to V, the distribution of the distance D goes like D^(-1.16).

Idea #1: Record distance rather than index. Actually worsens compression (Why?)

Idea #2: Divide into short links (-255 < distance < 255) and long links.
68% of links are short.
Short link requires 10 bits. (8 for magnitude, 1 for sign, 1 to distinguish short from long.)
Further improve using Huffman coding.
Long links require 3 bytes.
Can also distinguish a class of medium links, requiring 2 bytes.
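
A sketch of the short/long split, mine rather than Guillaume et al.'s code; the exact bit layout (flag, sign, then magnitude, with 22 magnitude bits for long links to fill 3 bytes) is my reading of the description above:

def encode_distance(d):
    sign = "1" if d < 0 else "0"
    mag = abs(d)
    if mag <= 255:                  # short link: flag + sign + 8-bit magnitude = 10 bits
        return "0" + sign + format(mag, "08b")
    else:                           # long link: flag + sign + 22-bit magnitude = 3 bytes
        return "1" + sign + format(mag, "022b")

def decode_distance(bits):
    short = bits[0] == "0"
    sign = -1 if bits[1] == "1" else 1
    return sign * int(bits[2:10] if short else bits[2:24], 2)

assert decode_distance(encode_distance(-37)) == -37        # short link
assert decode_distance(encode_distance(100000)) == 100000  # long link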

Method                                     Average link size   Average lookup time
identifiers                                8 bytes
gzipped identifiers                        0.83 bytes
distances                                  4.16 bytes
gzipped distances                          1.1 bytes
gzipped ids, 8-line blocks                 1.61 bytes          0.44 msec
gzipped ids, 16-line blocks                1.36 bytes          0.44 msec
gzipped ids, 32-line blocks                1.24 bytes          0.45 msec
gzipped ids, 64-line blocks                1.20 bytes          2.4 msec
gzipped ids, 128-line blocks               1.21 bytes          5.7 msec
gzipped ids, 256-line blocks               1.21 bytes          16.9 msec
short and long links                       1.89 bytes          20 microsecs
short, medium, and long links              1.89 bytes          20 microsecs
short (Huffman), medium, and long links    1.54 bytes          20 microsecs

Suel and Yuan

Consider a link from page U to page V. Links are divided into several types (see the table below); there is some more fiddly stuff that I omit.

Overall results: 1.74 bytes per link; access time = 0.26 msec per link.

Type of link      Number       Bytes per link
Global absolute   1,636,836    2.64
Global frequent   2,308,253    1.97
Local frequent    1,015,728    1.06
Local distance    9,748,832    1.30

Adler and Mitzenmacher

Theoretical.

Reference encoding: if outlinks(V) is very similar to outlinks(U),
then represent outlinks(V) by a reference to U, plus a list of missing links, plus a list of extra links.

Optimal representation by reference encoding:
Affinity graph:
Each node is a web page with its outlinks, plus a null root node R.
Cost(U -> V) = number of bits needed to encode outlinks(V) in terms of outlinks(U).
Cost(R -> U) = number of bits needed to encode outlinks(U) with no reference.
Find a minimum directed spanning tree of the affinity graph rooted at R.
Compress each page in terms of its parent in the tree.
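
A sketch (mine, not the paper's) of reference encoding and of the resulting affinity-graph edge costs, under a deliberately crude cost model of a fixed number of bits per listed ID; the minimum-arborescence step itself (e.g. via Edmonds' algorithm) is not shown:

BITS_PER_ID = 32            # assumed flat cost of writing one page ID / link ID

def reference_encode(outlinks_v, outlinks_u):
    """Represent outlinks(V) relative to outlinks(U): (missing links, extra links)."""
    return outlinks_u - outlinks_v, outlinks_v - outlinks_u

def cost(outlinks_v, outlinks_u=None):
    """Affinity-graph edge cost: U -> V, or R -> V when no reference is given."""
    if outlinks_u is None:                        # edge from the null root R
        return BITS_PER_ID * len(outlinks_v)
    missing, extra = reference_encode(outlinks_v, outlinks_u)
    return BITS_PER_ID * (1 + len(missing) + len(extra))   # name U, then the two lists

u = {2, 3, 5, 8, 13}
v = {2, 3, 5, 8, 21}
print(cost(v, u))   # 96: reference to U, drop 13, add 21
print(cost(v))      # 160: five link IDs written with no reference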

Algorithm runs in time O(n log n + sum of squares of indegrees). (However, probably not feasible over actual Web graph, because of external memory.)

Suppose you allow a page P to be defined in terms of two pages U and V?
NP-hard even to approximate optimal solution.

Experimental results over sample of web: 8.85 bits per link.
Combined with Huffman coding, gives 8.35 bits per link.

Raghavan and Garcia-Molina

S-Node representation.

Partition the nodes of the graph into classes.
Each class of nodes = a supernode.
There is a superedge from supernode U to supernode V if there is an edge from some element of U to some element of V.
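
A minimal sketch (not from the paper) of how superedges follow from a partition, assuming the partition is given as a map from page to supernode:

def supernode_graph(outlinks, supernode_of):
    """outlinks: dict page -> set of pages; supernode_of: dict page -> supernode id."""
    superedges = set()
    for page, targets in outlinks.items():
        for target in targets:
            superedges.add((supernode_of[page], supernode_of[target]))
    return superedges

# e.g. supernode_graph({"a": {"b", "c"}, "b": {"c"}}, {"a": 0, "b": 0, "c": 1})
#      == {(0, 0), (0, 1)}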

Example:

Observations

Properties

Algorithm

{ Initialize the partition by the top two levels of the URL (e.g. nyu.edu);
  repeat {
     S := a random supernode;
     if S has undergone fewer than 3 URL-splits
        then split S by the next level in the URL hierarchy
     else cluster-split(S)
  } until cluster-split has failed more than abort-max times in a row;
}

cluster-split(S)
{ for (i := 0; i < max_iterations; i++) {
    k := (outdegree of S in the supernode graph) + 2*i;
    C := a k-means clustering of the pages in S in terms of their outlinks to other supernodes;
    if C is an effective clustering then return(C)
  }
  return(failure);
}
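
A rough sketch of cluster-split in Python (mine), using scikit-learn's k-means as a stand-in; the feature vectors (one bit per supernode linked to) follow the description above, but the test for an "effective" clustering is only a placeholder:

import numpy as np
from sklearn.cluster import KMeans

def cluster_split(pages, outlink_supernodes, supernode_ids, outdegree_S,
                  max_iterations=5):
    """pages: pages in S; outlink_supernodes: dict page -> set of supernode ids it
    links to; supernode_ids: all supernode ids appearing in those sets."""
    col = {s: j for j, s in enumerate(supernode_ids)}
    X = np.zeros((len(pages), len(supernode_ids)))
    for i, p in enumerate(pages):
        for s in outlink_supernodes[p]:
            X[i, col[s]] = 1.0                           # one bit per supernode linked to

    for i in range(max_iterations):
        k = outdegree_S + 2 * i
        if k < 2 or k >= len(pages):
            break
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
        if np.bincount(labels, minlength=k).min() > 0:   # placeholder "effective" test
            return labels                                # cluster index for each page in S
    return None                                          # failure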

Compression tricks

Results: 5.07 bits per edge; 5.63 bits per edge for the transpose (indexing inlinks).
Sequential access: 298 nanoseconds per edge. Random access: 702 nanoseconds per edge
(933 MHz Pentium III).

Email from Raghavan:
1) Negative superedges are perhaps 8 to 10% of all superedges.
2) Negative superedges occur when a cluster-split divides up a tightly integrated collection.
3) Construction of the S-node graph over 100 million pages took about 35 hours.

Randall et al.

LINK1: List of indices.

LINK2: List of index differences.
a) Huffman coded.
b) Nibble coded: a sequence of nibbles (4-bit chunks). The first bit of each nibble is a stop flag; the remaining 3 bits of each nibble, strung together, form the number; the final bit of the number is the sign. E.g. 28 = 0111 1000.

The nibble code gives slightly less efficient compression (9 bits per link) than Huffman (8 bits per link), but is much faster (49 nsec vs. 117 nsec).
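
A sketch of the nibble code as I read the description above (the left-padding and the placement of the sign bit are inferred from the single example 28 = 0111 1000):

def nibble_encode(value):
    sign = "1" if value < 0 else "0"
    bits = format(abs(value), "b") + sign          # magnitude bits, then sign bit
    bits = bits.zfill(-(-len(bits) // 3) * 3)      # pad on the left to a multiple of 3
    groups = [bits[i:i + 3] for i in range(0, len(bits), 3)]
    return "".join(("1" if i == len(groups) - 1 else "0") + g   # stop flag + 3 data bits
                   for i, g in enumerate(groups))

def nibble_decode(nibbles):
    data = ""
    for i in range(0, len(nibbles), 4):
        data += nibbles[i + 1:i + 4]
        if nibbles[i] == "1":                      # stop flag set: last nibble
            break
    magnitude, sign = int(data[:-1], 2), data[-1]
    return -magnitude if sign == "1" else magnitude

print(nibble_encode(28))                           # 01111000, as in the example above
assert nibble_decode(nibble_encode(-130)) == -130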

LINK3: Another version of reference encoding.

Measure the average similarity between outlink lists as a function of lexicographic distance: it descends from 0.5 at distance = 1 to 0.32 at distance = 20 and 0.22 at distance = 100.

Average similarity between inlink lists: 0.42 at distance = 1, 0.2 at distance = 12, 0.18 at distance = 20, 0.1 at distance = 80 (eyeballed from a small graph).

General comment (of mine)

Presumably the whole effect of lexicographic distance has to do with common prefix. Perhaps there's some way of using that directly, rather than via lexicographic distance?