Representing Web Graphs Sriram Raghavan and Hector Garcia-Molina

The Link Database: Fast Access to Graphs of the Web Keith Randall et al. 2001

Compressing the Graph Structure of the Web Torsten Suel, Jun Yuan 2001

Efficient and Simple Encodings for the Web Graph Jean-Loup Guillaume et al. 2002

Towards Compressing Web Graphs Micah Adler, Michael Mitzenmacher 2002

Recommended: *Mining the Web: Discovering Knowledge from Hypertext Data* by Soumen Chakrabarti (Morgan Kaufmann)

From last time: Email from Kumar.

- Saving bits in (a) storage (b) transmission.
- Improving external-storage algorithms: combine compression with locality, and trade compression off against locality.
- Compression reveals structure.
- Compressed form measures underlying information.

1. URLs

(Ignoring anchors, text, etc.)

Sort alphabetically, then gzip: 5.55 bytes per URL

Kth URL in list has identifier K.

No locality (no random access to an individual URL without decompressing the whole list).

Instead: sort, group into blocks, compress each block independently, and keep an index giving the first URL and ID of each block.

For block length = 1000 URLs: 5.62 bytes per URL.

For block length = 100 URLs: 6.43 bytes per URL.

ID to URL or URL to ID:

1. Use index to find block,

2. Decompress block,

3. Linear search in block.

Improve step 3 by padding URLs with blanks to get fixed-length URLs.

Good value: 6.54 bytes per URL, 2 ms to convert.
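A minimal sketch of this blocked scheme, assuming zlib for the per-block compression; the class name, the toy URLs, and the 0-based identifiers are illustrative, not the paper's implementation. Both lookups follow the three steps above: find the block via the index, decompress it, scan it.

    import bisect, zlib

    class BlockedUrlTable:
        """Sorted URLs grouped into blocks; each block compressed independently."""

        def __init__(self, urls, block_len=1000):
            urls = sorted(urls)
            self.block_len = block_len
            self.first_url = []              # index: first URL of each block
            self.blocks = []                 # independently compressed blocks
            for start in range(0, len(urls), block_len):
                chunk = urls[start:start + block_len]
                self.first_url.append(chunk[0])
                self.blocks.append(zlib.compress("\n".join(chunk).encode()))

        def _decompress(self, b):
            return zlib.decompress(self.blocks[b]).decode().split("\n")

        def url_to_id(self, url):
            # 1. use the index to find the block; 2. decompress it; 3. linear search.
            b = bisect.bisect_right(self.first_url, url) - 1
            for offset, u in enumerate(self._decompress(b)):
                if u == url:
                    return b * self.block_len + offset
            raise KeyError(url)

        def id_to_url(self, ident):
            b, offset = divmod(ident, self.block_len)
            return self._decompress(b)[offset]

    table = BlockedUrlTable(["http://a.com/x", "http://a.com/y", "http://b.org/"],
                            block_len=2)
    assert table.id_to_url(table.url_to_id("http://b.org/")) == "http://b.org/"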

- Host table: Compressed list of host names (in alpha order)
- Page table: Compressed list of path names (in sorted URL order, i.e. sorted primarily by associated host, secondarily by path).
- Index table: For every kth URL, pointers into the host table and page table.

Compression of path names

lcp(i) = length of common prefix of URL[i] with URL[i-1].

dcp(i) = lcp(i) - lcp(i-1).

Page table consists of pairs: dcp(i) and
the suffix following the common prefix.

The full path is stored at the entries pointed to by the index table (every kth URL).

E.g. Suppose

path[100] = shakespeare/comedies

path[101] = shakespeare/comedies/midsummer_nights_dream

path[102] = shakespeare/comedies/midsummer_nights_dream/act1

path[103] = shakespeare/comedies/twelfth_night

path[104] = shakespeare/histories

Suppose 100 is indexed directly in index table.

Then

lcp(101)=20, lcp(102)=43, lcp(103)=21, lcp(104)=12;

dcp(101) = 20, suffix(101) = "/midsummer_nights_dream";

dcp(102) = 23, suffix(102) = "/act1";

dcp(103) = -22, suffix(103) = "twelfth_night";

dcp(104) = -9, suffix(104) = "histories".
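A small sketch of this front coding, assuming the paths are already in sorted URL order; front_code and decode are illustrative names. Encoding emits (dcp, suffix) pairs, and decoding walks forward from the nearest fully stored entry, exactly as in the example above.

    def common_prefix_len(a, b):
        n = 0
        while n < min(len(a), len(b)) and a[n] == b[n]:
            n += 1
        return n

    def front_code(paths):
        """Encode paths[1:] as (dcp, suffix) pairs relative to the previous path."""
        entries, prev_lcp = [], 0
        for i in range(1, len(paths)):
            lcp = common_prefix_len(paths[i], paths[i - 1])
            entries.append((lcp - prev_lcp, paths[i][lcp:]))   # (dcp, suffix)
            prev_lcp = lcp
        return entries

    def decode(first_path, entries, k):
        """Reconstruct paths[k] from the fully stored paths[0] and the pair list."""
        path, lcp = first_path, 0
        for dcp, suffix in entries[:k]:
            lcp += dcp
            path = path[:lcp] + suffix
        return path

    paths = ["shakespeare/comedies",
             "shakespeare/comedies/midsummer_nights_dream",
             "shakespeare/comedies/midsummer_nights_dream/act1",
             "shakespeare/comedies/twelfth_night",
             "shakespeare/histories"]
    enc = front_code(paths)      # [(20, "/midsummer_nights_dream"), (23, "/act1"), ...]
    assert decode(paths[0], enc, 3) == paths[3]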

Compress dcp values using Huffman coding.

Compress common words (e.g. "index", "html", etc.) by Huffman codes.

Compress all two-letter sequences by Huffman codes.
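A generic Huffman construction of the kind used here for dcp values, common words, and two-letter sequences; a sketch using Python's heapq, with made-up frequencies (the real codes would be built from corpus statistics).

    import heapq
    from collections import Counter

    def huffman_code(freqs):
        """Build a prefix code (symbol -> bit string) from a symbol->frequency map."""
        code = {s: "" for s in freqs}
        if len(freqs) == 1:                       # degenerate single-symbol case
            return {next(iter(freqs)): "0"}
        heap = [(f, i, [s]) for i, (s, f) in enumerate(freqs.items())]
        heapq.heapify(heap)
        while len(heap) > 1:
            f1, _, syms1 = heapq.heappop(heap)
            f2, i2, syms2 = heapq.heappop(heap)
            for s in syms1:
                code[s] = "0" + code[s]           # left branch
            for s in syms2:
                code[s] = "1" + code[s]           # right branch
            heapq.heappush(heap, (f1 + f2, i2, syms1 + syms2))
        return code

    # e.g. code the dcp values observed in a sample of the page table
    dcp_counts = Counter([20, 23, -22, -9, 20, 20, -9])
    print(huffman_code(dcp_counts))               # frequent values get shorter codes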

Host names are handled similarly. (It would work better if the order of the fields in host names were reversed; this is a minor misdesign of the URL system.)

Result: 6.49 bytes per URL.

For index spacing k = 20 (every 20th URL indexed directly): getURL = 0.2 msec, getIndex = 1.7 msec.

For k = 40: getURL = 0.4 msec, getIndex = 1.7 msec.

2. Links

Compress the adjacency lists (as lists of identifiers) using bzip: 0.8 bytes per link.

Compress using gzip: 0.83 bytes per link.

No locality (no random access to a single page's outlinks).

Break adjacency list into blocks. Compress blocks separately.

1.24 bytes per link, 0.45 msec lookup time.

** Locality **

Distance between U and V = difference of their identifiers (i.e. of their positions in alphabetical order).

If U links to V, the distribution of the distance D falls off roughly as D^{-1.16}.

Idea #1: Record distance rather than index. Actually worsens compression (Why?)

Idea #2: Divide into short links (-255 < distance < 255) and long links.

68% of links are short.

Short link requires 10 bits. (8 for magnitude, 1 for sign, 1 to distinguish
short from long.)

Further improve using Huffman coding.

Long links require 3 bytes.

Can also distinguish medium links (2 bytes).
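A bit-level sketch of the short/medium/long idea. The notes give one bit to separate short from long; with a medium class, a two-bit tag is assumed here instead, so the exact bit counts differ slightly from the figures above.

    def encode_distance(d):
        """Encode a link distance as tag + sign + magnitude (assumed layout):
        '0' = short (8-bit magnitude), '10' = medium (16-bit), '11' = long (24-bit)."""
        sign = "1" if d < 0 else "0"
        m = abs(d)                                  # magnitudes above 2^24 not handled
        if m < 1 << 8:
            return "0" + sign + format(m, "08b")    # 10 bits
        if m < 1 << 16:
            return "10" + sign + format(m, "016b")  # 19 bits (roughly 2+ bytes)
        return "11" + sign + format(m, "024b")      # 27 bits (roughly 3+ bytes)

    def decode_distance(bits):
        if bits[0] == "0":
            sign, mag = bits[1], bits[2:10]
        elif bits[:2] == "10":
            sign, mag = bits[2], bits[3:19]
        else:
            sign, mag = bits[2], bits[3:27]
        d = int(mag, 2)
        return -d if sign == "1" else d

    for d in (3, -200, 70000, -12345678):
        assert decode_distance(encode_distance(d)) == d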

Representation | Average link size | Average lookup time
---|---|---
identifiers | 8 bytes |
gzipped identifiers | 0.83 bytes |
distances | 4.16 bytes |
gzipped distances | 1.1 bytes |
gzipped ids, 8-line blocks | 1.61 bytes | 0.44 msec
gzipped ids, 16-line blocks | 1.36 bytes | 0.44 msec
gzipped ids, 32-line blocks | 1.24 bytes | 0.45 msec
gzipped ids, 64-line blocks | 1.20 bytes | 2.4 msec
gzipped ids, 128-line blocks | 1.21 bytes | 5.7 msec
gzipped ids, 256-line blocks | 1.21 bytes | 16.9 msec
short and long links | 1.89 bytes | 20 microsec
short, medium, and long links | 1.89 bytes | 20 microsec
short (Huffman), medium, and long links | 1.54 bytes | 20 microsec

Encoding a link U -> V by cases (Suel and Yuan):

- host(V) != host(U) (a global link):
  - V is one of the *P* most popular pages overall (greatest number of inlinks): use a Huffman code ("global frequent").
  - Otherwise ("global absolute"):
    - If U has fewer than 4 outlinks to external hosts: store V's identifier directly.
    - If U has 4 or more outlinks to external hosts: sort the ids in increasing order, compute successive differences, and use Golomb coding (short codes for small numbers; see the sketch after this list).
- host(V) = host(U) (a local link):
  - V is in the top 1% most popular pages within its host (by indegree within the host): use a Huffman code ("local frequent").
  - V is not an especially popular page within its host ("local distance"):
    - Collect statistics about the distance distribution for various classes (10-15) of hosts;
    - Develop a code for representing the distance within the host.
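A minimal sketch of the last global case (gaps plus Golomb coding), assuming the power-of-two special case of Golomb coding (Rice coding) with a fixed parameter m; in practice m would be tuned to the observed gap distribution.

    def golomb_encode(n, m):
        """Golomb code, power-of-two (Rice) special case: unary quotient
        (q ones, then a zero) followed by the remainder in log2(m) bits."""
        assert m >= 2 and m & (m - 1) == 0, "sketch assumes m is a power of two"
        q, r = divmod(n, m)
        return "1" * q + "0" + format(r, "0{}b".format(m.bit_length() - 1))

    def encode_gaps(ids, m=16):
        """Sort the target ids and Golomb-code the successive differences
        (the first id itself would be stored separately)."""
        ids = sorted(ids)
        gaps = [b - a for a, b in zip(ids, ids[1:])]
        return "".join(golomb_encode(g, m) for g in gaps)

    print(encode_gaps([90210, 90214, 90301, 90215]))   # gaps 4, 1, 86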

** Results **
Overall: 1.74 bytes per link. Access time = 0.26 msec / link.

Type of link | Number | Bytes per link
---|---|---
Global absolute | 1,636,836 | 2.64
Global frequent | 2,308,253 | 1.97
Local frequent | 1,015,728 | 1.06
Local distance | 9,748,832 | 1.30

Reference encoding (Adler and Mitzenmacher): if outlinks(V) is very similar to outlinks(U), then represent outlinks(V) by a reference to U plus a list of missing links plus a list of extra links.
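A minimal sketch of reference encoding for one adjacency list: the representation is just (reference id, links of U missing from V, extra links of V); the function names and ids are illustrative.

    def reference_encode(outlinks_v, outlinks_u, u_id):
        """Represent outlinks(V) relative to outlinks(U): (reference id,
        links of U missing from V, extra links of V not in U)."""
        v, u = set(outlinks_v), set(outlinks_u)
        return (u_id, sorted(u - v), sorted(v - u))

    def reference_decode(encoded, outlinks_u):
        u_id, missing, extra = encoded
        return sorted((set(outlinks_u) - set(missing)) | set(extra))

    outlinks_u = [3, 17, 42, 99]
    outlinks_v = [3, 17, 42, 100]
    enc = reference_encode(outlinks_v, outlinks_u, u_id=7)   # (7, [99], [100])
    assert reference_decode(enc, outlinks_u) == sorted(outlinks_v)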

Optimal representation by reference encoding:

** Affinity graph **

Node is a web page with outlinks. Null root node R.

Cost(U -> V) = number of bits needed to encode outlinks(U) in terms
of outlinks(V)

Cost(R -> U) = number of bits to encode outlinks(U) with no reference.

Find minimum directed spanning tree for affinity graph rooted at R.

Compress each page in terms of its parent in tree.

Algorithm runs in time O(n log n + sum of squares of indegrees). (However, probably not feasible over the actual Web graph, because the computation would have to run in external memory.)
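A greedy sketch of the parent-selection step, not the minimum-arborescence computation described above: the cost model (a flat number of bits per listed or changed link) is an assumption, and cycles are avoided simply by letting a page reference only pages scanned earlier.

    def cost_no_reference(outlinks, bits_per_link=32):
        """Assumed cost of listing the outlinks directly (i.e. an edge from R)."""
        return bits_per_link * len(outlinks)

    def cost_by_reference(outlinks_u, outlinks_v, bits_per_link=32):
        """Assumed cost of encoding outlinks_u relative to outlinks_v
        (one reference id plus the symmetric difference of the lists)."""
        u, v = set(outlinks_u), set(outlinks_v)
        return bits_per_link * (1 + len(u - v) + len(v - u))

    def choose_parents(adj):
        """adj: {page: [outlinks]}. Returns {page: parent page, or None for root R}."""
        parents, assigned = {}, []
        for page in adj:                      # fixed scan order keeps the result acyclic
            best_parent, best_cost = None, cost_no_reference(adj[page])
            for cand in assigned:             # only pages that already have a parent
                c = cost_by_reference(adj[page], adj[cand])
                if c < best_cost:
                    best_parent, best_cost = cand, c
            parents[page] = best_parent
            assigned.append(page)
        return parents

    adj = {"a": [1, 2, 3], "b": [1, 2, 3, 4], "c": [9]}
    print(choose_parents(adj))   # {'a': None, 'b': 'a', 'c': None}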

Suppose you allow a page P to be defined by reference to two pages U and V?

Then it is NP-hard even to approximate the optimal solution.

Experimental results over sample of web: 8.85 bits per link.

Combined with Huffman coding, gives 8.35 bits per link.

Partition the nodes of the graph into classes (Raghavan and Garcia-Molina).

Each class of nodes = supernode.

Superedge from supernode U to V if there is an edge from some element of U to some element of V.

- Intranode graph: edges within a supernode.
- Positive superedge graph: for superedge U -> V, all edges from an element of U to an element of V.
- Negative superedge graph: for superedge U -> V, all pairs (a, b) with a in U, b in V, and no arc a -> b. (A construction sketch follows.)
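A small sketch of building these structures from an edge list and a given page-to-supernode partition; for brevity only the supernode edges, the intranode graphs, and the positive superedge graphs are constructed (the negative superedge graphs are omitted).

    from collections import defaultdict

    def build_snode_graphs(edges, part):
        """edges: list of (u, v) page links; part: {page: supernode id}.
        Returns (supernode edges, intranode graphs, positive superedge graphs)."""
        super_edges = set()
        intranode = defaultdict(list)          # supernode -> edges inside it
        positive = defaultdict(list)           # (superU, superV) -> edges between them
        for u, v in edges:
            su, sv = part[u], part[v]
            if su == sv:
                intranode[su].append((u, v))
            else:
                super_edges.add((su, sv))
                positive[(su, sv)].append((u, v))
        return super_edges, dict(intranode), dict(positive)

    edges = [(1, 2), (2, 3), (1, 4), (4, 5)]
    part = {1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
    print(build_snode_graphs(edges, part))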

Observations

- Link copying
- Domain and URL Locality: Links within domain and to nearby URLs (lexicographically)
- Page similarity: Pages with similar links have similar subjects.

Properties

- Pages with similar adjacency lists are grouped together as much
as possible.

- All pages in same supernode are from same domain (top two levels
of URL).

- Among pages from same domain, pages lexicographically close are likely to be grouped together.

Algorithm

    {
      Initialize partition by top two levels of URL (e.g. nyu.edu);
      repeat {
        S := random supernode;
        if S has undergone fewer than 3 URL-splits
          then split S by next level in URL hierarchy;
          else cluster-split(S);
      } until cluster-split has failed more than abort-max times in a row;
    }

    cluster-split(S) {
      for (i := 0; i < max_iterations; i++) {
        k := outdegree(S) in supernode graph + 2*i;
        C := a k-means clustering of S in terms of outlinks to other S-nodes;
        if C is an effective clustering, then return(C);
      }
      return(failure);
    }

** Compression tricks **

- Huffman code nodes of high indegree
- Reference encoding, as in Adler and Mitzenmacher
- Number pages by supernode.
- Other standard stuff.

**Results:**
5.07 bits per edge.

5.63 bits per edge for the transpose (indexing inlinks).

Sequential access: 298 nanoseconds/edge. Random access: 702 nanoseconds/edge.

(933 MHz Pentium III)

Email from Raghavan

1) Negative superedges are perhaps 8 to 10% of all superedges.

2) Negative superedges occur when a cluster-split divides up a tightly integrated collection.

3) Construction of the S-node graph over 100 million pages took about 35 hours.

LINK2 (Randall et al.): each adjacency list is stored as a list of index differences.

a) Huffman coded

b) Nibble coded. A number is written as a sequence of nibbles (4-bit chunks): the first bit of each nibble is a stop flag (set on the last nibble); the remaining 3 bits of the nibbles, strung together, give the number; and the final bit of the number is the sign. E.g. 28 = 0111 1000.
The nibble code gives slightly worse compression than Huffman (9 vs. 8 bits per link) but is much faster to decode (49 nsec vs. 117 nsec per link).
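A sketch of the nibble code under the reading above (stop flag set on the last nibble, sign as the low-order bit of the encoded value); the exact bit conventions are an assumption, but the encoding reproduces the 28 = 0111 1000 example.

    def nibble_encode(d):
        """Encode a signed integer d as a list of 4-bit nibbles.
        The encoded value is |d| with the sign appended as its lowest bit;
        each nibble carries 3 value bits, and its top bit flags the last nibble."""
        value = (abs(d) << 1) | (1 if d < 0 else 0)
        groups = []
        while True:
            groups.append(value & 0b111)
            value >>= 3
            if value == 0:
                break
        groups.reverse()                                  # most-significant 3 bits first
        last = len(groups) - 1
        return [(0b1000 if i == last else 0) | g for i, g in enumerate(groups)]

    def nibble_decode(nibbles):
        value = 0
        for n in nibbles:
            value = (value << 3) | (n & 0b111)
            if n & 0b1000:                                # stop flag: last nibble
                break
        return -(value >> 1) if value & 1 else value >> 1

    assert [format(n, "04b") for n in nibble_encode(28)] == ["0111", "1000"]
    for d in (0, 1, -1, 28, -500, 12345):
        assert nibble_decode(nibble_encode(d)) == d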

LINK3: Another version of reference encoding.

Measure the average similarity between outlink lists as a function of lexicographic distance: it descends from 0.5 at distance 1 to 0.32 at distance 20 and 0.22 at distance 100.

Average similarity between inlink lists: 0.42 at distance 1, 0.2 at distance 12, 0.18 at distance 20, 0.1 at distance 80 (eyeballed from a small graph).