Lecture 13: Specialized Systems

Course Evaluations

Required Reading

DEADLINER: Building a New Niche Search Engine, Andries Kruger et al., 2000

DEADLINER: a search engine for conference announcements.

Objectives: (A) to build a tool; (B) to work toward a toolkit.

Document retrieval

Filtering

Finding proper paragraph

Rich collection of individually weak binary filters. A Neyman-Pearson procedure finds rules for combining K filters that optimally trade off precision vs. recall.
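As a rough illustration (not the paper's exact procedure), one way to combine K binary filters in a Neyman-Pearson spirit is to score each paragraph by a likelihood ratio, assuming the filters are independent, and pick the smallest threshold whose false-positive rate on the negative training examples stays under a target alpha:

import numpy as np

def fit_filter_combination(F, y, alpha=0.02):
    # F: (n_paragraphs, K) 0/1 numpy array of filter outputs;
    # y: 0/1 numpy array, 1 = paragraph of the target type.
    eps = 0.5                                    # Laplace smoothing
    p_pos = (F[y == 1].sum(axis=0) + eps) / ((y == 1).sum() + 2 * eps)
    p_neg = (F[y == 0].sum(axis=0) + eps) / ((y == 0).sum() + 2 * eps)

    def score(f):                                # log-likelihood ratio over the K filters
        return np.sum(f * np.log(p_pos / p_neg)
                      + (1 - f) * np.log((1 - p_pos) / (1 - p_neg)), axis=-1)

    # Smallest threshold whose false-positive rate is at most alpha.
    neg_scores = np.sort(score(F[y == 0]))
    threshold = neg_scores[int(np.ceil((1 - alpha) * len(neg_scores))) - 1]
    return score, threshold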

Sample Filters:

Title:
Match keywords in paragraph.
Match "on" in paragraph.
At most 30 words in paragraph. (Similar filters for other counts)
At least 75% of words are capitalized.
Change in indentation from previous/next paragraph
Date found in this paragraph.
Date found within next 5 paragraphs.
One of the first 7 paragraphs
Country name found in this paragraph
At least 20 white-space or non-alphanumeric characters (??)

Deadline:
Match the words "deadline", "by", "before", "later than", "on", "submit", "paper", ":", "important date", "deadline qualifiers" (each of these is a separate filter).
Find a date.

Representation of filter: Regular expression augmented with counting operators, arithmetic operators, format detection operators.

min(indentation(cur_par)) != min(indentation(prev_par))
#(cur_par, /[A-Z]\w*@+/) / #(cur_par, /\w+@+/) >= 0.75
(The number of capitalized words divided by the total number of words is at least 0.75; @ denotes whitespace.)
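A rough Python equivalent of the capitalization filter (an illustration only; the paper's actual filter language and operators differ):

import re

def mostly_capitalized(paragraph, ratio=0.75):
    # True if at least `ratio` of the words in the paragraph start with a capital letter.
    words = re.findall(r"\w+", paragraph)
    if not words:
        return False
    return sum(w[0].isupper() for w in words) / len(words) >= ratio

# e.g. mostly_capitalized("Tenth International Conference On Knowledge Discovery") -> True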

Extracting information from paragraph

Specialized heuristics. E.g.

Program committee

Performance

SVM Filtering.
Corpus: 592 Calls for Papers, 2269 negative examples, mostly conference-related pages.
Training set: 249 positive, 1250 negative examples.
Test set: 343 positive, 1019 negative examples.
Results: Linear SVM: Positive accuracy = 88.1%. Negative accuracy = 98.7%.
Gaussian SVM: Positive accuracy = 95.9%. Negative accuracy = 98.6%.
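A hedged sketch of this kind of experiment with scikit-learn (the paper's features, kernel parameters, and preprocessing are not specified here; docs_* and y_* are hypothetical corpora matching the splits above):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def evaluate_svm_filters(docs_train, y_train, docs_test, y_test):
    # Train a linear and a Gaussian (RBF) SVM on page text and report
    # positive/negative accuracy, in the spirit of the results above.
    y_test = np.asarray(y_test)
    results = {}
    for kernel in ("linear", "rbf"):                  # rbf ~ Gaussian kernel
        clf = make_pipeline(TfidfVectorizer(), SVC(kernel=kernel))
        clf.fit(docs_train, y_train)
        pred = clf.predict(docs_test)
        results[kernel] = ((pred[y_test == 1] == 1).mean(),   # positive accuracy
                           (pred[y_test == 0] == 0).mean())   # negative accuracy
    return results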

Finding correct paragraph
The paper provides only a hard-to-read graph, no exact numbers. But doing the best I can:

For deadline:
Best single filter at 2% false positives gives 88% recall rate. (They don't say which this is, but probably the word "deadline".)
Best 4 filters combined at 2% false positives gives 92% recall rate.

Program committee even more successful.

Information extraction
Target                       Total   Detected/Extracted   Detected/Not Extracted   Extraneous
Deadline                       300                  214                        2           31
Committee and Affiliation     1455                 1252                       72          136

CORA

Building Domain-Specific Search Engines with Machine Learning Techniques, Andrew McCallum et al., 1999

CORA: a domain-specific search engine for computer science research papers.
(Not currently running.) 33,000 papers as of 1999; 50,000 as of some later date.

Stage 1: Spidering.
Stage 2: Classification into Hierarchy.
Stage 3: Information extraction
Stage 4: User Interface: Keyword search, Field search, Hierarchy

Spidering

Starting crawl from home pages of CS departments and labs.

Learn to choose links to follow based on words surrounding anchor.
1. Assign values to pages thus far spidered by discounted future reward (reinforcement learning).
2. Use Naive Bayes to predict discretized values from bag of words surrounding anchor.
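A rough sketch of the two steps (assumed details: geometric discounting, bag-of-words Naive Bayes from scikit-learn, and a fixed binning of values into discrete classes):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def discounted_value(future_rewards, gamma=0.5):
    # Step 1 (simplified): value of following a link = sum of gamma^t * reward
    # of the pages reached t steps later on the crawl path.
    return sum(gamma ** t * r for t, r in enumerate(future_rewards))

def train_link_scorer(anchor_contexts, values, bins=(0.1, 1.0, 10.0)):
    # Step 2: Naive Bayes over the words surrounding the anchor, predicting
    # the discretized (binned) value of following that link.
    labels = np.digitize(values, bins)
    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(anchor_contexts, labels)
    return model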

Experimental results: Finds relevant papers in test set much more quickly than breadth-first search. E.g. finds 75% of relevant papers after searching 11% of links vs. 30% of links for breadth-first search.
Training set = about 40,000 pages, 450,000 links.

Hierarchical classification

Hand-constructed fixed hierarchy with keywords: a 51-leaf hierarchy with a few keywords per node. Construction took 3 hours per node, examining conference proceedings and CS websites.

Classification:

Information extraction

Estimate the probability that a given string in the text expresses a particular field using a Hidden Markov Model.

Collect 500 research papers.
Manually tag 500 references with one of 13 classes: title, author, institution, etc. Use 300 as training set, 200 as test set.
Construct HMMs using a series of increasingly complex techniques.
Final accuracy: about 93%.
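A minimal word-level Viterbi decoder, as a sketch of the extraction step (the paper's HMM structure, state set, and smoothing are more elaborate):

import numpy as np

def viterbi(tokens, states, log_start, log_trans, log_emit):
    # tokens: the words of one reference; states: e.g. ["title", "author", ...].
    # log_start[s], log_trans[s', s]: log-probabilities (numpy arrays);
    # log_emit[s]: dict word -> log-probability, with an "<unk>" entry.
    n, k = len(tokens), len(states)
    dp = np.full((n, k), -np.inf)
    back = np.zeros((n, k), dtype=int)
    for s in range(k):
        dp[0, s] = log_start[s] + log_emit[s].get(tokens[0], log_emit[s]["<unk>"])
    for t in range(1, n):
        for s in range(k):
            prev = dp[t - 1] + log_trans[:, s]
            back[t, s] = int(np.argmax(prev))
            dp[t, s] = prev[back[t, s]] + log_emit[s].get(tokens[t], log_emit[s]["<unk>"])
    seq = [int(np.argmax(dp[-1]))]
    for t in range(n - 1, 0, -1):
        seq.append(back[t, seq[-1]])
    return [states[s] for s in reversed(seq)]        # one label per token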

Compressing the Web Graph

Representing Web Graphs, Sriram Raghavan and Hector Garcia-Molina

Compressing the Graph Structure of the Web, Torsten Suel and Jun Yuan, 2001

Efficient and Simple Encodings for the Web Graph, Jean-Loup Guillaume et al., 2002

Towards Compressing Web Graphs, Micah Adler and Michael Mitzenmacher, 2002

Significance of Compression

What to compress

1. URLs
2. Links
(Ignoring anchors, text, etc.)

Compressing URLs

Uncompressed: Average more than 70 bytes per URL.

Guillaume et al:

Gzip unsorted: 7.27 bytes per URL.
Sort alphabetically, then gzip: 5.55 bytes per URL
Kth URL in list has identifier K.
No locality: looking up a single URL requires decompressing the whole list.

Sort, group into blocks, compress each block independently, and keep an index of each block's first URL and first ID.
For block length = 1000 URLs: 5.62 bytes per URL.
For block length = 100 URLs: 6.43 bytes per URL.

ID to URL or URL to ID:
1. Use index to find block,
2. Decompress block,
3. Linear search in block.
Improve step 3 by padding URLs with blanks to get fixed-length URLs.
Good value: 6.54 bytes per URL, 2 ms to convert.
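A sketch of the block scheme with zlib as the compressor (block size and layout are illustrative; sorted_urls must be alphabetically sorted):

import bisect, zlib

class BlockedUrlTable:
    # Sorted URLs compressed in blocks; the index keeps each block's first URL,
    # and the Kth URL in the sorted list has identifier K, as described above.
    def __init__(self, sorted_urls, block_size=1000):
        self.block_size = block_size
        self.first_urls, self.blocks = [], []
        for i in range(0, len(sorted_urls), block_size):
            block = sorted_urls[i:i + block_size]
            self.first_urls.append(block[0])
            self.blocks.append(zlib.compress("\n".join(block).encode()))

    def url(self, ident):                            # ID -> URL
        return self._block(ident // self.block_size)[ident % self.block_size]

    def ident(self, url):                            # URL -> ID
        b = bisect.bisect_right(self.first_urls, url) - 1
        return b * self.block_size + self._block(b).index(url)   # linear search in block

    def _block(self, b):                             # decompress one block
        return zlib.decompress(self.blocks[b]).decode().split("\n")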

Suel and Yuan

Step 1: Sort by URL.

Compression of path names
lcp(i) = length of common prefix of URL[i] with URL[i-1].
dcp(i) = lcp(i) - lcp(i-1).
Page table consists of pairs: dcp(i) and the suffix following the common prefix.
Full path is stored at entries pointed to by index table.

E.g. Suppose
path[100] = shakespeare/comedies
path[101] = shakespeare/comedies/midsummer_nights_dream
path[102] = shakespeare/comedies/midsummer_nights_dream/act1
path[103] = shakespeare/comedies/twelfth_night
path[104] = shakespeare/histories
and suppose path[100] is indexed directly in the index table.

Then
lcp(101) = 20, lcp(102) = 43, lcp(103) = 21, lcp(104) = 12;
dcp(101) = 20, suffix(101) = "/midsummer_nights_dream";
dcp(102) = 23, suffix(102) = "/act1";
dcp(103) = -22, suffix(103) = "twelfth_night";
dcp(104) = -9, suffix(104) = "histories".

Compress dcp values using Huffman coding.
Compress common words (e.g. "index" "html" etc.) by Huffman codes.
Compress all two-letter sequences by Huffman codes.
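A sketch of the lcp/dcp encoding of a sorted path list (the Huffman coding of dcp values, common words, and letter pairs is omitted); on the Shakespeare example above it reproduces the dcp and suffix values shown:

def common_prefix_len(a, b):
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return n

def front_code(paths):
    # Encode sorted paths as (dcp, suffix) pairs; the first path plays the
    # role of an entry stored in full via the index table.
    entries, prev_lcp = [(0, paths[0])], 0
    for i in range(1, len(paths)):
        lcp = common_prefix_len(paths[i], paths[i - 1])
        entries.append((lcp - prev_lcp, paths[i][lcp:]))
        prev_lcp = lcp
    return entries

def front_decode(entries):
    # Recover the full paths from the (dcp, suffix) pairs.
    paths, lcp = [entries[0][1]], 0
    for dcp, suffix in entries[1:]:
        lcp += dcp
        paths.append(paths[-1][:lcp] + suffix)
    return paths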

Host names are handled similarly. (Would be better if the order of fields in host names were reversed; in fact, this is a minor misdesign of the URL system.)

Result: 6.49 bytes per URL.
For k=20, getURL = 0.2 msec, getIndex = 1.7 msec.
For k=40, getURL = 0.4 msec, getIndex = 1.7 msec.

Compressing links

Huffman encode pages of large indegree

about 14-15 bits per link (both Adler & Mitzenmacher and Raghavan & Garcia-Molina)

Guillaume et al.

Represent the graph by adjacency lists of indices.
Compress using bzip: 0.8 bytes per link.
Compress using gzip: 0.83 bytes per link.
No locality.

Break adjacency list into blocks. Compress blocks separately.
1.24 bytes per link, 0.45 msec lookup time.

Locality
Distance = difference in alphabetical order.
If U links to V, the distribution of the distance D goes like D^(-1.16).

Idea #1: Record distance rather than index. Actually worsens compression (Why?)

Idea #2: Divide into short links (-255 < distance < 255) and long links.
68% of links are short.
Short link requires 10 bits. (8 for magnitude, 1 for sign, 1 to distinguish short from long.)
Further improve using Huffman coding.
Long links require 3 bytes.
Can also distinguish medium links = 2 bytes.
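A bit-level sketch of the short/long split (the Huffman coding of short links and the medium class are omitted; the exact layout is illustrative):

def encode_link(distance):
    # Short links (-255 < distance < 255): 10 bits = 1 flag + 1 sign + 8 magnitude bits.
    # Other links: 1 flag bit + 24 bits of two's-complement distance (~3 bytes).
    if -255 < distance < 255:
        return "0" + ("1" if distance < 0 else "0") + format(abs(distance), "08b")
    return "1" + format(distance & 0xFFFFFF, "024b")

def decode_link(bits):
    if bits[0] == "0":
        magnitude = int(bits[2:10], 2)
        return -magnitude if bits[1] == "1" else magnitude
    value = int(bits[1:25], 2)
    return value - (1 << 24) if value >= (1 << 23) else value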

Encoding                                   Average link size   Average lookup time
identifiers                                8 bytes
gzipped identifiers                        0.83 bytes
distances                                  4.16 bytes
gzipped distances                          1.1 bytes
gzipped ids, 8-line blocks                 1.61 bytes          0.44 msec
gzipped ids, 16-line blocks                1.36 bytes          0.44 msec
gzipped ids, 32-line blocks                1.24 bytes          0.45 msec
gzipped ids, 64-line blocks                1.20 bytes          2.4 msec
gzipped ids, 128-line blocks               1.21 bytes          5.7 msec
gzipped ids, 256-line blocks               1.21 bytes          16.9 msec
short and long links                       1.89 bytes          20 microsec
short, medium, and long links              1.89 bytes          20 microsec
short (Huffman), medium, and long links    1.54 bytes          20 microsec

Suel and Yuan

Each link from a page U to a page V is encoded according to its type (table below). Some more fiddly stuff that I omit.

Results Overall: 1.74 bytes per link. Access time = 0.26 msec / link.
Type of link      Number       Bytes per link
Global absolute   1,636,836    2.64
Global frequent   2,308,253    1.97
Local frequent    1,015,728    1.06
Local distance    9,748,832    1.30

Adler and Mitzenmacher

Theoretical.

Reference encoding: if outlinks(V) is very similar to outlinks(U),
then represent outlinks(V) by a reference to U plus a list of missing links and a list of extra links.
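A sketch of reference encoding for a single page (the missing and extra lists are kept as explicit ID lists; a real encoder would compress them further):

def reference_encode(outlinks_v, outlinks_u):
    # Represent outlinks(V) relative to outlinks(U): U's links that V lacks
    # ("missing") plus V's links that U lacks ("extra").
    u, v = set(outlinks_u), set(outlinks_v)
    return sorted(u - v), sorted(v - u)

def reference_decode(outlinks_u, missing, extra):
    return sorted((set(outlinks_u) - set(missing)) | set(extra))

The cost of an edge in the affinity graph below is then roughly the number of bits needed to write down these two lists.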

Optimal representation by reference encoding:
Affinity graph
Node is a web page with outlinks. Null root node R.
Cost(U -> V) = number of bits needed to encode outlinks(V) in terms of outlinks(U)
Cost(R -> U) = number of bits to encode outlinks(U) with no reference.
Find minimum directed spanning tree for affinity graph rooted at R.
Compress each page in terms of its parent in tree.

Algorithm runs in time O(n log n + sum of squares of indegrees). (However, probably not feasible over actual Web graph, because of external memory.)
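A sketch of the spanning-tree step using networkx (the cost here is just the size of the missing/extra lists rather than an exact bit count, and the all-pairs affinity graph is quadratic, so this is for illustration only):

import networkx as nx

def reference_tree(outlinks):
    # outlinks: dict page -> set of outlink IDs. Build the affinity graph with
    # a null root "R" and return a minimum spanning arborescence; each page is
    # then compressed relative to its parent in the tree.
    g = nx.DiGraph()
    for u in outlinks:
        g.add_edge("R", u, weight=len(outlinks[u]))        # encode U with no reference
        for v in outlinks:
            if u != v:
                g.add_edge(u, v, weight=len(outlinks[u] ^ outlinks[v]))  # encode V via U
    return nx.minimum_spanning_arborescence(g)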

Suppose you allow definition of page P in terms of two pages U and V?
NP-hard even to approximate optimal solution.

Experimental results over sample of web: 8.85 bits per link.
Combined with Huffman coding, gives 8.35 bits per link.

Raghavan and Garcia-Molina

S-Node representation.

Partition the nodes of the graph into classes.
Each class of nodes = supernode.
Superedge from supernode U to V if there is an edge from any element of U to some element of V.
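A sketch of building the supernode graph from a given partition (partition: dict page -> supernode ID; edges: iterable of (source, target) pairs):

from collections import defaultdict

def build_supergraph(edges, partition):
    # Collapse each class of the partition into a supernode and add a
    # superedge U -> V whenever some page in U links to some page in V.
    supernodes = defaultdict(set)
    for page, cls in partition.items():
        supernodes[cls].add(page)
    superedges = {(partition[a], partition[b]) for a, b in edges}
    return supernodes, superedges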

Example:

Observations

Properties

Algorithm

{ Initialize partition by top two levels of URL (e.g. nyu.edu);
  repeat {
     S := random supernode;
     if S has undergone fewer than 3 URL-splits
        then split S by the next level in the URL hierarchy
     else cluster-split(S)
  } until cluster-split has failed more than abort-max times in a row;
}

cluster-split(S)
{ for (i := 0; i < max_iterations; i++) {
    k := outdegree(S) in supernode graph + 2*i;
    C := a k-means clustering of S in terms of outlinks to other S-nodes;
    if C is an effective clustering then return(C)
  }
  return(failure); }
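A sketch of cluster-split with scikit-learn's k-means on binary outlink vectors (is_effective stands in for the paper's test of clustering quality, which is not spelled out here):

import numpy as np
from sklearn.cluster import KMeans

def cluster_split(pages, outlink_vectors, out_degree, max_iterations=5,
                  is_effective=lambda labels: len(set(labels)) > 1):
    # pages: page IDs in supernode S; outlink_vectors: one 0/1 row per page,
    # marking which other supernodes it links to; out_degree: S's outdegree
    # in the supernode graph.
    X = np.asarray(outlink_vectors)
    for i in range(max_iterations):
        k = out_degree + 2 * i
        if k < 2 or k > len(pages):
            return None                                    # return(failure)
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
        if is_effective(labels):
            return {page: int(c) for page, c in zip(pages, labels)}
    return None                                            # return(failure)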

Compression tricks

Results: 5.07 bits per edge:
5.63 bits per edge for transpose (indexing inlinks).
Sequential access: 298 nanoseconds/edge. Random access: 702 nanoseconds/edge
(933 MHz Pentium III)

Email from Raghavan
1) Negative superedges are perhaps 8 to 10% of all superedges.
2) Negative superedges occur when a cluster-split divides up a tightly integrated collection.
3) Construction of the S-node graph over 100 million pages took about 35 hours.