Web Search Engines: Architecture and Crawler

Required Reading

Additional Reading

Structure of a Search Engine

Google hit counts

Parenthetically: Google hit counts are quite unreliable and should never be used as a serious measure.

Crawlers

URL Normalization

E.g., on this page the links
href="index.html",
href="http://www.cs.nyu.edu/courses/spring16/CSCI-GA.2580-001/", and
href="http://www.cs.nyu.edu/courses/spring16/CSCI-GA.2580-001/index.html"
all point to the same page.
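
A minimal sketch of the kind of normalization a crawler might do, using Python's standard urllib.parse. The particular rules here (lowercasing the host, dropping the default port, treating .../index.html as .../, discarding fragments) are illustrative assumptions, not any engine's actual canonicalization policy.

    from urllib.parse import urljoin, urlsplit, urlunsplit

    def normalize(base_url, href):
        """Resolve a link against its page and reduce it to one canonical form.
        The specific rules below are illustrative, not a real engine's policy."""
        url = urljoin(base_url, href)                  # resolve relative links
        scheme, netloc, path, query, _ = urlsplit(url) # discard the fragment
        netloc = netloc.lower()
        if scheme == "http" and netloc.endswith(":80"):
            netloc = netloc[:-3]                       # drop the default port
        if path.endswith("/index.html"):
            path = path[: -len("index.html")]          # .../index.html -> .../
        if not path:
            path = "/"
        return urlunsplit((scheme, netloc, path, query, ""))

    base = "http://www.cs.nyu.edu/courses/spring16/CSCI-GA.2580-001/index.html"
    for href in ["index.html",
                 "http://www.cs.nyu.edu/courses/spring16/CSCI-GA.2580-001/",
                 "http://www.cs.nyu.edu/courses/spring16/CSCI-GA.2580-001/index.html"]:
        print(normalize(base, href))   # all three print the same canonical URL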

Courtesy toward server

Don't overload servers: issue at most one request at a time to a given server, and leave a time gap between requests. A sketch of this follows.
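
A minimal sketch of per-host politeness using only the standard library; the two-second gap is an arbitrary illustrative choice, not a recommended value.

    import time
    import urllib.request
    from urllib.parse import urlsplit

    last_fetch = {}   # host -> time of the most recent request to that host
    MIN_GAP = 2.0     # seconds between requests to one host (arbitrary choice)

    def polite_fetch(url):
        """Fetch a page, sleeping first if we hit the same server too recently."""
        host = urlsplit(url).netloc
        wait = last_fetch.get(host, 0.0) + MIN_GAP - time.time()
        if wait > 0:
            time.sleep(wait)          # leave a gap between queries to this server
        last_fetch[host] = time.time()
        with urllib.request.urlopen(url) as resp:
            return resp.read()
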
Robot Exclusion Standard
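
Python's standard library includes a parser for robots.txt, so a crawler can check each URL before fetching it. A short sketch (the user-agent string "MyCrawler" and the example paths are assumptions for illustration):

    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser("http://www.cs.nyu.edu/robots.txt")
    rp.read()   # download and parse the site's robots.txt

    if rp.can_fetch("MyCrawler", "http://www.cs.nyu.edu/courses/"):
        print("allowed under the Robot Exclusion Standard")
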

Parallelism in Crawling
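
One illustrative way (a sketch, not a description of any production crawler) to combine parallel workers with per-host politeness: hash each host to a single worker, so that all URLs from one server are fetched by one thread and the politeness gap only has to be enforced locally.

    import threading, queue
    from urllib.parse import urlsplit

    NUM_WORKERS = 8
    frontiers = [queue.Queue() for _ in range(NUM_WORKERS)]

    def assign(url):
        """Route all URLs from one host to the same worker's frontier queue."""
        host = urlsplit(url).netloc
        frontiers[hash(host) % NUM_WORKERS].put(url)

    def worker(my_queue):
        while True:
            url = my_queue.get()
            # fetch(url), extract its links, call assign() on each new URL ...
            my_queue.task_done()

    for q in frontiers:
        threading.Thread(target=worker, args=(q,), daemon=True).start()
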

Duplicate and Near-Duplicate Pages

We will discuss these in a later lecture.

Refresh Strategies

Crawling is an eternal process: new pages are added and existing pages change. How frequently should you download a given page? (In a different universe, the Web could be set up using a "push" method, where servers sent an announcement to Google whenever an indexed page changed. That would presumably make the whole thing more efficient. However, that is not how the Web works.)
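
One simple heuristic, shown here only as an illustration and not as Google's actual policy: revisit a page at an interval tied to how often it has been observed to change, shortening the interval when the page changed since the last crawl and lengthening it when it did not. The factors and bounds below are arbitrary choices.

    def next_refresh_interval(prev_interval, changed,
                              min_interval=3600, max_interval=30 * 86400):
        """Adaptive revisit policy: halve the interval if the page changed
        since the last crawl, double it if it did not, within fixed bounds."""
        interval = prev_interval / 2 if changed else prev_interval * 2
        return max(min_interval, min(max_interval, interval))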

What not to index

Other sources of URLs

gmail, "Previous page information" from browser, whatever Google can get. At this point, the crawler is perhaps best viewed as just one arm of Google's vast data collection process.