Lecture 7: Specialized Search Engines
Advantages of a specialized search engine derive from:
- Databases: Generate dynamic pages from a private database.
- Local search engines: Deliver static pages at the same web site
using a search engine interface.
- Special topic search engines: Deliver pages of specified types or topics
from many sites.
- Predictable format.
- Restricted subject.
- Restricted user pool.
- Smaller (100,000 or a million documents vs. billions of documents)
- Focussed crawling
- Deeper crawling within a site
- Use local search engine interface.
- Content supplier reports (e.g. bookfinder.com, Froogle)
- Lose serendipity (a relevant page may be far from where you would expect it).
- Document individuation
- Conflation of "the same" document
- Indexing subdocuments within a page (e.g. individual articles)
- Query language
- Specialized fields
- Specialized input form
- Special symbols, terminology (e.g. chemical formula)
- Alternative medium (e.g. audio)
- Expert-level language (e.g. for reference librarians).
- Ranking criteria. E.g. authority of the web site; user evaluation;
citations other than hyperlinks (CiteSeer); etc.
- Domain-specific search term relations.
- Even standard IR measures such as IDF change simply as a result of
narrowing the collection; this allows greater precision.
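The IDF point can be illustrated with a toy computation (all data hypothetical): a term that is rare, and hence highly weighted, on the broad web becomes common, and hence lightly weighted, inside a narrow collection, shifting discriminative power to other terms.

```python
import math

def idf(term, docs):
    """Inverse document frequency: log(N / df) over a list of token sets."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df) if df else 0.0

# Toy corpora (hypothetical): a broad web sample vs. a medical-only subset.
broad = [{"patient", "dosage"}, {"football"}, {"stocks"}, {"recipes"},
         {"patient", "trial"}, {"travel"}, {"weather"}, {"movies"}]
medical = [{"patient", "dosage"}, {"patient", "trial"},
           {"patient", "symptoms"}, {"dosage", "trial"}]

# "patient" is rare in the broad sample, so it looks highly discriminative
# there; in the narrow medical collection it is common, its IDF drops, and
# terms like "symptoms" carry relatively more weight.
print(idf("patient", broad))    # high: appears in 2 of 8 docs
print(idf("patient", medical))  # low: appears in 3 of 4 docs
```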
- Relations between pages and data collection
- Postprocessing. E.g. Wolfram Alpha (to be discussed in a later lecture).
Collective business site
Collective business web sites of all kinds (bookfinder.com, expedia.com,
Google Shopping, etc.)
Merchant sends database (and thereafter updates) in a uniform format
to site. Site collates, enables a uniform search engine.
Similarly for engines for restricted categories of merchandise.
(Cars, real estate, books, etc.) Restricting the categories
enables domain-specific query attributes, drop-down value lists,
appropriate presentation and interaction, greater precision.
More of a database
than a search engine, but does support keyword matching.
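A minimal sketch of this collation model, using a hypothetical uniform schema and an in-memory catalog (merchant names and fields are invented for illustration):

```python
# Each merchant submits records in a uniform schema; the site collates them
# and supports domain-specific attribute filters plus keyword matching.
from dataclasses import dataclass

@dataclass
class Listing:
    merchant: str
    title: str
    category: str   # restricted categories enable drop-down value lists
    price: float

catalog = []

def submit_feed(merchant, records):
    """Merchant sends its database (and thereafter updates) in the uniform format."""
    catalog.extend(Listing(merchant, *r) for r in records)

def search(keyword="", category=None, max_price=None):
    """Database-style attribute filters combined with keyword matching."""
    return [x for x in catalog
            if keyword.lower() in x.title.lower()
            and (category is None or x.category == category)
            and (max_price is None or x.price <= max_price)]

submit_feed("acme-books", [("Moby Dick", "books", 9.50),
                           ("Used Sedan", "cars", 4200.0)])
submit_feed("beta-books", [("Moby Dick (hardcover)", "books", 14.00)])
print(search(keyword="moby", category="books", max_price=10))
```

Restricting the category is what makes the attribute filters (`category`, `max_price`) meaningful; a general web engine has no such uniform schema to filter on.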
CiteSeer
Collects CS research papers; structures them by citation. Precursor to
Google Scholar.
Uses a search engine with keywords like "publications" or "papers" as a
starting point for the crawl. Also uses known online journals and proceedings.
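The crawl strategy can be sketched as a best-first crawler that prefers links whose anchor text contains cue words like "publications" or "papers". The link graph, page names, and scoring below are toy assumptions, not CiteSeer's actual code:

```python
import heapq

# page -> list of (anchor_text, target_page); a stand-in for fetched HTML.
WEB = {
    "dept-home":   [("Faculty", "faculty"), ("Publications", "pubs")],
    "faculty":     [("Papers", "prof-papers"), ("Teaching", "teaching")],
    "pubs":        [("TR-01.pdf", "tr01"), ("TR-02.pdf", "tr02")],
    "prof-papers": [("draft.pdf", "draft")],
    "teaching":    [],
    "tr01": [], "tr02": [], "draft": [],
}

CUE_WORDS = ("publication", "paper")   # anchor-text cues for relevance

def score(anchor):
    """Lower is better: promising anchors jump the queue."""
    return -1 if any(w in anchor.lower() for w in CUE_WORDS) else 0

def focused_crawl(seed, budget=6):
    """Best-first crawl: visit pages reached via promising anchors first."""
    frontier = [(0, seed)]
    seen, order = set(), []
    while frontier and len(order) < budget:
        _, page = heapq.heappop(frontier)
        if page in seen:
            continue
        seen.add(page)
        order.append(page)
        for anchor, target in WEB.get(page, []):
            heapq.heappush(frontier, (score(anchor), target))
    return order

print(focused_crawl("dept-home"))
```

Note how the "Publications" page is crawled before the "Teaching" page, even though both are one hop from the seed: the crawl budget is spent where papers are likely to be.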
Single document processing
Converts Postscript and PDF to plain text; translates to various formats.
Extracts index terms.
Extracts fields: Author, Title, Date, Pub, Abstract.
Identifies bibliography and references.
Identifies citations in text.
Locates author home pages.
Matches references with documents. Note: wide variance in the form of
references, plus typos and errors.
Identifies common references to external documents (i.e. not online).
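One way to group variant reference strings that denote the same paper is aggressive normalization; this is a simplification of what any citation indexer must do, and the particular normalization rules and sample citations below are assumptions:

```python
import re
from collections import defaultdict

def norm(ref):
    """Lowercase, strip punctuation and common noise tokens, build an
    order-insensitive token key so reformatted citations collide."""
    ref = ref.lower()
    ref = re.sub(r"[^a-z0-9 ]", " ", ref)                  # drop punctuation
    ref = re.sub(r"\b(vol|no|pp|pages|and)\b", " ", ref)   # drop noise tokens
    return " ".join(sorted(set(ref.split())))

refs = [
    "Smith, J. and Lee, K. Parsing the Web. JAIR, vol. 12, pp. 1-20.",
    "J. Smith, K. Lee: Parsing the Web, JAIR 12, 1-20",
    "Jones, A. Another Paper. CACM 1999.",
]

groups = defaultdict(list)
for r in refs:
    groups[norm(r)].append(r)

# The two variant citations of Smith & Lee collapse into one group.
print(len(groups))  # → 2
```

Real systems need far more (author-name matching, edit distance for typos), but the key idea is mapping each messy string to a canonical key before grouping.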
Textual similarity between documents.
Co-citation similarity between documents.
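Co-citation similarity can be sketched as cosine similarity over "cited-by" sets: two papers are similar to the extent that other papers cite both of them. The data is a toy assumption, and CiteSeer's actual measure may differ:

```python
import math

# paper -> set of papers that cite it (hypothetical IDs)
cited_by = {
    "A": {"p1", "p2", "p3"},
    "B": {"p2", "p3", "p4"},
    "C": {"p5"},
}

def cocitation_sim(x, y):
    """Cosine similarity of the citing sets of papers x and y."""
    cx, cy = cited_by[x], cited_by[y]
    if not cx or not cy:
        return 0.0
    return len(cx & cy) / math.sqrt(len(cx) * len(cy))

print(cocitation_sim("A", "B"))  # 2 common citers out of 3 each
print(cocitation_sim("A", "C"))  # no common citers
```

Unlike textual similarity, this works even when two papers share little vocabulary, since the signal comes from how the community cites them.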
- Quotation from the document with a snippet.
- Results ordered by decreasing number of citations.
- For each paper: title, date, authors, abstract; citations to the paper;
similar documents at the sentence level; bibliography; similar documents
based on text; related documents from co-citation; histogram by year of
the number of citing articles.
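Ranking by citation count and the per-year histogram of citing articles can be sketched as follows (toy data, hypothetical paper IDs):

```python
from collections import Counter

# paper -> publication years of its citing articles
citing_years = {
    "A": [1996, 1997, 1997, 1998],
    "B": [1998],
    "C": [1996, 1997],
}

# Order papers by decreasing number of citations.
ranked = sorted(citing_years, key=lambda p: len(citing_years[p]), reverse=True)
print(ranked)  # A first (4 citations)

# Histogram by year for the top result, as shown on a result page.
print(Counter(citing_years[ranked[0]]))
```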
Other Specialized Search
- Medical search.
Leverage (my conjecture): Limited set of authoritative sources. Limited
class of documents. Limited terms.
- Chemical search.
Indexes by chemical name, formula. The issue here is the specialized nature of
terms and term matching.
- Searching for tables. Discussed below.
- Multimedia search.
Search by image, by hum etc. Will discuss in a later lecture.
- Software Search. Will discuss in a later lecture.
- Language-specific search. Will discuss in a later lecture.
Not clear what leverage you get over Google, except (a) possibly user
interface; (b) with obscure languages or character sets (e.g. Sumerian
cuneiform) Google may not be willing to make the investment.