Lecture 8: Retrieving non-text

Invisible web (conclusion)

Grey line between invisible web and surface web.
If a user makes a query to a search engine and then saves a link to the result URL in a surface web page, a crawler can follow that link, even though the page is dynamically generated.

Pages that are too deep in a site may not be indexed, even if they are high-quality, static pages reachable through standard hyperlinks. E.g. of the 248,706 pages of the "Open Directory", AltaVista indexed only 17,833 (7.2%); Fast and Northern Light indexed substantially fewer.

Retrieving Images

Method 1: Keyword search. Most standard search engines.
Captions, URLs, anchors.

Content-based image retrieval

ImageRover (1997) Spiders crawling for images
Analysis in terms of image features
Analysis carried out for six subimages: whole, center, 4 quadrants
Color and Texture analysis
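
A minimal sketch of the six-subimage split, in Python. The notes do not say how the "center" region is defined, so the choice below (the middle half of the image in each dimension) is an assumption, as is the function name six_subimages.

```python
def six_subimages(width, height):
    """Return (left, top, right, bottom) boxes for whole, center, 4 quadrants."""
    half_w, half_h = width // 2, height // 2
    quarter_w, quarter_h = width // 4, height // 4
    return {
        "whole": (0, 0, width, height),
        # Assumption: the center region is the middle half in each dimension.
        "center": (quarter_w, quarter_h, quarter_w + half_w, quarter_h + half_h),
        "top_left": (0, 0, half_w, half_h),
        "top_right": (half_w, 0, width, half_h),
        "bottom_left": (0, half_h, half_w, height),
        "bottom_right": (half_w, half_h, width, height),
    }
```

Each box would then get its own color and texture analysis.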

Color Analysis

Image mapped to 3-D color space with psychological support.
Each dimension divided into 4, so 64 (=4x4x4) bins.
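
To make the 4x4x4 binning concrete, here is a hedged sketch. It assumes each pixel has already been converted to three color coordinates scaled to the range [0, 1]; the actual color space and scaling used by ImageRover are not given in these notes, and the names color_bin and color_histogram are illustrative.

```python
def color_bin(c1, c2, c3, levels=4):
    """Map a color with coordinates in [0, 1] to one of levels**3 bins."""
    def quantize(x):
        # Clamp so that x == 1.0 falls in the top bin rather than overflowing.
        return min(int(x * levels), levels - 1)
    return (quantize(c1) * levels + quantize(c2)) * levels + quantize(c3)


def color_histogram(pixels, levels=4):
    """Return the fraction of pixels falling in each of the levels**3 bins."""
    hist = [0] * levels ** 3
    for c1, c2, c3 in pixels:
        hist[color_bin(c1, c2, c3, levels)] += 1
    total = len(pixels)
    return [count / total for count in hist] if total else hist
```

With levels=4 the histogram has 64 entries, one per bin.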

Texture Analysis

Texture characterized in terms of 16 parameters, each divided into 4 bins.

Tamura's (1978) visual texture properties (not used in ImageRover, so far as I can tell):

Data structure

Each image is a point in 768-dimensional space (= 6 * (64+64): six subimages, each with 64 color bins and 64 texture bins).
Data structure: approximate k-d tree.
Given distance D, can retrieve all instances closer than D and exclude all instances further than D(1+e) in poly-log time.
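
The guarantee above leaves a band between D and D(1+e) where the structure may go either way. The sketch below is not the k-d tree itself; it is a brute-force checker (hypothetical name satisfies_guarantee, Euclidean distance assumed) that tests whether a returned set of point indices is an acceptable answer under that guarantee.

```python
import math


def satisfies_guarantee(query, points, returned, D, eps):
    """Check a set of returned indices against the approximate range-query contract."""
    returned = set(returned)
    for i, p in enumerate(points):
        d = math.dist(query, p)  # Euclidean distance (an assumption)
        if d < D and i not in returned:
            return False  # a point closer than D was missed
        if d > D * (1 + eps) and i in returned:
            return False  # a point further than D(1+e) was returned
    return True  # points between D and D(1+e) may be in or out
```

For example, with D = 1 and e = 0.1, a point at distance 1.05 may be either returned or omitted; a point at distance 2 must be omitted.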

User interface

Relevance feedback. User shown random images, asked for most similar.
Shown similar images, feedback on most relevant, etc.


Indexes by: keyword, BW/color, image dimension, number of faces, size of largest face.


Extremely unsystematic. Sporadic surprisingly strong results.

Essential Problem

Unlike words, there are no easily computed features of an image that approximate semantic content.

Web Services

Characterize Web services for purpose of
No attempt to do this automatically; extensive, manually written descriptions. (Presumably, business web service providers have both the incentive and the manpower to do this.)
All XML based

Two major directions of research.

Both characterized by immensely detailed and rather abstract standards and remarkably verbose languages.

Web Services Languages


Document contains:

WSIL -- Web Services Inspection Language

Introduction to WSIL
WSIL: Do we need another Web Services Specification? Tarak Modi

Semantic Web

DAML -- DARPA Agent Markup Language

OIL -- Ontology Interchange Language


Built on DAML+OIL.
DAML-S: Web Service Description for the Semantic Web
Language for Semantic Web Services: DAML-S

Service Profile

Service Model -- Process


Process operators: Sequence, Concurrent, Split, Split+Join, Unordered, Choice, If-Then-Else, Repeat-Until, Repeat-While.

Abstraction operators: Collapse (compound process -> blackbox atomic process); Expand (the inverse).
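
As an illustration only, the process and abstraction operators can be modelled as a small tree structure. This is a hypothetical Python rendering, not DAML-S syntax; the class Process and the functions collapse and expand are names invented for this sketch.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

OPERATORS = {
    "Sequence", "Concurrent", "Split", "Split+Join", "Unordered",
    "Choice", "If-Then-Else", "Repeat-Until", "Repeat-While",
}


@dataclass(frozen=True)
class Process:
    name: str
    operator: Optional[str] = None       # None means an atomic process
    parts: Tuple["Process", ...] = ()    # components of a compound process
    hidden: Optional["Process"] = None   # what a collapsed black box stands for

    def __post_init__(self):
        if self.operator is not None and self.operator not in OPERATORS:
            raise ValueError(f"unknown operator: {self.operator}")

    def is_atomic(self):
        return self.operator is None


def collapse(process):
    """Compound process -> black-box atomic process that remembers its contents."""
    if process.is_atomic():
        return process
    return Process(name=process.name, hidden=process)


def expand(process):
    """Inverse of collapse: recover the compound process, if there is one."""
    return process.hidden if process.hidden is not None else process
```

For instance, a "Buy" process built as a Sequence of "Pick" and "Pay" collapses to a single atomic "Buy", and expanding that black box gives back the original sequence.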

Semantic Matching of Web Services Capabilities

by Paolucci et al.

Characterize service advertisement by types of input demanded, type of output provided.
Characterize request by types of input supplied, type of output needed.

Matching criterion:

Major Problem: Only classes, no relations.
E.g. "car" matches "station wagon" OK.
But no way to distinguish a request for "Mother's maiden name" from "My last name".
No representation of relation between input and output.
E.g. if input is "car" and output is "money", representation does not distinguish between price, rental price, lease price, average cost of maintenance per year, average cost of gas per mile ...
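
The major problem can be seen in a toy example. The sketch assumes, purely for illustration, that two types match whenever one is a subclass of the other; this is not a reproduction of the paper's matching criterion. The class hierarchy and service descriptions are hypothetical.

```python
# Toy class hierarchy (hypothetical): child -> parent.
PARENT = {"station wagon": "car", "car": "vehicle"}


def is_subclass(cls, ancestor):
    """True if cls equals ancestor or lies below it in the hierarchy."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = PARENT.get(cls)
    return False


def classes_match(a, b):
    # Assumption for this sketch: match if either class subsumes the other.
    return is_subclass(a, b) or is_subclass(b, a)


def signature_matches(advert, request):
    """Compare only the input and output classes, as class-based matching does."""
    return (classes_match(advert["input"], request["input"])
            and classes_match(advert["output"], request["output"]))


# Two different services with the same typed signature: car -> money.
sale_price_service = {"input": "car", "output": "money"}
rental_price_service = {"input": "car", "output": "money"}
request = {"input": "station wagon", "output": "money"}
```

Both services match the request equally well, because nothing in the description says which relation between the car and the money is being computed.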

Minor problem: requestor required to predict every damn fool piece of information that a service might ask for.