Lecture 10. Media: Images and Music

(I can't find any articles about video web retrieval I really liked.)

All media


Query language

Pervasive terminology: The "semantic gap", i.e. the difference between the media features that are easy to compute from the content and the features of interest to the user.

Very broadly speaking, the state of the art: Use signal processing theory to define local features in the image/waveform (e.g. edgelets). Use high-powered supervised classification ML techniques to build and tune a classifier for the semantic features of interest. Evaluate.
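As a toy illustration of this pipeline (not any specific published system), the sketch below computes trivially simple local features (patch means, standing in for real descriptors such as edgelets) and classifies with nearest centroids; both the feature and the classifier here are placeholder assumptions:

```python
# Toy sketch of the generic pipeline: compute simple local features,
# then train a classifier on them. The feature (patch mean intensity)
# and classifier (nearest centroid) are stand-ins for real choices
# like edgelet descriptors and SVMs.
import random

def patch_features(image, size=4):
    """Split a square grayscale image (list of rows of ints) into
    size x size patches and return the mean intensity of each patch."""
    n = len(image)
    feats = []
    for r in range(0, n, size):
        for c in range(0, n, size):
            vals = [image[i][j] for i in range(r, r + size)
                                for j in range(c, c + size)]
            feats.append(sum(vals) / len(vals))
    return feats

def train_centroids(examples):
    """examples: list of (feature_vector, label). Return label -> centroid."""
    sums, counts = {}, {}
    for feats, label in examples:
        acc = sums.setdefault(label, [0.0] * len(feats))
        for i, f in enumerate(feats):
            acc[i] += f
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

def classify(feats, centroids):
    """Assign the label of the nearest centroid (squared distance)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda lab: dist(feats, centroids[lab]))
```

The "evaluate" step would then run `classify` over a held-out labelled set and measure accuracy.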

Web Images

Highly Recommended

80 Million Tiny Images: A Large Dataset for Non-parametric Object and Scene Recognition A. Torralba, R. Fergus, and W. Freeman

Survey articles

Image Retrieval: Ideas, Influences, and Trends of the New Age R. Datta et al., ACM Computing Surveys 40:2, April 2008. Recent, exhaustive, and dull.

Image Retrieval from the World Wide Web: Issues, Techniques, and Systems M.L. Kherfi, D. Ziou, and A. Bernardi ACM Computing Surveys 36:1, 2004. Much more readable.

General issues

Categorize images

This can largely be done reasonably accurately on the basis of easily determined image characteristics. E.g. trademarks and the like tend to have regions of simple structure, uniform color, and high contrast. Icons are small, almost by definition. There are reasonably accurate filters for nude photographs based on color distribution and shapes. Etc.

Duplicate images or image parts

Efficient Near-Duplicate Detection and Sub-Image Retrieval Yan Ke, Rahul Sukthankar, Larry Huston, ACM Intl Conf Multimedia 2004.

Find duplicate images, given changes in format, resolution, cropping, merging, geometric transformation.

Method: Compute transformation-invariant image features of subregions of the image. Use "locality sensitive hashing" for approximate similarity retrieval.
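The indexing idea can be sketched with random-hyperplane LSH, one standard locality-sensitive family (for cosine similarity). The paper's actual features and hash family differ, so treat this as an illustrative assumption:

```python
# Minimal locality-sensitive hashing sketch: random hyperplanes give
# each feature vector a short bit signature, and near-duplicates tend
# to land in the same bucket. Not the paper's exact scheme.
import random

def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplanes through the origin."""
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

def lsh_key(vec, planes):
    """One bit per hyperplane: which side of the plane the vector is on."""
    return tuple(1 if sum(p * v for p, v in zip(plane, vec)) >= 0 else 0
                 for plane in planes)

def build_index(vectors, planes):
    """Bucket named feature vectors by signature; colliding buckets
    are candidate near-duplicates to verify exactly."""
    index = {}
    for name, vec in vectors.items():
        index.setdefault(lsh_key(vec, planes), []).append(name)
    return index
```

Retrieval then only has to compare the query against vectors in its own bucket (or a few probed buckets), rather than the whole collection.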

Text-based query

Associate text with image:

Keywords for a given image file from:

Weight keywords by the "closeness" of the text to the image, and by text characteristics (e.g. font size).

Google generally does well at this, but fairly often makes mistakes that can only be understood by looking at the embedding page.

Recent Work

Recent work on web images tends to be characterized by



Matching Words and Pictures K. Barnard et al. Journal of Machine Learning Research 2003.

User studies show a large disparity between user needs and what technology supplies (Armitage and Enser 1997; Enser 1993, 1995). This work makes hair-raising reading --- an example is a request to a stock photo library for "Pretty girl doing something active, sporty in a summery setting, beach -- not wearing lycra, exercise clothes -- more relaxed in tee-shirt. Feature is about deodorant, so girl should look active -- not sweaty, but happy, healthy, carefree -- nothing too posed or set up -- nice and natural looking."
Cite various studies of requests to image collections.

80 Million Tiny Images: A Large Dataset for Non-parametric Object and Scene Recognition A. Torralba, R. Fergus, and W. Freeman

PowerPoint slides

With overwhelming amounts of data, many problems can be solved without the need for sophisticated algorithms.

32 x 32 color pictures are generally recognizable; lower resolutions do not work. Each image is a vector of 3072 dimensions (1024 pixels x 3 colors), i.e. 3072 bytes per image.

General idea: Collect from the web a vast collection of annotated images, and use nearest neighbors to classify.

Data Set Collection

Nearest neighbors:

D(I1,I2) = sum over x,y,c of [I1(x,y,c) - I2(x,y,c)]^2

DWarp(I1,I2) = minimum [over transformations T] of D(I1,T(I2)), where T is a combination of translation, scaling, and horizontal mirroring.

DShift(I1,I2) further allows shifting individual pixels by up to 5 pixels in X and Y.
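A minimal sketch of these distances in code, with images as nested lists indexed [x][y][c]. For simplicity, the shift-tolerant variant below applies one global shift rather than the per-pixel shifts the slides describe:

```python
# Pixelwise distances between tiny color images, image[x][y][c].
def ssd(im1, im2):
    """D(I1,I2) = sum over x, y, c of (I1(x,y,c) - I2(x,y,c))^2."""
    return sum((a - b) ** 2
               for row1, row2 in zip(im1, im2)
               for px1, px2 in zip(row1, row2)
               for a, b in zip(px1, px2))

def shifted_ssd(im1, im2, max_shift=5):
    """Minimize the distance over global shifts of im2 by up to
    max_shift pixels in x and y (a simplification of the paper's
    per-pixel shifts); out-of-range pixels are skipped."""
    n = len(im1)
    best = None
    for dx in range(-max_shift, max_shift + 1):
        for dy in range(-max_shift, max_shift + 1):
            d = 0
            for x in range(n):
                for y in range(n):
                    x2, y2 = x + dx, y + dy
                    if 0 <= x2 < n and 0 <= y2 < n:
                        d += sum((a - b) ** 2
                                 for a, b in zip(im1[x][y], im2[x2][y2]))
            if best is None or d < best:
                best = d
    return best
```

On real 32x32 images one would also normalize for the number of overlapping pixels; that detail is omitted here.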

Use of WordNet
Convert WordNet into a tree of terms by extracting the most common meaning of each word and using the hypernym (supercategory) relationship. Then, when searching for a category, you can include all words that are subcategories; e.g. if looking for "person", include "artist", "politician", "kid", etc.

Annotation. Collect the nearest neighbors; each image "votes" for its label plus all supercategories.
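The voting step can be sketched as follows, with a small hand-built hypernym table standing in for the tree extracted from WordNet:

```python
# Label voting with supercategories. The hypernym table below is a
# toy stand-in for the tree the slides derive from WordNet.
HYPERNYM = {  # child -> parent
    "artist": "person", "politician": "person", "kid": "person",
    "person": "organism", "dog": "organism",
}

def ancestors(label):
    """The label itself plus all of its supercategories, bottom up."""
    chain = [label]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain

def vote(neighbor_labels):
    """Tally votes from nearest-neighbor labels, each propagated
    upward to all supercategories."""
    counts = {}
    for lab in neighbor_labels:
        for a in ancestors(lab):
            counts[a] = counts.get(a, 0) + 1
    return counts
```

So neighbors labelled "artist" and "kid" jointly produce two votes for "person" even though neither used that word.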

Person detection
Find the 80 nearest neighbors and see how many are labelled "person" or (more usually) a subcategory. Note: this works better for pictures in which the person is large, (a) because they are easier to match and (b) because the label is more likely to refer to the person.

Person localization Extract multiple crops of the picture, renormalize each to 32x32, and see which crops match.

Scene recognition Collect votes among nearest neighbors for subcategory of "location" (e.g. "landscape", "workplace", "city" etc.)

Image colorization Given a grey scale image, find nearest neighbors in grey, apply average color.
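A toy version of this colorization-by-matching idea; the pixel format and the plain-average grey conversion are simplifying assumptions (the slides' version averages colors over several neighbors rather than copying one):

```python
# Colorization by nearest neighbor in grayscale: match the grey query
# against greyed database images and copy the best match's colors.
def to_grey(image):
    """Average the three channels of each pixel (a crude grey)."""
    return [[sum(px) / 3.0 for px in row] for row in image]

def grey_dist(g1, g2):
    return sum((a - b) ** 2
               for r1, r2 in zip(g1, g2) for a, b in zip(r1, r2))

def colorize(grey_query, color_db):
    """Return the colors of the database image whose grey version is
    nearest to the grey query."""
    return min(color_db, key=lambda im: grey_dist(grey_query, to_grey(im)))
```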

Image orientation Try all rotations, find the orientation with the best match.


Indexing points in a high-dimensional space, so as to find near neighbors of a dense (non-sparse) query. Various techniques.

Training image classifiers from images collected off the web

Fergus et al.
As many as 85% of the returned images may be visually unrelated to the intended category, perhaps arising from polysemes (e.g. "iris" can be iris-flower, iris-eye, Iris-Murdoch). Even the 15% subset which do correspond to the category are substantially more demanding than images in typical training sets --- the number of objects in each image is unknown and variable and the pose (visual aspect) and scale are uncontrolled.

Animals on the Web T. Berg and D. Forsyth

Animal images are particularly hard to identify (a) because they can adopt multiple poses, and are often seen from odd angles (b) because they have evolved to be camouflaged.

Learning Object Categories from Google's Image Search R. Fergus et al.

Harvesting Image Databases from the Web F. Schroff, A Criminisi, A. Zisserman

18 categories: Airplane, beaver, bike, boat, camel, car, dolphin, elephant, giraffe, guitar, horse, kangaroo, motorbike, penguin, shark, tiger, wrist watch, zebra.

Compare three downloading methods:

Filter out non-photographs based on image characteristics. Overall precision goes from 29% to 35%; the number of in-class examples drops from 13,000 to 10,000. (This varies considerably across categories.)

Rank images in each category using surrounding text plus meta-data. Naive Bayes on various text features (file name, word within 10 of image link etc.)
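A hand-rolled sketch of this kind of naive Bayes ranking over binary text features; the feature names and probabilities below are invented for illustration, not taken from the paper:

```python
# Naive Bayes ranking of candidate images by binary text features.
import math

# P(feature present | in-class) and P(feature present | background).
# These numbers are made up; in the paper they come from labelled data.
FEATURES = {
    "word_in_filename": (0.6, 0.1),
    "word_near_link":   (0.7, 0.3),
    "word_in_alt_text": (0.5, 0.05),
}

def nb_score(present):
    """Log-odds of being in-class, given the set of features present,
    treating features as conditionally independent (the NB assumption)."""
    score = 0.0
    for feat, (p_in, p_bg) in FEATURES.items():
        if feat in present:
            score += math.log(p_in / p_bg)
        else:
            score += math.log((1 - p_in) / (1 - p_bg))
    return score

def rank(images):
    """images: dict name -> set of present features. Best first."""
    return sorted(images, key=lambda n: nb_score(images[n]), reverse=True)
```

The top-ranked images then serve as (noisy) positive training data for the visual classifier in the next step.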

Train on visual features (similar to Fergus') using SVM.

Results: at 15% recall, overall precision is 86%.

Scene Completion using Millions of Photographs James Hays and Alexei Efros

Note the 4th example in figure 6, where the algorithm has actually removed the scaffolding.

Evaluation: Subjects judged doctored photos to be real 37% of the time. Note, however, that subjects judged real photos to be real only 87% of the time. 34% of doctored photos were marked as fake within 10 seconds (as opposed to 3% of real photos).

A survey of browsing models for content based image retrieval Daniel Heesch Multimedia Tools and Applications 40:2, 2008, 261-284.

Advantages to browsing:


However you do this, you are limited by the semantic gap: the features that the program can detect are not the ones the user is interested in.

Music retrieval

Music Retrieval: A Tutorial and Review, Nicola Orio, Foundations and Trends in Information Retrieval 1:1, 2006. Elementary introduction.

Five Approaches to Collecting Tags for Music Douglas Turnbull, Luke Barrington, and Gert Lanckriet, ISMIR 2008.

"Cold start" problem: An item that is not annotated cannot be retrieved.

"Strong labelling": each item is labelled with each feature. If a feature is missing, the item reliably lacks it. "Weak labelling": missing labels are nulls.

"Popularity bias": Popular items (the "short head") are more thoroughly annotated than unpopular ones (the "long tail").

Largely applies to any kind of tagging of media.

Approach        Strengths                               Weaknesses
Survey          custom-tailored vocabulary              small, predetermined vocabulary
                high-quality annotations                human-labor intensive
                strong labelling                        unscalable
Social tags     collective wisdom of crowds             must create and maintain a popular social network
                open vocabulary                         ad-hoc annotation, weak labelling
                provides social context (?)             cold start, popularity bias
Game            collective wisdom of crowds             "gaming" the system
                incentive for high-quality annotation   difficult to create a successful game
                fast-paced, rapid data collection       players listen to clips rather than whole songs
Web             large publicly available corpus         noisy annotations
                no direct human involvement             missing long tail
                provides social context                 weak labelling
Content-based   no cold start or popularity bias        computationally difficult
                no direct human involvement             limited by training data
                strong labelling                        uses solely audio content


pandora.com . Tracks are annotated by human experts; they estimate 20-30 minutes of human time per track. (Incidentally, there is an interesting long list of things their licensing agreement does not allow them to do.)

Note: Gaming as a method of collecting annotation tags was invented by Luis von Ahn in the ESP game for labelling images. This was extremely successful, and von Ahn got a job at CMU, a MacArthur fellowship, etc. As far as I know, no subsequent game for collecting annotation tags has matched that level of success.

Content-Based Music Information Retrieval: Current Directions and Future Challenges M.A. Casey et al., Proceedings of the IEEE 96:4, 2008.

Tasks and applications :

Low-level features: See Casey et al., pp. 672-674. Most are very technical. One that is interesting is "onset detection", i.e. marking when a note begins. One would think this would be obvious in the audio signal, but apparently not.

High-level features: Timbre, Melody, Bass, Rhythm, Pitch, Harmony (chord sequence extraction), Key, Structure (segmentation), Lyrics. The analysis of non-Western music introduces a substantially different collection of issues.

Query by Humming: A Survey Eugene Weinstein. People actually hum very badly.


Significant features:
  • Repeated segments
  • Singing (identified by alternation between high-frequency consonants and low-frequency vowels)

Automated extraction of music snippets Lie Lu and Hong-Jiang Zhang, Multimedia '03.

Automatic Generation of Music Thumbnails, Tony Zhang and R. Samadani, 2007 IEEE Intl. Conf. on Multimedia and Expo.