Pervasive terminology: the "semantic gap" --- the difference between media features that are easy to compute from the content and the features of interest to the user.
Very broadly speaking, the state of the art: Use signal processing theory to define local features in the image/waveform (e.g. edgelets). Use high-powered supervised classification ML techniques to build and tune a classifier for the semantic features of interest. Evaluate.
Image Retrieval from the World Wide Web: Issues, Techniques, and Systems M.L. Kherfi, D. Ziou, and A. Bernardi ACM Computing Surveys 36:1, 2004. Much more readable.
Find duplicate images, given changes in format, resolution, cropping, merging, geometric transformation.
Method: Compute transformation-invariant image features of subregions of the image. Use "locality sensitive hashing" for approximate similarity retrieval.
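A minimal sketch of locality-sensitive hashing using random hyperplanes, one common LSH family; the actual system's feature extraction and hash choices differ, and the 128-dimensional features and image names here are made up for illustration. Near-duplicate images get nearly identical features, so they tend to hash to the same bucket and can be retrieved without comparing against the whole database.

```python
import numpy as np

def lsh_signature(feature, hyperplanes):
    """Hash a feature vector to a short binary signature: one bit
    per random hyperplane, set if the feature lies on its + side."""
    return tuple((hyperplanes @ feature) > 0)

rng = np.random.default_rng(0)
dim, n_bits = 128, 16
hyperplanes = rng.normal(size=(n_bits, dim))

# Index: bucket images by signature; near-duplicates tend to collide.
database = {"img_a": rng.normal(size=dim)}
# A slightly perturbed copy stands in for a re-encoded duplicate.
database["img_a_copy"] = database["img_a"] + 0.001 * rng.normal(size=dim)

index = {}
for name, feat in database.items():
    index.setdefault(lsh_signature(feat, hyperplanes), []).append(name)

query = database["img_a_copy"]
candidates = index.get(lsh_signature(query, hyperplanes), [])
print(sorted(candidates))  # the near-duplicate usually lands in the same bucket
```

More hyperplane bits make buckets more selective but raise the chance that a true near-duplicate lands one bit away; real systems use several hash tables to recover that recall.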
Keywords for a given image file from:
Google generally does well with this, but reasonably often makes mistakes that can only be understood if you look at the embedding page.
Matching Words and Pictures K. Barnard et al. Journal of Machine Learning Research 2003.
User studies show a large disparity between user needs and what technology supplies (Armitage and Enser 1997; Enser 1993, 1995). This work makes hair-raising reading --- an example is a request to a stock photo library for "Pretty girl doing something active, sporty in a summery setting, beach -- not wearing lycra, exercise clothes -- more relaxed in tee-shirt. Feature is about deodorant, so girl should look active -- not sweaty, but happy, healthy, carefree -- nothing too posed or set up -- nice and natural looking." Cite various studies of requests to image collections.
80 Million Tiny Images: A Large Dataset for Non-parametric Object and Scene Recognition A. Torralba, R. Fergus, and W. Freeman
With overwhelming amounts of data, many problems can be solved without the need for sophisticated algorithms.
32 x 32 color pictures are generally recognizable; lower resolutions are not. Each image is a vector of 3072 dimensions (1024 pixels x 3 color channels) = 3072 bytes per image.
General idea: Collect from the web a vast collection of annotated images, and use nearest neighbors to classify.
Data Set Collection
D(I1,I2) = sum_{x,y,c} [I1(x,y,c) - I2(x,y,c)]^2
D_warp(I1,I2) = min [over transformations T] D(I1,T(I2)), where T is a combination of translation, scaling, and horizontal mirroring.
D_shift(I1,I2) further allows each individual pixel to shift in X and Y by up to 5 pixels.
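The distances above can be sketched as follows. Note that `d_shift` here simplifies the paper's per-pixel shifts to whole-image translations, so it illustrates the idea rather than reproducing the exact metric; it also compares means rather than sums so that smaller overlap regions are not unfairly favored.

```python
import numpy as np

def d_raw(i1, i2):
    """Sum of squared differences over all pixels and color channels."""
    return np.sum((i1.astype(float) - i2.astype(float)) ** 2)

def d_shift(i1, i2, max_shift=5):
    """Best mean squared difference when i2 may translate by up to
    max_shift pixels in x and y (a whole-image simplification of the
    paper's independent per-pixel shifts)."""
    h, w, _ = i1.shape
    best = np.inf
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            # Overlapping regions of the two images under this shift.
            a = i1[max(0, dy):h + min(0, dy), max(0, dx):w + min(0, dx)]
            b = i2[max(0, -dy):h + min(0, -dy), max(0, -dx):w + min(0, -dx)]
            best = min(best, np.mean((a.astype(float) - b.astype(float)) ** 2))
    return best

img = np.random.default_rng(1).random((32, 32, 3))
print(d_shift(img, np.roll(img, 2, axis=0)))  # 0.0: the 2-pixel shift is recovered
```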
Use of Wordnet
Convert wordnet into a tree of terms by extracting the most common meaning of all the words, and using the hypernym (supercategory) relationship. Then when searching for a category, you can include all words that are subcategories; e.g. if looking for person, include "artist", "politician", "kid" etc.
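A toy version of the subcategory expansion, using a small hand-built hypernym table as a hypothetical stand-in for the tree extracted from WordNet:

```python
# Toy hypernym tree: each term maps to its supercategory (hypernym).
HYPERNYM = {
    "artist": "person",
    "politician": "person",
    "kid": "person",
    "person": "entity",
    "landscape": "location",
    "city": "location",
    "location": "entity",
}

def subcategories(term):
    """All terms whose hypernym chain passes through `term`, including
    `term` itself; used to expand a query category downward."""
    out = {term}
    changed = True
    while changed:
        changed = False
        for child, parent in HYPERNYM.items():
            if parent in out and child not in out:
                out.add(child)
                changed = True
    return out

print(sorted(subcategories("person")))
# ['artist', 'kid', 'person', 'politician']
```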
Annotation: Collect the nearest neighbors. Each neighbor "votes" for its own label plus all supercategories.
Find the 80 nearest neighbors and see how many are labelled "person" or (more usually) a subcategory. Note: Works better for pictures where the person is large, (a) because the image is easier to match, (b) because the label is more likely to refer to the person.
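The voting scheme can be sketched as below; the hypernym chains are hypothetical stand-ins for the WordNet-derived tree, and real neighbor lists would come from the pixel-distance search.

```python
from collections import Counter

# Hypothetical hypernym chains for a few labels.
HYPERNYM = {"artist": "person", "kid": "person", "dog": "animal"}

def votes(neighbor_labels):
    """Each nearest neighbor votes for its own label plus every
    supercategory on its hypernym chain."""
    tally = Counter()
    for label in neighbor_labels:
        while label is not None:
            tally[label] += 1
            label = HYPERNYM.get(label)   # walk up the chain
    return tally

tally = votes(["artist", "kid", "dog"])
print(tally["person"])  # 2: both "artist" and "kid" count toward "person"
```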
Person localization: Extract multiple crops of the picture, renormalize to 32x32, see which crops match.
Scene recognition: Collect votes among nearest neighbors for the subcategory of "location" (e.g. "landscape", "workplace", "city" etc.)
Image colorization: Given a grey scale image, find nearest neighbors in grey, apply average color.
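A rough sketch of nearest-neighbor colorization, under two simplifying assumptions not in the paper: grey level is taken as the channel mean, and a single neighbor is used where the paper averages color over several.

```python
import numpy as np

def colorize(grey_query, color_database):
    """Find the database image whose grey version is closest to the
    query, then transfer its per-pixel chromaticity onto the query's
    own intensities."""
    def grey(img):
        return img.mean(axis=2)
    nn = min(color_database,
             key=lambda img: np.sum((grey(img) - grey_query) ** 2))
    # Color expressed relative to intensity, then re-applied to the query.
    chroma = nn / (grey(nn)[..., None] + 1e-8)
    return chroma * grey_query[..., None]
```

With a database this small the "nearest neighbor" is meaningless; the point of the 80-million-image dataset is that some neighbor is usually a genuinely similar scene.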
Image orientation: Try all rotations, find the orientation with the best match.
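Assuming only the four 90-degree orientations (a simplification for illustration), the matching step might look like:

```python
import numpy as np

def best_orientation(query, reference):
    """Try the four quarter-turn rotations of the query and keep the
    one closest (sum of squared differences) to a matched reference."""
    def dist(a, b):
        return np.sum((a - b) ** 2)
    rotations = [np.rot90(query, k) for k in range(4)]
    return min(range(4), key=lambda k: dist(rotations[k], reference))

img = np.random.default_rng(2).random((32, 32, 3))
print(best_orientation(np.rot90(img, 1), img))
# 3: three more quarter-turns undo the initial rotation
```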
As many as 85% of the returned images may be visually unrelated to the intended category, perhaps arising from polysemes (e.g. "iris" can be iris-flower, iris-eye, Iris-Murdoch). Even the 15% subset which do correspond to the category are substantially more demanding than images in typical training sets --- the number of objects in each image is unknown and variable and the pose (visual aspect) and scale are uncontrolled.
Animals on the Web T. Berg and D. Forsyth
Animal images are particularly hard to identify (a) because they can adopt multiple poses, and are often seen from odd angles (b) because they have evolved to be camouflaged.
Learning Object Categories from Google's Image Search R. Fergus et al.
Harvesting Image Databases from the Web F. Schroff, A. Criminisi, and A. Zisserman
18 categories: Airplane, beaver, bike, boat, camel, car, dolphin, elephant, giraffe, guitar, horse, kangaroo, motorbike, penguin, shark, tiger, wrist watch, zebra.
Compare three downloading methods:
Filter out non-photographs based on image characteristics. Overall precision goes from 29% to 35%, number of in-class examples goes from 13,000 to 10,000. (Varies considerably across categories.)
Rank images in each category using surrounding text plus meta-data. Naive Bayes on various text features (file name, word within 10 of image link etc.)
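A minimal Naive-Bayes-style ranker over surrounding-text features. The toy documents and words are made up, and the paper's actual feature set (file name, words within 10 of the image link, etc.) and model details differ; this only shows the log-odds scoring idea.

```python
import math
from collections import Counter

def train_nb(positive_docs, negative_docs):
    """Per-word log-odds with add-one smoothing; each doc is a list of
    text features drawn from around the image."""
    pos, neg = Counter(), Counter()
    for d in positive_docs:
        pos.update(d)
    for d in negative_docs:
        neg.update(d)
    vocab = set(pos) | set(neg)
    n_pos, n_neg = sum(pos.values()), sum(neg.values())
    return {w: math.log((pos[w] + 1) / (n_pos + len(vocab)))
             - math.log((neg[w] + 1) / (n_neg + len(vocab)))
            for w in vocab}

def score(doc, log_odds):
    """Rank an image by the summed log-odds of its surrounding text."""
    return sum(log_odds.get(w, 0.0) for w in doc)

log_odds = train_nb(
    positive_docs=[["penguin", "antarctica", "jpg"], ["penguin", "colony"]],
    negative_docs=[["publisher", "logo", "jpg"], ["banner", "ad"]])
ranked = sorted([["penguin", "ice"], ["banner", "jpg"]],
                key=lambda d: score(d, log_odds), reverse=True)
print(ranked[0])  # ['penguin', 'ice'] ranks first
```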
Train on visual features (similar to Fergus') using SVM.
Results: at 15% recall, overall precision is 86%.
Scene Completion using Millions of Photographs James Hays and Alexei Efros
Evaluation: Subjects evaluated doctored photos as real 37% of the time. Note however that subjects only evaluated real photos as real 87% of the time. 34% of doctored photos marked as fake within 10 seconds (as opposed to 3% of real photos).
A survey of browsing models for content based image retrieval Daniel Heesch Multimedia Tools and Applications 40:2, 2008, 261-284.
Advantages to browsing:
Five Approaches to Collecting Tags for Music Douglas Turnbull, Luke Barrington, and Gert Lanckriet, ISMIR 2008.
"Cold start" problem: An item that is not annotated cannot be retrieved.
"Strong labelling": each item is labelled with each feature. If a feature is missing, the item reliably lacks it. "Weak labelling": missing labels are nulls.
"Popularity bias": Popular items (the "short head") are more thoroughly annotated than unpopular ones (the "long tail").
Largely applies to any kind of tagging of media.
| Approach | Advantages | Disadvantages |
|---|---|---|
| Survey | custom-tailored vocabulary; high-quality annotations | small, predetermined vocabulary; human-labor intensive |
| Social tags | collective wisdom of crowds; open vocabulary; provides social context (?) | must create and maintain a popular social network; ad-hoc annotation, weak labelling; cold start, popularity bias |
| Game | collective wisdom of crowds; incentive for high-quality annotation; fast-paced, rapid data collection | "gaming" the system; difficult to create a successful game; players listen to clips rather than whole songs |
| Web | large publicly available corpus; no direct human involvement; provides social context | noisy annotations; misses the long tail; weak labelling |
| Content based | no cold start or popularity bias; no direct human involvement; strong labelling | computationally difficult; limited by training data; uses solely audio content |
pandora.com. Tracks annotated by human experts; estimated 20-30 minutes of human time per track. (Incidentally, there is an interesting long list of things their licensing agreement does not allow them to do.)
Note: Gaming as a method of collecting annotation tags was invented by Luis von Ahn with the ESP Game for labelling images. This was extremely successful; von Ahn got a job at CMU, a MacArthur fellowship, etc. As far as I know, this level of success has not been achieved by any subsequent game for collecting annotation tags.
Content-Based Music Information Retrieval: Current Directions and Future Challenges M.A. Casey et al., Proceedings of the IEEE 96:4, 2008.
Tasks and applications:
Low-level features: See Casey et al. 672-674. Most are very technical. One interesting one is "onset detection", i.e. marking when a note begins. One would think this would be obvious in the audio signal, but apparently it is not.
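A naive energy-based onset detector hints at why the problem is harder than it looks: this sketch fires on sharp rises in frame energy, which works on a synthetic signal with silence before the note, but real notes often begin while the previous one is still sounding, so practical systems use spectral flux and adaptive thresholding instead. The sample rate and frame size below are arbitrary choices for the demo.

```python
import numpy as np

def onset_strength(signal, frame=512):
    """Frame the signal and return the positive increase in log energy
    between consecutive frames; peaks suggest note onsets."""
    n = len(signal) // frame
    frames = signal[:n * frame].reshape(n, frame)
    log_e = np.log(np.sum(frames ** 2, axis=1) + 1e-10)
    return np.maximum(np.diff(log_e), 0.0)

# Synthetic example: silence, then a 440 Hz tone starting at 0.5 s.
sr = 8000
t = np.arange(sr) / sr
sig = np.where(t >= 0.5, np.sin(2 * np.pi * 440 * t), 0.0)
strength = onset_strength(sig)
print(int(np.argmax(strength)))  # frame index of the jump near the 0.5 s mark
```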
High-level features: Timbre, Melody, Bass, Rhythm, Pitch, Harmony (chord sequence extraction), Key, Structure (segmentation), Lyrics. The analysis of non-Western music introduces a substantially different collection of issues. Query by Humming: A Survey Eugene Weinstein. People actually hum very badly.
Automatic Generation of Music Thumbnails, Tony Zhang and R. Samadani, 2007 IEEE Intl. Conf. on Multimedia and Expo