Find duplicate images, given changes in format, resolution, cropping, merging, geometric transformation.
Method: Compute transformation-invariant image features of subregions of the image. Use "locality sensitive hashing" for approximate similarity retrieval.
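The notes do not say which locality-sensitive hashing scheme is meant. As a minimal sketch, assuming each subregion has already been reduced to a real-valued feature vector, random-hyperplane hashing (one common LSH family for cosine similarity) could look like the following. The function names and the toy vectors are hypothetical.

```python
import random

def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random hyperplanes (Gaussian normal vectors) in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

def lsh_signature(vec, hyperplanes):
    """One bit per hyperplane: which side of the hyperplane the vector falls on."""
    return tuple(
        1 if sum(h_i * v_i for h_i, v_i in zip(h, vec)) >= 0 else 0
        for h in hyperplanes
    )

def build_index(features, hyperplanes):
    """Group feature vectors (name -> vector) into buckets keyed by signature."""
    index = {}
    for name, vec in features.items():
        index.setdefault(lsh_signature(vec, hyperplanes), []).append(name)
    return index

def candidates(query, index, hyperplanes):
    """Return the names stored in the query's bucket (approximate matches)."""
    return index.get(lsh_signature(query, hyperplanes), [])
```

Vectors pointing in similar directions tend to share a bucket, so only the bucket's contents need an exact comparison. In practice several independent hash tables are usually combined to reduce missed matches.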
Keywords for a given image file from:
Google generally does well with this, but reasonably often makes mistakes that can only be understood if you look at the embedding page.
Matching Words and Pictures K. Barnard et al.
User studies show a large disparity between user needs and what technology supplies (Armitage and Enser 1997, Enser 1993, 1995). This work makes hair-raising reading --- an example is a request to a stock photo library for "Pretty girl doing something active, sporty in a summery setting, beach -- not wearing lycra, exercise clothes -- more relaxed in tee-shirt. Feature is about deodorant, so girl should look active -- not sweaty, but happy, healthy, carefree -- nothing too posed or set up -- nice and natural looking." The paper cites various studies of requests to image collections.
80 Million Tiny Images: A Large Dataset for Non-parametric Object and Scene Recognition A. Torralba, R. Fergus, and W. Freeman
With overwhelming amounts of data, many problems can be solved without the need for sophisticated algorithms.
32 x 32 color pictures are generally recognizable. Lower resolution does not work. Vector of 3072 dimensions (1024 pixels x 3 colors) = 3072 Bytes per image.
General idea: Collect from the web a vast collection of annotated images, and use nearest neighbors to classify.
Data Set Collection
$D(I_1,I_2) = \sum_{x,y,c} [I_1(x,y,c) - I_2(x,y,c)]^2$.
$D_{Warp}(I_1,I_2) = \min_T D(I_1,T(I_2))$, where the transformation $T$ is a combination of translation, scaling, and horizontal mirror.
$D_{Shift}(I_1,I_2)$ further allows individual pixels to shift in X and Y by 5 pixels.
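The first two distances can be sketched as follows. This is a simplified illustration, not the authors' implementation: images are assumed to be lists of rows of (r, g, b) tuples, the search over transformations covers only horizontal mirroring and small whole-image translations (scaling is left out), and the names `ssd`, `translate`, and `warp_distance` are hypothetical.

```python
def ssd(img1, img2):
    """Raw distance D: sum of squared differences over x, y, and color channel."""
    return sum(
        (a - b) ** 2
        for row1, row2 in zip(img1, img2)
        for px1, px2 in zip(row1, row2)
        for a, b in zip(px1, px2)
    )

def mirror(img):
    """Horizontal mirror: reverse each row."""
    return [row[::-1] for row in img]

def translate(img, dx, dy, fill=(0, 0, 0)):
    """Shift the image by (dx, dy); uncovered pixels get the fill color."""
    h, w = len(img), len(img[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sy, sx = y - dy, x - dx
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = img[sy][sx]
    return out

def warp_distance(img1, img2, max_shift=1):
    """Minimum SSD over mirrored/unmirrored and translated versions of img2."""
    best = None
    for candidate in (img2, mirror(img2)):
        for dy in range(-max_shift, max_shift + 1):
            for dx in range(-max_shift, max_shift + 1):
                d = ssd(img1, translate(candidate, dx, dy))
                best = d if best is None else min(best, d)
    return best
```

Because the identity transformation is among those tried, the warp distance can never exceed the raw distance.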
Use of WordNet
Convert WordNet into a tree of terms by extracting the most common meaning of each word and using the hypernym (supercategory) relationship. Then, when searching for a category, you can include all words that are subcategories; e.g. if looking for "person", include "artist", "politician", "kid", etc.
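A minimal sketch of the query expansion, using a toy hand-written hypernym table rather than the real WordNet data. The table contents (including "painter" and "entity") and the function names are illustrative assumptions.

```python
# Toy tree: each term maps to its single hypernym (supercategory).
HYPERNYM = {
    "artist": "person",
    "politician": "person",
    "kid": "person",
    "painter": "artist",
    "person": "entity",
}

def supercategories(term, hypernym=HYPERNYM):
    """Walk up the tree and return all ancestors, nearest first."""
    chain = []
    node = hypernym.get(term)
    while node is not None:
        chain.append(node)
        node = hypernym.get(node)
    return chain

def expand_query(term, hypernym=HYPERNYM):
    """Return the term plus every term that has it as an ancestor."""
    return {term} | {t for t in hypernym if term in supercategories(t, hypernym)}
```

With this table, `expand_query("person")` gathers "artist", "politician", "kid", and "painter" along with "person" itself.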
Annotation. Collect nearest neighbors. Each image "votes" for its label plus all supercategories.
Find 80 nearest neighbors, see how many are labelled "person" or (more usually) a subcategory of it. Note: This works better for pictures where the person is large (a) because they are easier to match (b) because the label is more likely to refer to the person.
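The voting step can be sketched as below, assuming the nearest neighbors have already been retrieved and each carries a single label. The hypernym table and the name `vote` are hypothetical.

```python
from collections import Counter

def vote(neighbor_labels, hypernym):
    """Each neighbor votes for its own label and every supercategory above it."""
    votes = Counter()
    for label in neighbor_labels:
        node = label
        while node is not None:
            votes[node] += 1
            node = hypernym.get(node)
    return votes

# Toy hypernym table (child -> parent), for illustration only.
toy_hypernym = {
    "artist": "person",
    "kid": "person",
    "person": "entity",
    "car": "vehicle",
    "vehicle": "entity",
}
tally = vote(["artist", "kid", "car"], toy_hypernym)
```

Here "person" collects two votes even though no neighbor is labelled "person" directly, which is why counting subcategory labels matters.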
Person localization Extract multiple crops of the picture, renormalize to 32x32, see which crops match.
Scene recognition Collect votes among nearest neighbors for subcategory of "location" (e.g. "landscape", "workplace", "city" etc.)
Image colorization Given a grey scale image, find nearest neighbors in grey, apply average color.
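A simplified sketch of the colorization idea, assuming images are flat lists of (r, g, b) tuples and that grey level is the plain channel mean. It returns only the averaged neighbor colors; how that average is combined with the query's own grey levels is not covered here. The names `to_grey` and `colorize` are hypothetical.

```python
def to_grey(img):
    """Grey level of each pixel as the mean of its three channels (an assumption)."""
    return [sum(px) / 3.0 for px in img]

def colorize(grey_query, database, k=2):
    """Average the colors of the k database images closest to the query in grey."""
    ranked = sorted(
        database,
        key=lambda img: sum((g - q) ** 2 for g, q in zip(to_grey(img), grey_query)),
    )
    nearest = ranked[:k]
    return [
        tuple(sum(img[i][c] for img in nearest) / len(nearest) for c in range(3))
        for i in range(len(grey_query))
    ]
```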
Image orientation Try all rotations, find the orientation with the best match.
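As a sketch under simplifying assumptions (square greyscale images stored as lists of rows, and only the four 90-degree rotations tried), orientation detection by best match might look like this. The function names are hypothetical.

```python
def rotate90(img):
    """Rotate a square image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def ssd(a, b):
    """Sum of squared pixel differences between two same-sized grey images."""
    return sum((p - q) ** 2 for ra, rb in zip(a, b) for p, q in zip(ra, rb))

def best_orientation(query, database):
    """Return the number of clockwise quarter turns that gives the closest match."""
    best_turns, best_dist = 0, None
    rotated = query
    for turns in range(4):
        dist = min(ssd(rotated, ref) for ref in database)
        if best_dist is None or dist < best_dist:
            best_turns, best_dist = turns, dist
        rotated = rotate90(rotated)
    return best_turns
```

The approach relies on the database containing mostly upright images, so that the correctly oriented query finds closer neighbors than the rotated ones.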
As many as 85% of the returned images may be visually unrelated to the intended category, perhaps arising from polysemes (e.g. "iris" can be iris-flower, iris-eye, Iris-Murdoch). Even the 15% subset which do correspond to the category are substantially more demanding than images in typical training sets --- the number of objects in each image is unknown and variable and the pose (visual aspect) and scale are uncontrolled.
Animals on the Web T. Berg and D. Forsyth
Animal images are particularly hard to identify (a) because they can adopt multiple poses, and are often seen from odd angles (b) because they have evolved to be camouflaged.
Learning Object Categories from Google's Image Search R. Fergus et al.
Harvesting Image Databases from the Web F. Schroff, A. Criminisi, A. Zisserman
18 categories: Airplane, beaver, bike, boat, camel, car, dolphin, elephant, giraffe, guitar, horse, kangaroo, motorbike, penguin, shark, tiger, wrist watch, zebra.
Compare three downloading methods:
Filter out non-photographs based on image characteristics. Overall precision goes from 29% to 35%, number of in-class examples goes from 13,000 to 10,000. (Varies considerably across categories.)
Rank images in each category using surrounding text plus meta-data. Naive Bayes on various text features (file name, words within 10 words of the image link, etc.)
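A minimal sketch of text-based ranking with Naive Bayes, assuming each image's text features have been pooled into one bag of tokens and that a few labelled examples are available. The paper's actual feature set and training procedure are not reproduced; the multinomial model with add-one smoothing and the names `train` and `log_odds` are assumptions.

```python
import math
from collections import Counter

def train(examples):
    """examples: list of (tokens, in_class) pairs. Returns a small model dict."""
    class_counts = Counter()
    word_counts = {True: Counter(), False: Counter()}
    vocab = set()
    for tokens, in_class in examples:
        class_counts[in_class] += 1
        word_counts[in_class].update(tokens)
        vocab.update(tokens)
    return {"classes": class_counts, "words": word_counts, "vocab": vocab}

def log_odds(tokens, model):
    """log P(in-class | tokens) - log P(out-of-class | tokens), add-one smoothed."""
    v = len(model["vocab"])
    score = math.log(model["classes"][True]) - math.log(model["classes"][False])
    for label, sign in ((True, 1.0), (False, -1.0)):
        total = sum(model["words"][label].values())
        for tok in tokens:
            score += sign * math.log((model["words"][label][tok] + 1) / (total + v))
    return score

model = train([
    (["zebra", "safari"], True),
    (["zebra", "stripes"], True),
    (["crossing", "road"], False),
    (["logo", "banner"], False),
])
```

Images can then be sorted by `log_odds` of their tokens, highest first, to produce the text-based ranking.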
Train on visual features (similar to Fergus') using SVM.
Results: At 15% recall getting overall 86% precision.
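For reference, a figure of the form "precision at a given recall" can be computed from a ranked list as sketched below. The function name and the toy ranking are hypothetical and do not reproduce the paper's data.

```python
def precision_at_recall(ranked_relevance, recall_target):
    """Walk down a ranked list of True/False relevance flags and return the
    precision at the first rank where recall reaches recall_target."""
    total_positive = sum(1 for r in ranked_relevance if r)
    if total_positive == 0:
        raise ValueError("no relevant items in the ranking")
    hits = 0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
        if hits / total_positive >= recall_target:
            return hits / rank
    return hits / len(ranked_relevance)

toy_ranking = [True, True, False, True, False]
```

Read this way, the reported result says that the top of the ranking, down to the point where 15% of all in-class images have been recovered, consists of 86% in-class images.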
Scene Completion using Millions of Photographs James Hayes and Alexei Efros
Evaluation: Subjects evaluated doctored photos as real 37% of the time. Note however that subjects only evaluated real photos as real 87% of the time. 34% of doctored photos marked as fake within 10 seconds (as opposed to 3% of real photos).
LabelMe: a database and web-based tool for image annotation B. Russell et al. Tool for users on the web to label images and parts of images. Objective to get a large corpus of images with (reasonably) high-quality textual labels.
Searching For Multimedia: An Analysis Of Audio, Video, And Image Web Queries B. Jansen, A. Goodrum, A. Spink. How users search for multimedia
Clustering Art K. Barnard, P. Duygulu, D. Forsyth. Cluster images on the San Francisco Art Museum web site by image characteristics and text labels.