And Now For Something Completely Different

Google page count inaccuracy

Web: Google's missing pages: mystery solved? Jean Veronis

Experiment: Do Google searches on 50 English words of mid-range frequency (accumulated, alive, ancestor, bushes, etc.) (a) over the entire Web; (b) restricted to English. In every case, the page count for (b) was almost exactly 56% of the page count for (a). First, this is an implausibly small percentage; Yahoo gives about 92% for most of these words, which is much more plausible. Second, it makes no sense that the ratio should be 56% for every word. Almost identical results for French (58% rather than 56%, and slightly more variation, but not nearly enough).

I tried a small version of the experiment myself (12/9/07). Clearly things have changed, though not uniformly for the better.

Word          Web Search    English Search
accumulated    7,080,000     2,090,000
alive         17,100,000    15,300,000
ancestor       1,360,000     1,340,000
bushes         1,370,000     1,190,000
embarrassed    2,040,000     1,460,000
haberdasher      152,000       146,000
ridiculed        481,000     4,160,000 [sic]

Veronis' explanation for his own observations: The Google database is divided into two parts: the part that has been indexed and the part that hasn't. Conjecture that (a) 56% of the total database has been indexed; (b) almost all the documents containing the test queries are, in fact, in English; (c) searches across the entire Web extrapolate from the indexed part to the unindexed part, while language-limited searches use only the indexed part. That would explain the observation.
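Veronis' conjecture can be sanity-checked with a little arithmetic: if unrestricted searches extrapolate from the indexed fraction while restricted searches report raw indexed hits, the ratio is the indexed fraction itself, independent of word frequency. A minimal Python sketch (the counts and helper functions are illustrative assumptions, not Google internals):

```python
# Sketch of Veronis' conjecture: unrestricted searches extrapolate
# from the indexed fraction of the database; language-restricted
# searches do not.  All numbers here are invented for illustration.

INDEXED_FRACTION = 0.56  # conjectured fraction of the database indexed

def unrestricted_count(indexed_hits):
    """Extrapolate from the indexed part to the whole database."""
    return indexed_hits / INDEXED_FRACTION

def restricted_count(indexed_hits):
    """Language-restricted search: report the indexed hits directly."""
    return indexed_hits

# If nearly all pages containing an English test word are English, the
# restricted/unrestricted ratio equals the indexed fraction regardless
# of the word's frequency -- matching the constant 56% observation.
for hits in (1_000, 50_000, 2_000_000):
    ratio = restricted_count(hits) / unrestricted_count(hits)
    assert abs(ratio - INDEXED_FRACTION) < 1e-9
```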

[Corpora-list] problems with Google counts and subsequent discussion.

Some more anomalies, suggested by these discussions (12/9/07)
Google search on "the" (language unrestricted) gives a page count of 503,000,000 pages.
Google search on "the" restricted to English gives a page count of 1,740,000,000 pages.
Google search on "the the" (language unrestricted) gives 616,000,000 pages.

Multilingual clustering

Multilingual document clusters discovery Benoit Mathieu, Romaric Besancon, and Christian Fluhr. Generalize the vector model by defining a distance between document vectors in different languages using an online dictionary.

Cluster documents using "Shared nearest neighbors clustering" -- complicated bottom-up clustering algorithm.

Recall (# of correct pairs clustered together / # of pairs that should be clustered together) = 0.75.
Precision (# of correct pairs clustered together / # of pairs clustered together) = 0.45.
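These pairwise measures can be computed directly from cluster assignments by comparing which document pairs share a system cluster against which share a gold cluster. A minimal sketch (the toy documents and labels are invented):

```python
from itertools import combinations

def pair_scores(system, gold):
    """Pairwise precision/recall for a clustering.

    system, gold: dicts mapping document id -> cluster label.
    A pair counts as "clustered together" if both docs share a label.
    """
    docs = list(system)
    sys_pairs = {p for p in combinations(docs, 2)
                 if system[p[0]] == system[p[1]]}
    gold_pairs = {p for p in combinations(docs, 2)
                  if gold[p[0]] == gold[p[1]]}
    correct = sys_pairs & gold_pairs
    precision = len(correct) / len(sys_pairs) if sys_pairs else 0.0
    recall = len(correct) / len(gold_pairs) if gold_pairs else 0.0
    return precision, recall

# Toy example: the system wrongly merges two gold clusters.
system = {"d1": 0, "d2": 0, "d3": 0, "d4": 1}
gold   = {"d1": 0, "d2": 0, "d3": 1, "d4": 1}
p, r = pair_scores(system, gold)  # p = 1/3, r = 1/2
```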

Extracting Verb Relations

VERBOCEAN: Mining the Web for Fine-Grained Semantic Verb Relations Timothy Chklovski and Patrick Pantel

Relations to be found: (These examples were all actually extracted by the system.)

Semantic patterns: (I omit tense variations)

SEMANTIC RELATION    SURFACE PATTERN
narrow similarity    X i.e. Y
broad similarity     X and Y
strength             X even Y
                     X and even Y
                     Y or at least X
                     not only X but Y
                     not just X but Y
enablement           Xed * by Ying the
                     Xed * by Ying or
antonymy             either X or Y
                     whether to X or Y
                     X * but Y
precedes             X * and [then/later/subsequently/eventually] Y
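Three of the simpler surface patterns above can be instantiated as regular expressions for a given verb pair. A minimal sketch (the pattern rendering and test sentences are my own simplifications; tense variations are ignored, as in the table):

```python
import re

# Three of the simpler VERBOCEAN-style surface patterns, to be
# instantiated for a given verb pair (X, Y).
PATTERNS = {
    "broad similarity": r"\b{X}\b and \b{Y}\b",
    "strength":         r"\b{X}\b and even \b{Y}\b",
    "antonymy":         r"either \b{X}\b or \b{Y}\b",
}

def match_relations(x, y, sentence):
    """Return the names of the relations whose pattern matches."""
    hits = []
    for rel, template in PATTERNS.items():
        pattern = template.format(X=re.escape(x), Y=re.escape(y))
        if re.search(pattern, sentence, re.IGNORECASE):
            hits.append(rel)
    return hits
```

A real system must arbitrate between overlapping patterns and aggregate counts over many hits; this sketch just returns all patterns that match a single sentence.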

Accuracy: 68%.

Collaborative Tagging

Thanks to Antonio Arranz-Matia for pointing this out to me.

The Structure of Collaborative Tagging Systems Scott Golder and Bernardo Huberman


User activity: The number of different tags employed by a single user seems to follow an inverse power law (as expected), ranging down from a high of almost 900.

Number of bookmarks only weakly correlated with number of different tags.

Types of tags

Trends Most URLs receive most of their tags very quickly. Some lie dormant for a while, then suddenly become very popular after a certain point.

Theory Construction by Web Mining

Strategies for lifelong knowledge extraction from the web Michele Banko and Oren Etzioni Task: ALICE creates a theory for a specified domain: Nutrition.

Buzzword: "Lifelong Knowledge Extraction"

Concept Discovery: Import classes and IS-A (subclass) relations from WordNet. Also, as in KNOWITALL, find classes and IS-A relations by matching patterns in Web text, e.g. "fruit such as < y >", "buckwheat is an < x >". In this way, determine that buckwheat is a whole grain, gluten-free grain, fiber-rich food, and nutritious food, where these are newly created categories.

Generalization: Use KNOWITALL to collect relations among individuals and small classes from Web text.
Generalize to larger classes.
E.g. KNOWITALL collects "Oranges provide Vitamin C", "Bananas provide a source of B vitamins", "An avocado provides niacin". Using the known facts that oranges, bananas, and avocados are fruit and that Vitamin C, B vitamins, and niacin are vitamins, deduce PROVIDE(< FRUIT >, < VITAMIN >).
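The generalization step can be sketched as lifting instance-level triples through an IS-A map and keeping class-level relations with enough instance support. A minimal sketch (the support threshold and the flat IS-A dictionary are assumptions, not ALICE's actual machinery):

```python
from collections import Counter

# Instance-level facts mined from text: (subject, relation, object).
facts = [
    ("orange", "provide", "vitamin C"),
    ("banana", "provide", "B vitamins"),
    ("avocado", "provide", "niacin"),
]

# IS-A links, e.g. imported from WordNet or mined with text patterns.
isa = {
    "orange": "fruit", "banana": "fruit", "avocado": "fruit",
    "vitamin C": "vitamin", "B vitamins": "vitamin", "niacin": "vitamin",
}

def generalize(facts, isa, min_support=3):
    """Lift instance triples to class-level relations with enough support."""
    support = Counter(
        (isa[s], rel, isa[o])
        for s, rel, o in facts
        if s in isa and o in isa
    )
    return {triple for triple, n in support.items() if n >= min_support}

# With three supporting instances this yields ("fruit", "provide", "vitamin").
```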

(Of course, it's not clear how the quantifiers are supposed to work here. It is certainly not true that

forall(F,V) fruit(F) ^ vitamin(V) => provide(F,V).
What is probably closest to the truth is
forall(F) fruit(F) => exists(V) vitamin(V) ^ provide(F,V).
but there is no indication how ALICE would figure that out.)

Have to be careful to avoid over-generalization: e.g. "Provide(< FOOD >, < SUBSTANCE >)" or "Provide(< ENTITY >, < ENTITY >)".

Results Constructed 696 new generalizations.
78% were meaningful, true, and relevant.
6% were off-topic e.g. "Cause(< Organism >, < Disease >)".
9.5% were vacuous e.g. "Provide(< Food >, < Substance >)".
3% were incomplete e.g. "Provide(< Antioxidant >, < Body Part >)".
3.5% were false e.g. "BeNot(< Fruit >, < Food >)".

Network-Based Analysis of Newsgroup Opinion

Mining Newsgroups Using Networks Arising from Social Behavior. Rakesh Agrawal et al.

Observation: In a newsgroup, if X quotes Y, it is much more often in order to disagree than to agree (74% disagree, 7% agree, 19% off topic).

Therefore: Construct the graph whose nodes are posters and where there is an arc from X to Y if X quotes Y. Divide the set of posters into two disjoint classes F (for) and A (against) in such a way that the maximal number of arcs go from one class to the other.

This problem is known as "MAX-CUT" and is NP-complete, but there are approximation algorithms.
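One standard approximation is greedy local search: start with an arbitrary split and keep moving any poster whose move increases the number of crossing quote arcs. A minimal sketch (this simple heuristic is my illustration, not the EV algorithm used in the paper; the graph is invented):

```python
def local_search_max_cut(nodes, edges):
    """Greedy local search for MAX-CUT.

    nodes: iterable of posters; edges: list of (x, y) quote arcs.
    Returns a dict mapping each node to side 0 ("for") or 1 ("against").
    Repeatedly flips any node whose flip increases the number of edges
    crossing the cut; terminates at a local optimum.
    """
    side = {v: 0 for v in nodes}
    improved = True
    while improved:
        improved = False
        for v in side:
            cross = sum(1 for x, y in edges
                        if v in (x, y) and side[x] != side[y])
            same = sum(1 for x, y in edges
                       if v in (x, y) and side[x] == side[y])
            if same > cross:          # flipping v gains (same - cross)
                side[v] = 1 - side[v]
                improved = True
    return side

# On a 4-cycle of mutual disagreement, the search finds the full cut.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
side = local_search_max_cut(["a", "b", "c", "d"], edges)
```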

Experiment: Three newsgroups: abortion (2525 authors), gun control (2632 authors), and immigration (1648 authors).

Two versions of the algorithm: the pure version, and the "constrained" version in which 50 random authors were manually labelled and the maximal cut respecting that labelling was returned.

Compare text-based SVM and Naive Bayes: both useless.

Algorithm             Abortion   Gun Control   Immigration
Majority assignment     57%         72%           54%
SVM                     55%         42%           55%
Naive Bayes             50%         72%           54%
EV                      73%         78%           50%
Constrained EV          73%         84%           88%

Mining models of human activities from the web

Mike Perkowitz et al.

PROACT SYSTEM learns to recognize human activities.

Two parts of the system:

Model of activity: Sequence of steps. Each step is characterized by (a) a set of objects involved, labelled with probabilities; (b) [optional] a characteristic duration (mean & standard deviation). E.g. Making tea:
Step 1: teapot (0.7), stove (0.8), faucet (0.6).
Step 2: cup (0.9), teabag (0.6), water (0.6)
Step 3: milk (0.5), sugar (0.4).
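A model of this form can be scored against the objects observed at each step. A minimal sketch, assuming object observations are independent given the step (an illustrative simplification, not PROACT's actual inference procedure):

```python
# The tea-making model from the text: each step maps objects to the
# probability that they are involved in that step.
TEA_MODEL = [
    {"teapot": 0.7, "stove": 0.8, "faucet": 0.6},
    {"cup": 0.9, "teabag": 0.6, "water": 0.6},
    {"milk": 0.5, "sugar": 0.4},
]

def step_score(step, observed):
    """Naive score of one step: P(seen objects) * P(unseen objects absent)."""
    score = 1.0
    for obj, p in step.items():
        score *= p if obj in observed else (1.0 - p)
    return score

def model_score(model, observed_steps):
    """Multiply per-step scores for a sequence of observation sets."""
    score = 1.0
    for step, observed in zip(model, observed_steps):
        score *= step_score(step, observed)
    return score

# Observing teapot, stove, and faucet in step 1 scores 0.7*0.8*0.6.
```

Recognition then amounts to picking the model with the highest score for the observed sequence.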

Model extraction
Models extracted from web text. Collect directions for various activities from how-to web sites. E.g.

Making tea:
  1. Fill a teapot from the faucet. Place kettle on the stove and boil
  2. Pour hot water into a cup, filling 3/4 of the cup. Immerse teabag in cup for two minutes and dispose of teabag.
  3. Add milk and sugar to taste
(Qy: Is this text actually off a web site?)

Probabilities : Prob. that object O is used in activity A is GoogleCount(O&A) / GoogleCount(A).
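The estimate is just a conditional frequency over hit counts. A sketch with a stubbed count table standing in for Google hit counts (the numbers are invented):

```python
# P(object O used in activity A) ~ Count(O & A) / Count(A), where the
# counts are web hit counts.  Here the count source is a stub dictionary;
# the real system queries Google.
HIT_COUNTS = {
    ("make tea",): 100_000,
    ("make tea", "teapot"): 70_000,
    ("make tea", "stove"): 80_000,
}

def object_prob(activity, obj):
    """Estimate P(obj | activity) from co-occurrence hit counts."""
    joint = HIT_COUNTS.get((activity, obj), 0)
    marginal = HIT_COUNTS.get((activity,), 0)
    return joint / marginal if marginal else 0.0
```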

Human activity recognition

Label all objects in environment with RFID tags:

RFID tags ... can be attached unobtrusively to objects as small as a toothbrush. The tags are wireless and battery free. When queried by radio, the tag responds with a globally unique identifier using power scavenged from the querying signal ... They can be purchased off the shelf, cost roughly $0.40 each, have the form factor of postage stamps (including adhesive backing), and can withstand day-to-day use for years. PROACT deployment involves tagging tens to hundreds of objects in their environment ... Current trends indicate that within a few years, many household objects may be RFID-tagged before purchase, thus eliminating the overhead of tagging [Yuck -- ED].

Two readers: A fixed long-range reader and one attached to a special glove.

Human subject carries out a variety of activities in tagged environment wearing RFID reader glove. Match activity to most probable model.

Results: Out of a space of 66 models, the correct activity is identified 50% of the time.

Ethics and Politics

Web links and search engine ranking: The case of Google and the query "jew" Judit Bar-Ilan

Web Crawling Ethics Revisited: Cost, Privacy, and Denial of Service Mike Thelwall and David Stuart
Ethical issues in web data mining Lita van Wel and Lambèr Royakkers

Shaping the Web: Why the politics of search engines matters Lucas Introna and Helen Nissenbaum

Ethics of Search Engines

Hate speech: At one point the top Google page returned on the query "jew" was JewWatch, an anti-semitic site. When this was noticed, it led to intense Google bombing on the part of both camps (the pro-Jewish camp created links to the Wikipedia article). At the moment (12/10/07) the top Google results are, first, the Wikipedia article; second, JewWatch; third, Google's explanation of the issue. The question is what, if anything, search engines should do in such cases.

Copyright: Search engines collect and cache documents to which they hold no copyright. There is an opt-out option, but naive users posting a page on the web may not realize that they are opening themselves to this (also a privacy issue).

Inadvertent permanence: If a web-page author decides to delete or modify a page, it may survive in the Internet Archive or elsewhere. (Of course, that's true of other media as well, but an internet author may well be less likely to realize it.)

Crawler costs: Crawlers impose a cost on network resources and servers. In extreme cases, this can amount to a denial-of-service attack.

The decision of what to index and how to rank it is anti-democratic and reinforces the existing power structure.

Granted all that, and granted that those involved in creating and running search engines should be sensitive to these issues, the real questions, it seems to me, are
(a) Is the Web, as mediated by search engines, more slanted toward the existing power structures than any other source of information (e.g. print and broadcast media)?
(b) Is there any alternative way to structure search engines to alleviate these issues? Or would the state of things be better if there were no search engines?

Ethical issues in web mining

Granted that the authors of web pages have decided to publish the content on their web pages, and therefore have implicitly allowed the world to read and use that content. However, the authors themselves may be committing violations of copyright, of privacy, of hate speech, of slander, or of any of a number of other types of unethical/illegal forms of expression. Therefore the use of this information by any other person is likewise tainted and potentially dangerous. It may be difficult to guard against this in automated systems. Particularly if the web miner and its strategies become known, one could have "miner bombs" analogous to Google bombs.

Even if all the authors are acting in good faith and are each justified in publishing their information, the combination of information across information published by separate authors may lead to violations of privacy.

The collection and analysis of large amounts of data, even if it is public, can be a violation of privacy. E.g. it is illegal for the FBI to infiltrate political/religious/social organizations, even if the organization is entirely open to the public. The same applies in cyberspace.

The characterization of personal information in terms of demographic categories and then the application of generalizations based on these to individuals can have results that are unfair, prejudicial, socially counterproductive, or illegal. Particularly with automated systems, this can happen in ways that no human being realizes (or can be held accountable for), and in ways that are hidden, subtle, and hard to detect in their effect.

The security guarantees on information that is really supposed to be private are way too weak.