Experiment: Do Google searches on 50 English words of mid-range frequency (accumulated, alive, ancestor, bushes, etc.) (a) on the entire Web, (b) restricted to English. In every case, the page count for (b) was just about exactly 56% of the page count for (a). First, this is an impossibly small percentage; Yahoo gives about 92% for most of these words, which is much more plausible. Second, it makes no sense that they should all be 56%. Almost identical results for French (58% rather than 56%, and slightly more variation, but not nearly enough).
I tried a small version of the experiment myself (12/9/07). Clearly things have changed though not uniformly for the better.
|Word||Search Web||Search English|
Veronis' explanation for his own observations: The Google database is divided into two parts: the part that has been indexed and the part that hasn't. Conjecture that (a) 56% of the total database has been indexed; (b) almost all the documents containing the test queries are, in fact, in English; (c) searches across the entire web extrapolate from the indexed part to the unindexed part; language-limited searches just use the indexed part. Then that would explain the observation.
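Veronis' conjecture explains the constant ratio with simple arithmetic: if whole-web counts are extrapolated from the indexed part while language-restricted counts are not, the ratio is just the indexed fraction, regardless of the word. A minimal sketch (the 0.56 is the observed figure above; the function names are my own):

```python
# Sketch of Veronis' conjecture: a fraction INDEXED of the database is
# indexed, and virtually all pages matching an English test word are in
# English. (Illustrative numbers; not Google's actual machinery.)
INDEXED = 0.56

def whole_web_count(indexed_hits):
    # Conjecture (c): whole-web searches extrapolate from the indexed part.
    return indexed_hits / INDEXED

def english_count(indexed_hits):
    # Language-restricted searches report only the indexed part.
    return indexed_hits

# Whatever the true number of hits for a given word, the ratio is constant:
for hits in (1_000, 250_000, 9_000_000):
    ratio = english_count(hits) / whole_web_count(hits)
    assert abs(ratio - INDEXED) < 1e-9
```

This would explain both anomalies at once: the small percentage and its uniformity across words.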
[Corpora-list] problems with Google counts and subsequent discussion.
Some more anomalies, suggested by these discussions (12/9/07)
Google search on "the" (language unrestricted) gives a page count of 503,000,000 pages.
Google search on "the" restricted to English gives a page count of 1,740,000,000 pages. Google search on "the the" (language unrestricted) gives 616,000,000 pages.
Cluster documents using "Shared nearest neighbors clustering" -- complicated bottom-up clustering algorithm.
Recall (# of pairs correctly clustered together / # of pairs that should be clustered together) = 0.75
Precision (# of pairs correctly clustered together / # of pairs clustered together) = 0.45.
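A minimal sketch of how pairwise precision and recall over a clustering can be computed (my illustration, not the paper's code):

```python
from itertools import combinations

def pair_set(labels):
    """Set of unordered pairs of item indices that share a cluster label."""
    return {frozenset(p) for p in combinations(range(len(labels)), 2)
            if labels[p[0]] == labels[p[1]]}

def pairwise_scores(predicted, gold):
    """Pairwise precision and recall of a predicted clustering vs. gold."""
    pred_pairs, gold_pairs = pair_set(predicted), pair_set(gold)
    correct = pred_pairs & gold_pairs
    precision = len(correct) / len(pred_pairs) if pred_pairs else 0.0
    recall = len(correct) / len(gold_pairs) if gold_pairs else 0.0
    return precision, recall

# Toy example: 4 documents, gold clusters {0,1} and {2,3},
# predicted clusters {0,1,2} and {3}: precision 1/3, recall 1/2.
p, r = pairwise_scores([0, 0, 0, 1], [0, 0, 1, 1])
```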
Relations to be found: (These examples were all actually extracted by the system.)
Semantic patterns: (I omit tense variations)
|SEMANTIC RELATION    ||Surface Pattern|
|narrow similarity||X i.e. Y|
|broad similarity||X and Y|
|strength||X even Y|
|X and even Y|
|Y or at least X|
|not only X but Y|
|not just X but Y|
|enablement||Xed * by Ying the|
|Xed * by Ying or|
|antonymy||either X or Y|
|whether to X or Y|
|X * but Y|
|precedes||X * and [then/later/subsequently/eventually] Y|
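The surface patterns above can be approximated with simple regular expressions. A hedged sketch (the pattern set is a small subset, X and Y are simplified to single word tokens, and this is my illustration rather than the system's actual extractor):

```python
import re

# A few of the surface patterns above, simplified so that X and Y are
# single word tokens. A real extractor would also handle tense variation
# and the "*" wildcards.
PATTERNS = {
    "narrow similarity": r"(\w+) i\.e\. (\w+)",
    "strength":          r"(\w+) and even (\w+)",
    "antonymy":          r"either (\w+) or (\w+)",
    "precedes":          r"(\w+) \w+ and then (\w+)",
}

def extract(sentence):
    """Return (relation, X, Y) triples for every pattern match."""
    hits = []
    for relation, pat in PATTERNS.items():
        for m in re.finditer(pat, sentence):
            hits.append((relation, m.group(1), m.group(2)))
    return hits

extract("you can either walk or drive")  # [('antonymy', 'walk', 'drive')]
```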
The Structure of Collaborative Tagging Systems Scott Golder and Bernardo Huberman
User activity: Number of different tags employed by a single user seems to follow an inverse power law (as expected), from a high of almost 900.
Number of bookmarks only weakly correlated with number of different tags.
Types of tags
Trends: Most URLs receive most of their tags very quickly. Some lie dormant for a while, then suddenly become very popular at a certain point.
Strategies for lifelong knowledge extraction from the web Michele Banko and Oren Etzioni
Task: ALICE creates a theory for a specified domain: Nutrition.
Buzzword: "Lifelong Knowledge Extraction"
Concept Discovery: Import classes, IS-A (subclass) relations from WordNet. Also, as in KNOWITALL, find classes and IS-A relations by matching patterns in Web text e.g. "fruit such as < y >", "buckwheat is an < x >". In this way, determine that buckwheat is a whole grain, gluten-free grain, fiber-rich food and nutritious food, where these are newly created categories.
Use KNOWITALL to collect relations among individuals and small classes from Web text.
Generalize to larger classes.
E.g. KNOWITALL collects "Oranges provide Vitamin C", "Bananas provide a source of B vitamins", "An avocado provides niacin". Using the known facts that oranges, bananas, and avocados are fruit and that Vitamin C, B vitamins, and niacin are vitamins, deduce PROVIDE(< FRUIT >, < VITAMIN >).
(Of course, it's not clear how the quantifiers are supposed to work here. It is certainly not true that
forall(F,V) fruit(F) ^ vitamin(V) => provide(F,V).
What is probably closest to the truth is
forall(F) fruit(F) => exists(V) vitamin(V) ^ provide(F,V).
but there is no indication how ALICE would figure that out.)
Have to be careful to avoid over-generalization: e.g. "Provide(< FOOD >, < SUBSTANCE >)" or "Provide(< ENTITY >, < ENTITY >)".
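The generalization step can be sketched as lifting instance-level facts through IS-A links and keeping only class pairs with enough supporting instances. The tiny IS-A table and the support threshold below are my own illustrative assumptions, not ALICE's actual machinery:

```python
from collections import Counter

# Hypothetical instance-to-class table (imported from WordNet or mined
# from Web text in the real system).
ISA = {
    "orange": "fruit", "banana": "fruit", "avocado": "fruit",
    "Vitamin C": "vitamin", "B vitamins": "vitamin", "niacin": "vitamin",
}

# Instance-level PROVIDE facts collected by KNOWITALL.
facts = [("orange", "Vitamin C"), ("banana", "B vitamins"),
         ("avocado", "niacin")]

def generalize(facts, isa, min_support=3):
    """Lift instance facts to class-level pairs with enough support."""
    counts = Counter((isa[x], isa[y]) for x, y in facts
                     if x in isa and y in isa)
    # Requiring several supporting instances is one crude guard against
    # over-generalizations like Provide(<ENTITY>, <ENTITY>).
    return [pair for pair, n in counts.items() if n >= min_support]

generalize(facts, ISA)  # [('fruit', 'vitamin')]
```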
Constructed 696 new generalizations.
78% were meaningful, true, and relevant.
6% were off-topic e.g. "Cause(< Organism >, < Disease >)".
9.5% were vacuous e.g. "Provide(< Food >, < Substance >)".
3% were incomplete e.g. "Provide(< Antioxidant >, < Body Part >)".
3.5% were false e.g. "BeNot(< Fruit >, < Food >)".
Observation: In a newsgroup, if X quotes Y, it is much more often to disagree than to agree (74% to disagree, 7% to agree, 19% off topic).
Therefore: Construct the graph whose nodes are posters and where there is an arc from X to Y if X quotes Y. Divide the set of posters into two disjoint classes F (for) and A (against) in such a way that the maximal number of arcs go from one class to the other.
This problem is known as "MAX-CUT" and is NP-complete, but there are approximation algorithms.
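One standard approximation is local search: start from a random 2-coloring and flip any node whose flip increases the number of crossing arcs; a local optimum is guaranteed to cut at least half the arcs. A minimal sketch (my illustration, not necessarily the approximation the paper used):

```python
import random

def local_search_max_cut(nodes, arcs, iters=1000, seed=0):
    """Greedy local search for MAX-CUT: flip any node whose flip
    increases the number of arcs crossing the cut. A local optimum
    cuts at least half the arcs."""
    rng = random.Random(seed)
    side = {v: rng.choice((0, 1)) for v in nodes}

    def gain(v):
        # Net number of newly crossing arcs if v switches sides.
        cross = same = 0
        for x, y in arcs:
            if v in (x, y):
                other = y if v == x else x
                if side[v] != side[other]:
                    cross += 1
                else:
                    same += 1
        return same - cross

    for _ in range(iters):
        improved = False
        for v in nodes:
            if gain(v) > 0:
                side[v] = 1 - side[v]
                improved = True
        if not improved:
            break
    return side  # 0/1 labels, e.g. F vs. A
```

Applied to the quoting graph, side 0 and side 1 of the returned partition would be read off as the F and A classes.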
Experiment: Three newsgroups: abortion (2525 authors), gun control (2632 authors), and immigration (1648 authors).
Two versions of the algorithm: the pure version, and a "constrained" version in which 50 random authors were manually labelled and the maximal cut respecting that labelling was returned.
Compare text-based SVM and Naive Bayes: both useless.
PROACT SYSTEM learns to recognize human activities.
Two parts of the system:
Model of activity: Sequence of steps. Each step is characterized by (a) a set of objects involved, labelled with probabilities; (b) [optional] a characteristic duration (mean & standard deviation).
E.g. Making tea:
Step 1: teapot (0.7), stove (0.8), faucet (0.6).
Step 2: cup (0.9), teabag (0.6), water (0.6)
Step 3: milk (0.5), sugar (0.4).
Models extracted from web text. Collect directions for various activities from ehow.com, ffft.com, and epicurious.com. E.g.
Making tea: (Qy: Is this actually off a web site?)
- Fill a teapot from the faucet. Place kettle on the stove and boil
- Pour hot water into a cup, filling 3/4 of the cup. Immerse teabag in cup for two minutes and dispose of teabag.
- Add milk and sugar to taste
Probabilities : Prob. that object O is used in activity A is GoogleCount(O&A) / GoogleCount(A).
Human activity recognition
Label all objects in environment with RFID tags:
RFID tags ... can be attached unobtrusively to objects as small as a toothbrush. The tags are wireless and battery free. When queried by radio, the tag responds with a globally unique identifier using power scavenged from the querying signal ... They can be purchased off the shelf, cost roughly $0.40 each, have the form factor of postage stamps (including adhesive backing) and can withstand day-to-day use for years. PROACT deployment involves tagging tens to hundreds of objects in their environment ... Current trends indicate that within a few years, many household objects may be RFID-tagged before purchase, thus eliminating the overhead of tagging [Yuck -- ED].
Two readers: A fixed long-range reader and one attached to a special glove.
Human subject carries out a variety of activities in tagged environment wearing RFID reader glove. Match activity to most probable model.
Results: Out of a space of 66 models, the correct activity is identified 50% of the time.
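The matching step can be sketched as scoring each model by the probability of the observed objects and picking the best. The models and the scoring rule below are illustrative assumptions of mine, not PROACT's actual inference (which also uses step order and durations):

```python
import math

# Toy activity models: per-step object probabilities (from the tea example
# above; the coffee model is my own invented second model for contrast).
MODELS = {
    "making tea": [{"teapot": 0.7, "stove": 0.8, "faucet": 0.6},
                   {"cup": 0.9, "teabag": 0.6, "water": 0.6},
                   {"milk": 0.5, "sugar": 0.4}],
    "making coffee": [{"kettle": 0.8, "faucet": 0.6},
                      {"cup": 0.9, "coffee": 0.9, "water": 0.6}],
}

def score(model, observed, miss_prob=0.05):
    """Log-probability of the observed objects under a model; objects the
    model never mentions get a small smoothing probability."""
    probs = {o: p for step in model for o, p in step.items()}
    return sum(math.log(probs.get(o, miss_prob)) for o in observed)

def recognize(observed):
    """Return the model name with the highest score."""
    return max(MODELS, key=lambda name: score(MODELS[name], observed))

recognize(["teapot", "stove", "cup", "teabag"])  # → "making tea"
```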
Web Crawling Ethics Revisited: Cost, Privacy, and Denial of Service Mike Thelwall and David Stuart
Ethical issues in web data mining Lita van Wel and Lambèr Royakkers
Shaping the Web: Why the politics of search engines matters Lucas Introna and Helen Nissenbaum
Hate speech: At one point the top Google page returned on query "jew" was JewWatch.com, an anti-semitic site. When this was noticed, it led to intense Google bombing on the part of both camps (the pro-Jew camp created links to the Wikipedia article). At the moment (12/10/07) the top Google sites are, first, the Wikipedia article; second, JewWatch; third, Google's explanation of the issue. The question is
Copyright: Search engines collect and cache documents to which they hold no copyright. There is an opt-out option, but naive users posting a page on the web may not realize that they are opening themselves to this (also a privacy issue).
Inadvertent permanence: If a web-page author decides to delete or modify a page, it may survive in the Internet Archive or elsewhere. (Of course, that's true of other media as well, but an internet author may well be less likely to realize it.)
Crawler costs: Crawlers impose a cost on network resources and servers. In the extreme case, this can amount to a denial of service attack.
The decision what to index and how to rank is anti-democratic and reinforces the existing power structure:
Granted all that, and granted that those involved in creating and running search engines should be sensitive to these issues, the real questions, it seems to me, are
(a) Is the Web, as mediated by search engines, more slanted toward the existing power structures than any other source of information (e.g. print and broadcast media)?
(b) Is there any alternative way to structure search engines to alleviate these issues? Or would the state of things be better if there were no search engines?
Even if all the authors are acting in good faith and are each justified in publishing their information, the combination of information across information published by separate authors may lead to violations of privacy.
The collection and analysis of large amounts of data, even if it is public, can be a violation of privacy. E.g. it is illegal for the FBI to infiltrate political/religious/social organizations, even if the organization is entirely open to the public. The same applies in cyberspace.
The characterization of personal information in terms of demographic categories and then the application of generalizations based on these to individuals can have results that are unfair, prejudicial, socially counterproductive, or illegal. Particularly with automated systems, this can happen in ways that no human being realizes (or can be held accountable for), and in ways that are hidden, subtle, and hard to detect in their effect.
The security guarantees on information that is really supposed to be private are way too weak.