Of course, individuating languages is often vague, and is often done more on political than linguistic grounds. Ethnologue: Languages of the World is an online reference work for the world's languages.
Google (2011) indexes 45 languages (42 in 2007; it has added Afrikaans, Hindi, and Swahili). Only 11 are non-European. AllTheWeb (2003) indexed 48. Bing (2011) indexes 40.
1. Google search on language "without xcvb" (12/2007). Google no longer supports this (2011).
(Note: Google page
counts are a very unreliable measure.
There seems to be evidence that the counts returned by Google vary substantially over time due to changes made to the index and the database of the search engine and depending on which Google server is accessed. Also, it has been observed that the Boolean operators supported by Google return unexpected results. For example, the Boolean query "Chirac OR Sarkozy" returns a lower page count than the simple query "Chirac", contrary to the logic of the OR operator. The negation operator has a similar unexpected effect; for example, the query "applesauce -aosdnao" returns more hits than "applesauce" on its own. -- Web Based Models for Natural Language Processing, Mirella Lapata and Frank Keller
I verified both of these on Google 12/2/07.
"Sarkozy" reports 5,240,000 pages,
"Sarkozy OR Chirac" reports 4,480,000 pages.
"Applesauce" reports 287,000 pages, "applesauce -aosdnao" reports 4,370,000 pages.)
2. Estimate by Mas, based on AllTheWeb, 2003, cited in "Estimating Linguistic Diversity on the Internet" by Peter Gerrand.
All counts in thousands of pages
|Language||Google count||Google %||Mas count||Mas %|
Note that the Google total is nothing like the number of pages Google presumably actually indexed in 2007 (at least 8 billion).
Of the 16 languages with the largest speaker populations (World Almanac),
5 were not on this list:
Hindi (180,000,000 speakers)
Bengali (170,000,000 speakers)
Javanese (75,000,000 speakers)
Telugu (69,000,000 speakers)
Marathi (68,000,000 speakers)
(In 2011, Hindi is indexed by Google. For Hindi and Telugu, see Kilgarriff et al. below.)
In 1997 and 2002, two experiments were carried out by pinging randomly generated IP addresses and testing the language of the home page. Unlike the other measures here, which rely on search engines with who knows what kind of bias, this one is genuinely random. However, it only measures what it measures --- the language of hosts' home pages, not the fraction of Web pages. As far as I can find out, no such experiment has been carried out since.
You would think that UNESCO would do this kind of thing. But their recent paper, "Twelve years of measuring linguistic diversity on the Internet: Balance and Perspective" by Daniel Pimienta, Daniel Prado, and Alvaro Blanco, discusses only eight languages --- English, Spanish, French, Italian, Portuguese, Romanian, German, and Catalan --- and is more interested in scoring political points.
Current research issues and trends in non-English Web searching Fotis Lazarinis, Jesús Vilares, John Tait and Efthimis N. Efthimiadis Information Retrieval Volume 12, Number 3, 230-250, DOI: 10.1007/s10791-009-9093-0
Language identification: Easy to get 100% accuracy in distinguishing French from English. Difficult for (a) languages which are very similar; (b) languages where there is only a small corpus.
Alphabetic encoding: Very boring, very difficult. Multiple encodings for the same language. 95% of Indian language content is not searchable, largely for this reason. Many encodings are proprietary. In Greek, the upper case letters get confounded with the corresponding Latin upper case letters, which is fine, until you regularize to lower case; "ΑΒΑΚΑΣ" gets regularized to "abakaσ" (a mixture of Latin and Greek) instead of "αβακασ". File names tend to be in the Latin alphabet, because operating systems are less enlightened than search engines; this wreaks havoc for image search, which relies substantially on file names.
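The Greek/Latin confusion is easy to see directly in code. A minimal sketch (the strings are chosen to mirror the ΑΒΑΚΑΣ example; which lower-case sigma you get in the mixed case depends on the implementation's case-mapping rules):

```python
# Greek capitals that look identical to Latin capitals occupy different
# Unicode code points, so a word typed with the wrong alphabet stops
# matching once text is regularized to lower case.
latin_mixed = "ABAKA\u03a3"  # Latin A,B,A,K,A followed by Greek capital sigma
greek = "\u0391\u0392\u0391\u039a\u0391\u03a3"  # ΑΒΑΚΑΣ, all Greek letters

print(latin_mixed == greek)        # False, despite identical appearance
print(latin_mixed.lower())         # the Latin letters stay Latin
print(greek.lower())               # the all-Greek word lower-cases to Greek
```

Any index that regularizes to lower case will therefore file the two visually identical words under different keys.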
Just cutting and pasting "Jesús" from the bibliographic entry into this HTML file ran into trouble; it came out as "JesÃºs".
Dealing with Unicode is apparently one of the major timesinks in international web service development of any kind.
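The "Jesús" accident above is the classic UTF-8/Latin-1 mojibake, easy to reproduce; a minimal sketch:

```python
# "Jesús" encoded as UTF-8 but decoded as Latin-1: the two-byte UTF-8
# sequence for "ú" (0xC3 0xBA) is read as two Latin-1 characters, "Ã" "º".
original = "Jesús"
mojibake = original.encode("utf-8").decode("latin-1")
print(mojibake)    # JesÃºs

# The damage is reversible if (and only if) you know both encodings:
repaired = mojibake.encode("latin-1").decode("utf-8")
print(repaired)    # Jesús
```

In practice the hard part is that you usually do not know which pair of encodings produced the garbage.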
Text segmentation: Discussed in lecture 2. Even in English, breaking at white space and punctuation is not a reliable rule. In other languages it is much worse. (A) No white space between words (Chinese, other Asian languages). (B) Heavy use of compound words (Dutch, German, etc.). One alternative that has been suggested is just to give up on words altogether and allow search on arbitrary sequences of letters.
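The give-up-on-words alternative amounts to indexing overlapping character n-grams instead of tokens. A minimal sketch of the idea (the function name and the choice n=4 are mine, not from any of the papers):

```python
def char_ngrams(text, n=4):
    """All overlapping character n-grams of the text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

# A query matches a document if the query's n-grams all occur in the
# document's n-gram set, so no word segmentation is ever needed.
doc = char_ngrams("Schifffahrtskapitän")
query = char_ngrams("fahrt")
print(query <= doc)   # True: "fahrt" is found inside the compound
```

This finds "fahrt" inside the German compound without any compound-splitting step; the cost is a much larger index and some spurious matches.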
Stopword lists, stemming, regularization: Discussed in lecture 2. Of course each language presents its own issues, and the effectiveness of different strategies varies.
Crawling: One could imagine that a crawler specialized for a particular minority language could do a better job than Google, but the presumption is mostly that this is not the case. Most efforts to collect corpora for minority languages piggyback off Google.
Lewandowski (2008), in "Problems with the use of web search engines to find results in foreign languages", states that if you use www.google.de and restrict search to English-language documents, only 46% of the results are actually in English. MSN search was worse, at 34%; Ask.com and Yahoo were fine, at 94% and 95%. It is hard to understand why this should be so bad. I was not able to replicate it in a few tries.
Many studies of search in non-English languages. Two major questions:
The Crubadan Project: Corpus building for under-resourced languages Kevin Scannell
Collected corpora for 416 living languages from the Web.
Crubadan 2.0: Status Page Kevin Scannell
Only a very small number (perhaps thirty) of the world's 6000+ living languages enjoy the benefits of modern language technologies such as speech recognition and machine translation. A slightly larger number (less than 100) have managed to assemble the basic resources needed as a foundation for advanced end-user technologies: monolingual and bilingual corpora, machine-readable dictionaries, thesauri, part-of-speech taggers, morphological analyzers, parsers etc.
Name of language in English; ISO 639-3 code (the standard 3-letter identifier for the language); polluting languages (e.g. Spanish is a polluter of Basque; English is a polluter by default).
Additional meta-data by screen scraping Ethnologue site.
Presumably there is a Unicode standard, but for many languages there are large numbers of non-standard usages (legacy or otherwise) -- see paper. Generally necessary to consult with speaker.
Collect documents in the language from the Web: construct a search engine query with random words from the lexicon OR'd together, then AND'd with one stopword.
This often gives high precision in retrieving docs in the target language, e.g. 98% for Irish. Sometimes it is hard to find a good stopword that is not also common in other languages.
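The query-construction step can be sketched as follows (the function name and the sample Irish lexicon fragment are mine, not Scannell's; "agus", Irish for "and", is used as the stopword):

```python
import random

def build_query(lexicon, stopword, k=4, seed=None):
    """Random lexicon words OR'd together, then AND'd with one stopword."""
    rng = random.Random(seed)          # seed only for reproducibility
    words = rng.sample(lexicon, k)
    return "({}) AND {}".format(" OR ".join(words), stopword)

# Hypothetical fragment of an Irish lexicon:
lexicon = ["teanga", "foclóir", "scríbhneoir", "amhrán", "litríocht"]
q = build_query(lexicon, "agus", k=3, seed=0)
print(q)
```

The stopword is what anchors the query to the target language; the random content words just diversify which pages come back.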
Download docs, convert to plain text UTF-8.
Apply a language recognizer: cosine of the angle between trigram-frequency vectors.
If doc is in target language, then use as seed for crawler.
If doc is in nearby language, use as seed for crawler for that language.
Expand lexicon for further queries.
Continue until either (a) enough documents have been retrieved or (b) there are no new URLs and no new queries.
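The language-recognition step in the loop above (cosine between trigram vectors) can be sketched like this. This is a minimal version; real profiles are built from large training corpora, and the three "profile" strings here are stand-ins:

```python
import math
from collections import Counter

def trigram_profile(text):
    """Frequency vector of character trigrams."""
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine(p, q):
    """Cosine of the angle between two sparse trigram vectors."""
    dot = sum(p[t] * q[t] for t in p if t in q)
    norm = math.sqrt(sum(v * v for v in p.values())) * \
           math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

# Tiny illustrative "profiles" -- real ones come from large seed corpora:
english = trigram_profile("the language of the document is identified")
french = trigram_profile("la langue du document est identifiée")
doc = trigram_profile("this document is in the english language")

print(cosine(doc, english) > cosine(doc, french))   # True
```

A document is assigned to the language whose profile gives the highest cosine; a near-miss to a neighboring language is what triggers the "seed the crawler for that language" branch.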
Gather large corpora for eight languages: Dutch, Hindi, Indonesian, Norwegian, Swedish, Telugu, Thai, Vietnamese.
Seed word selection: Get a Wikipedia corpus. Filter stubs etc. Tokenize. Except for Vietnamese, exclude words of fewer than 5 letters (too likely to also be words of some other language). Use the top 1000 words as "high-frequency" words and the next 5000 as "mid-frequency" words.
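The seed-word recipe can be sketched as a frequency ranking over the tokenized dump (the function name is mine, and the token list below is stand-in data, not a real Wikipedia corpus):

```python
from collections import Counter

def select_seed_words(tokens, min_len=5, n_high=1000, n_mid=5000):
    """Rank words by frequency; split into high- and mid-frequency lists."""
    counts = Counter(t.lower() for t in tokens if len(t) >= min_len)
    ranked = [w for w, _ in counts.most_common()]
    return ranked[:n_high], ranked[n_high:n_high + n_mid]

# Stand-in "corpus" with tiny cutoffs to show the mechanics:
tokens = "hallo wereld hallo taalkunde wereld wereld corpus".split()
high, mid = select_seed_words(tokens, min_len=5, n_high=2, n_mid=2)
print(high, mid)
```

The length filter is doing the same job as in the paper: short words are too likely to also be words of some other language, so they make poor language-specific query terms.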
Cleaning the text: Again, a dull but major issue throughout this kind of work. Remove HTML markup, navigation bars, boilerplate. Boilerplate is identified as high ratio of tags to text at the beginning and end of a page (difficulties in the case of pages with boilerplate in the middle).
Eliminate word lists, identified as pages with too few function words. (Queries on 3 random mid-frequency words tend to turn them up.)
Eliminate duplicates and near-duplicates: Perl Text::DeDuper module. Uses the shingle method, as discussed earlier in the semester.
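The shingle method can be sketched in a few lines: represent each document as its set of overlapping word k-grams ("shingles") and compare the sets by Jaccard similarity. This is a bare-bones version (real systems hash the shingles and sample them for scale):

```python
def shingles(text, k=5):
    """Set of overlapping word k-grams ("shingles") of the text."""
    words = text.split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    return len(a & b) / len(a | b)

d1 = shingles("the quick brown fox jumps over the lazy dog today")
d2 = shingles("the quick brown fox jumps over the lazy dog tonight")
d3 = shingles("an entirely different sentence about something else here now")
print(jaccard(d1, d2))   # high: near-duplicates differing in one word
print(jaccard(d1, d3))   # 0.0: unrelated documents
```

Pairs whose similarity exceeds a threshold are treated as near-duplicates and one copy is dropped.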
|URLs||Filtered||Dups Removed||MB||m Words||Wiki Corpus|
Nice brief account of a (comparatively) lightweight crawler and query server. However, what is most interesting is the contrast in objectives with standard web search engines (i.e. those being used by people seeking information rather than linguistic phenomena):
System achieves 330 GBytes/day.
Collect a large corpus of documents in a language, for linguistic study, subject to the following constraints:
The Web as a parallel corpus Philip Resnik and Noah A. Smith
Task: Collect pairs of web pages that are translations of one another in different languages.
Parent and sibling pages
"A parent page is one that contains hypertext links to different-language
versions of a document."
Altavista query: (anchor:"english" OR anchor:"anglais") AND (anchor:"french" OR anchor:"francais"). Extract those in which these anchors are within 10 lines of one another.
"A sibling page is a page in one language that itself contains a link to a
version of the same page in another language."
Request pages in French that match the expression anchor:"english" OR anchor:"anglais".
Candidate pages at promising sites
Having found a promising site --- e.g. from the search for parents and siblings
or from known sites like government sites in Canada --- crawl the site.
Then match URLs using manually constructed substitution rules.
E.g. if you have an English-Chinese site and you have a URL
http://mysite.com/english/index.html then look for http://mysite.com/chinese/index.html.
If you have a URL http://mysite.com/index_en.html look for site http://mysite.com/index_ch.html. And so on.
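The substitution-rule idea can be sketched with a small rule table. The rules below are illustrative, in the spirit of the examples above, not an exhaustive or canonical list:

```python
# Manually constructed substitution rules: patterns that distinguish the
# English version of a URL from its counterpart in another language.
RULES = [
    ("/english/", "/chinese/"),
    ("_en.html", "_ch.html"),
    ("lang=en", "lang=zh"),
]

def candidate_translations(url):
    """URLs that might be the other-language counterpart of an English URL."""
    out = []
    for en_pat, ch_pat in RULES:
        if en_pat in url:
            out.append(url.replace(en_pat, ch_pat))
    return out

print(candidate_translations("http://mysite.com/english/index.html"))
print(candidate_translations("http://mysite.com/index_en.html"))
```

Each candidate URL is then fetched; whether the pair really is a translation is decided by the structural filtering step below.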
Structural filtering The presumption is that (a) when the text on the page is translated, the HTML structure will be left more or less the same; (b) that the translations are of reasonably equal lengths.
Given a candidate pair of pages, linearize the HTML (matching strings is easier than matching trees) and use dynamic programming to find the optimal alignment. Characterize the quality of the alignment in terms of:
English/French: Precision=100%, Recall=68.6%. English/Chinese: Precision=98%, Recall=61%. English/Spanish: Precision=100%, Recall=60%
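The structural-filtering idea can be sketched by linearizing each page into a sequence of tag tokens and text chunks and then aligning the two sequences. This sketch uses difflib's matching ratio as a stand-in for the paper's own dynamic-programming alignment, and a crude regex in place of a real HTML parser:

```python
import difflib
import re

def linearize(html):
    """Flatten HTML into a sequence of tag names and text markers."""
    tokens = []
    for tag, text in re.findall(r"<(/?\w+)[^>]*>|([^<]+)", html):
        if tag:
            tokens.append(tag.lower())
        elif text.strip():
            tokens.append("TXT")   # could also record text-length buckets
    return tokens

def alignment_score(html_a, html_b):
    """Similarity of the two pages' linearized structures, in [0, 1]."""
    a, b = linearize(html_a), linearize(html_b)
    return difflib.SequenceMatcher(None, a, b).ratio()

en = "<html><body><h1>Title</h1><p>Some text.</p></body></html>"
fr = "<html><body><h1>Titre</h1><p>Du texte.</p></body></html>"
other = "<html><body><table><tr><td>x</td></tr></table></body></html>"
print(alignment_score(en, fr))      # 1.0: identical tag structure
print(alignment_score(en, other))   # lower: different structure
```

A candidate pair is accepted only if the structures align well and the aligned text chunks have roughly comparable lengths.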
Comparable corpus: Pairs of texts from different languages with similar vocabulary. E.g. News stories about the same event from different newspapers.
Weaker than parallel corpus, but attainable in specialized domains, which parallel corpora generally aren't.
Lexical translation with application to image search on the web Oren Etzioni et al.
The Web in Theoretical Linguistics Research: Two Case Studies using the Linguist's Search Engine Philip Resnik et al. (Offline temporarily as of 12/3/07).
Toy example: Look for sentences where "titrates" is used intransitively.
User inputs sample sentence "John titrates the solution".
Program generates and displays parse tree.
User edits the parse tree. Program interprets it as the query (S NP (VP (VBZ titrates+VERB) (! NP)))
(I.e. "Look for any clause (S) with an NP followed by a VP where the verb is any form of "titrates" and where there is no NP as object.")
LSE has an existing corpus of several million parsed sentences.
Alternatively, the user can request that it:
search the web under "titrate",
extract all sentences containing "titrate",
parse and index these sentences,
return sentences satisfying the specification.
There are actually some: "You titrate with sulfuric acid using gloves and safety glasses" etc.
Linguistically interesting example 1: Comparative correlative with "then"
A "comparative correlative" is a construction like "The more pizza Romeo eats, the fatter he gets". This is thought to be problematic for Chomsky's theory of Universal Grammar --- don't ask me.
However the argument loses weight (I think --- I may be backwards on this) if examples of the construction can be found where the second part is introduced with "then". Some examples are indeed found on the web:
"The darker the coffee bean, then the less caffeine."
"The more playing time in the past, then the less regression to the mean needed"
Psycho-linguistically interesting example 2: For purposes of designing an experiment, it was desired to collect natural examples of sentences like "It [was/is/seemed ...] [clear/obvious/necessary ...] to [him / other NPs]".
Googleology is bad science Adam Kilgarriff, ACL, 2006.