The Multi-Lingual Web

Lots of Languages

As of April 2011: There are 6909 known living languages according to Ethnologue, of which somewhat fewer than 1000 have a written form used by the native population (as opposed to linguists, anthropologists, missionaries, etc.). Parts of the Bible have been translated into 3168+ languages, the full Protestant canon into 457 languages (July 2010). Library of Congress advanced search supports 480 language categories (a couple of dozen of these are not individual real languages). In Wikipedia, there are 280 languages. (Curiously, the language with the 4th greatest number of articles, after English, German, and French, is Polish.) Arika Okrent's In the Land of Invented Languages lists 500 invented languages of varying degrees of completeness and reality, out of a list of more than 900 in Aleksander Dulichenko, International Auxiliary Languages, 1990.

Of course, individuating languages is often vague, and often done more on political than linguistic grounds.

Ethnologue: Languages of the World. Online reference work for languages.

Estimating diversity

Estimating Linguistic Diversity on the Internet Peter Gerrand

Distinguish

Seems hardly worth saying except that people do confound them. We will only be concerned with the third.

Google (2011) indexes 45 languages (42 in 2007; it has added Afrikaans, Hindi, and Swahili). Only 11 are non-European. AllTheWeb (2003) indexed 48. Bing (2011) indexes 40.

Two estimates
1. Google search on language "without xcvb" (12/2007). Google no longer supports this (2011).

(Note: Google page counts are a very unreliable measure.

There seems to be evidence that the counts returned by Google vary substantially over time due to changes made to the index and the database of the search engine and depending on which Google server is accessed. Also, it has been observed that the Boolean operators supported by Google return unexpected results. For example, the Boolean query "Chirac OR Sarkozy" returns a lower page count than the simple query "Chirac", contrary to the logic of the OR operator. The negation operator has a similar unexpected effect, for example, the query "applesauce -aosdnao" returns more hits than "applesauce" on its own.
-- Web Based Models for Natural Language Processing , Mirella Lapata and Frank Keller

I verified both of these on Google 12/2/07. "Sarkozy" reports 5,240,000 pages, "Sarkozy OR Chirac" reports 4,480,000 pages.
"Applesauce" reports 287,000 pages, "applesauce -aosdnao" reports 4,370,000 pages.)

2. Estimate by Mas, based on AllTheWeb, 2003, cited in Estimating Linguistic Diversity on the Internet Peter Gerrand

All counts in thousands of pages

Language    Google count  Google %  Mas count  Mas %
Afrikaans            ---       ---        116  0.005
Albanian             ---       ---         53  0.003
Arabic             6,390      0.25      2,470   0.12
Armenian           1,770     0.072        ---    ---
Basque               ---       ---        155  0.007
Belarusian         2,720      0.10        536  0.025
Bulgarian          2,820      0.11      1,120  0.053
Catalan              785     0.031      2,930   0.14
Chinese          353,000      14.5     65,700    3.1
Croatian           2,520      0.09      1,670   0.08
Czech             31,700       1.2     15,600   0.73
Danish             7,470      0.29     12,100   0.57
Dutch             23,600      0.97     41,100    1.9
English        1,430,000      58.6  1,280,000   60.4
Esperanto            357     0.014        ---    ---
Estonian             875     0.035      1,460  0.069
Faroese              ---       ---         66  0.003
Filipino             382     0.015        ---    ---
Finnish            5,610      0.22      5,680   0.27
French            65,500      2.68     99,700    4.7
Frisian              ---       ---         63  0.003
Galician             ---       ---        274  0.013
German           124,000       5.1    182,000    8.6
Greek              2,990      0.11      2,370   0.11
Hebrew             3,220      0.12      4,790   0.23
Hungarian         16,700      0.69      8,540   0.40
Icelandic            518     0.020      1,390  0.066
Indonesian         2,620      0.10      1,040  0.049
Italian           23,100      0.95     41,800   1.98
Japanese          96,400       4.0     69,700    3.3
Korean            16,300      0.67     64,600    3.0
Latin                ---       ---        137  0.006
Latvian              854     0.035        560  0.026
Lithuanian         1,860     0.077      1,080  0.053
Malay                ---       ---        348  0.015
Norwegian          4,610      0.18      8,120  0.383
Persian            1,110     0.044        ---    ---
Polish            64,600       2.6     22,200    1.0
Portuguese         5,860      0.25     37,700   1.78
Romanian           6,740      0.27      2,000  0.097
Russian           47,600       1.9     42,300    2.0
Serbian            1,030     0.041         43  0.002
Slovak             4,380      0.17      5,080   0.24
Slovenian            983     0.040      1,690   0.08
Spanish           37,900       1.6     65,800    3.1
Swahili              ---       ---         14  0.001
Swedish            9,430      0.38     14,900    0.7
Thai               3,610      0.14      3,120   0.15
Turkish           23,500      0.97      4,700   0.22
Ukrainian            990     0.040      1,010  0.048
Vietnamese         2,610      0.10        390  0.018
Welsh                ---       ---         93  0.004
Total          2,437,000            2,118,500

Note that the Google total is nothing like the number of pages Google presumably actually indexed in 2007 (at least 8 billion).

Of the 16 languages with the largest speaker populations (World Almanac), 5 were not on this list:
Hindi (180,000,000 speakers)
Bengali (170,000,000 speakers)
Javanese (75,000,000 speakers)
Telugu (69,000,000 speakers)
Marathi (68,000,000 speakers)
(In 2011, Hindi is indexed by Google. For Hindi and Telugu, see Kilgarriff et al. below.)

In 1997 and 2002 two experiments were carried out by pinging randomly generated IP addresses and testing the language of the home page. Unlike any of the other measures here, this is the only one that is genuinely random; the rest are based on search engines that have who knows what kind of bias. However, it measures only the language of home pages at responding addresses, not the fraction of Web pages in each language. As far as I can find out, no such experiment has been carried out since.
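The sampling step of such an experiment can be sketched as follows. This is only an illustration of drawing addresses uniformly at random, not a reconstruction of the 1997 or 2002 studies; the function name and the choice to skip non-global addresses are assumptions, and the sketch does no pinging or page fetching.

```python
import ipaddress
import random

def random_public_ipv4(rng):
    """Draw uniformly from the 32-bit address space, rejecting
    private, loopback, multicast and other non-global addresses."""
    while True:
        addr = ipaddress.IPv4Address(rng.getrandbits(32))
        if addr.is_global:
            return addr

# Hypothetical usage: a small reproducible sample of candidate hosts.
rng = random.Random(0)
sample = [random_public_ipv4(rng) for _ in range(5)]
for addr in sample:
    print(addr)
```

Each sampled address would then be probed on port 80 and the language of whatever home page comes back would be identified.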

You would think that UNESCO would do this kind of thing. But their recent paper "Twelve years of measuring linguistic diversity on the Internet: Balance and Perspective" by Daniel Pimienta, Daniel Prado, and Alvaro Blanco discusses only eight languages --- English, Spanish, French, Italian, Portuguese, Romanian, German, and Catalan --- and is more interested in scoring political points.

Web Search in Other Languages

Current research issues and trends in non-English Web searching Fotis Lazarinis, Jesús Vilares, John Tait and Efthimis N. Efthimiadis Information Retrieval Volume 12, Number 3, 230-250, DOI: 10.1007/s10791-009-9093-0

Document processing

Language identification: Easy to get 100% accuracy in distinguishing French from English. Difficult for (a) languages which are very similar; (b) languages where there is only a small corpus.

Alphabetic encoding: Very boring, very difficult. Multiple encodings for the same language. 95% of Indian language content is not searchable, largely for this reason. Many encodings are proprietary. In Greek, the upper case letters get confounded with the corresponding Latin upper case letters, which is fine, until you regularize to lower-case; "ΑΒΑΚΑΣ" gets regularized to "abakaσ" (mixture of Latin and Greek) instead of "αβακασ". File names tend to be in the Latin alphabet, because operating systems are less enlightened than search engines; this plays havoc with image search, which relies substantially on file names.
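One way to catch the Greek/Latin confusion before lower-casing is to check which scripts a token mixes. A minimal sketch, assuming that the first word of each character's Unicode name ("GREEK", "LATIN", ...) is a good enough proxy for its script; the function name is hypothetical.

```python
import unicodedata

def scripts_used(word):
    """Return the set of script names among the letters of a word,
    taken from the first word of each character's Unicode name."""
    scripts = set()
    for ch in word:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

# Latin A, B, A, K, A followed by a Greek capital sigma: looks like Greek.
lookalike = "ABAKA\u03a3"
# The same word typed entirely with Greek capitals.
greek = "\u0391\u0392\u0391\u039a\u0391\u03a3"

print(scripts_used(lookalike))
print(scripts_used(greek))
```

A token whose letters come from more than one script is a candidate for repair (or at least for indexing under both spellings).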

Just cutting and pasting "Jesús" from the bibliographic entry into this HTML file got into trouble; it came out as "JesÃºs".

Dealing with Unicode is apparently one of the major timesinks in international web service development of any kind.
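A common cause of this kind of garbling is text encoded as UTF-8 being read back as Latin-1. The sketch below reproduces it with the name from the bibliographic entry; it is an illustration of the mechanism, not a claim about what actually went wrong in this file.

```python
original = "Jes\u00fas"  # "Jesús"

# UTF-8 encodes "ú" as two bytes (0xC3 0xBA); reading those bytes
# as Latin-1 yields two separate characters.
garbled = original.encode("utf-8").decode("latin-1")

# If nothing else has touched the text, the damage is reversible.
repaired = garbled.encode("latin-1").decode("utf-8")

print(garbled)
print(repaired)
```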

Text segmentation: Discussed in lecture 2. Even in English, breaking at white space and punctuation is not a reliable rule. In other languages, much worse. (A) No white space between words (Chinese, other Asian languages). (B) Heavy use of compound words (Dutch, German, etc.). One alternative that has been suggested is just to give up on words altogether and allow search on arbitrary sequences of letters.
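Searching on letter sequences rather than words is usually done with character n-grams. A minimal sketch, assuming trigrams and a simple "all query n-grams occur in the document" test; the function names are hypothetical, and n-gram containment only selects candidates (it does not guarantee the query occurs as a contiguous string).

```python
def char_ngrams(text, n=3):
    """Set of character n-grams of the text, ignoring case and white space."""
    compact = "".join(text.split()).lower()
    return {compact[i:i + n] for i in range(len(compact) - n + 1)}

def may_contain(query, document, n=3):
    """True if every n-gram of the query also occurs in the document."""
    return char_ngrams(query, n) <= char_ngrams(document, n)

# A German compound: the component word is found without any word segmentation.
print(may_contain("Dampfschiff", "Donaudampfschifffahrt"))
print(may_contain("Dampflok", "Donaudampfschifffahrt"))
```

The same index works unchanged for text with no white space between words.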

Stopword list, stemming, regularization: Discussed in lecture 2. Of course each language presents its own issues, and the effectiveness of different strategies is variable.

Crawling: One could imagine that a crawler specialized for a particular minority language could do a better job than Google, but the presumption is mostly that that is not the case. Most efforts to collect corpora for minority languages piggyback off Google.

Queries

Google gives three ways of e.g. getting pages in German. Moreover, Google offers its interface in 145 languages (a few of these are not languages; e.g. "Elmer Fudd"). It has 193 "Local Interfaces" corresponding to country codes.

Lewandowski (2008), in "Problems with the use of web search engines to find results in foreign languages", states that if you use www.google.de, and restrict search to English language documents, only 46% of the results are actually in English. MSN search was worse, at 34%; Ask.com and Yahoo were fine, at 94% and 95%. Hard to understand why this should be so bad. I was not able to replicate this in a few tries.

Many studies of search in non-English languages. Two major questions:

Collecting Corpora for Languages

The Crubadan Project: Corpus building for under-resourced languages Kevin Scannell

Collected corpora for 416 living languages from the Web.

Crubadan 2.0: Status Page Kevin Scannell

Only a very small number (perhaps thirty) of the world's 6000+ living languages enjoy the benefits of modern language technologies such as speech recognition and machine translation. A slightly larger number (less than 100) have managed to assemble the basic resources needed as a foundation for advanced end-user technologies: monolingual and bilingual corpora, machine-readable dictionaries, thesauri, part-of-speech taggers, morphological analyzers, parsers etc.

Metadata
Name of language in English, ISO 639-3 code (standard 3-letter identifier for the language), polluting languages (e.g. Spanish is a polluter of Basque; English is a polluter by default).
Additional meta-data by screen scraping Ethnologue site.

Character set
Presumably there is a Unicode standard, but for many languages there are large numbers of non-standard usages (legacy or otherwise) -- see paper. Generally necessary to consult with speaker.

Training text:

Collect documents in the language from the Web: construct a search engine query with random words from the lexicon OR'd together, then ANDed with one stop word.

This often gives high precision in retrieving docs in the target language, e.g. 98% for Irish. Sometimes it is hard to find a good stopword that is not also common in other languages.

Download docs, convert to plain text UTF-8.

Apply language recognizer: cosine of the angle between vectors of character trigram counts.

If doc is in target language, then use as seed for crawler.
If doc is in nearby language, use as seed for crawler for that language.
Otherwise, discard.

Expand lexicon for further queries.

Continue until either (a) enough documents have been retrieved or (b) there are no new URLs and no new queries.
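The query construction and language recognition steps above can be sketched as follows. This is a toy illustration, not Scannell's code: the function names, the query syntax, the two one-sentence training texts, and the rule of routing by the single best-scoring language are all assumptions.

```python
import math
import random
from collections import Counter

def build_query(lexicon, stopword, k=4, rng=None):
    """OR together k random lexicon words, then AND the result with one stop word."""
    rng = rng or random.Random()
    words = rng.sample(sorted(lexicon), k)
    return "(" + " OR ".join(words) + ") AND " + stopword

def trigram_profile(text):
    """Counts of character trigrams, with white space normalised."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(p, q):
    """Cosine of the angle between two trigram count vectors."""
    dot = sum(count * q[gram] for gram, count in p.items())
    norm_p = math.sqrt(sum(c * c for c in p.values()))
    norm_q = math.sqrt(sum(c * c for c in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

def route(text, profiles, target, nearby):
    """Decide what to do with a downloaded document."""
    doc = trigram_profile(text)
    best = max(profiles, key=lambda lang: cosine(doc, profiles[lang]))
    if best == target:
        return "seed-target"      # seed the crawler for the target language
    if best in nearby:
        return "seed-" + best     # seed the crawler for the nearby language
    return "discard"

# Toy training text; a real recogniser needs far more text per language.
profiles = {
    "ga": trigram_profile("tá an aimsir go maith inniu agus tá an ghrian ag taitneamh"),
    "en": trigram_profile("the weather is good today and the sun is shining brightly"),
}
query = build_query({"aimsir", "maith", "inniu", "ghrian", "taitneamh"}, "agus",
                    k=3, rng=random.Random(1))
print(query)
print(route("tá an lá go maith", profiles, target="ga", nearby=set()))
```

A production recogniser would also need a similarity threshold, so that documents in a language with no profile are discarded rather than assigned to the least bad match.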

Issues

Applications

A Corpus Factory for many languages

Adam Kilgarriff et al. LREC 2010.

Gather large corpora for eight languages: Dutch, Hindi, Indonesian, Norwegian, Swedish, Telugu, Thai, Vietnamese.

Seed word selection: Get Wikipedia corpus. Filter stubs etc. Tokenize. Except for Vietnamese, exclude words of fewer than 5 letters (too likely to also be words of some other language). Use the top 1000 words as "high-frequency" words and the next 5000 words as "mid-frequency" words.
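The frequency-band selection can be sketched in a few lines. The function name and parameters are hypothetical; the defaults are the thresholds quoted above, and `min_len` would be lowered for Vietnamese.

```python
from collections import Counter

def frequency_bands(tokens, min_len=5, n_high=1000, n_mid=5000):
    """Rank words of at least min_len letters by frequency and split them
    into a high-frequency band and the following mid-frequency band."""
    counts = Counter(t.lower() for t in tokens if len(t) >= min_len)
    ranked = [word for word, _ in counts.most_common()]
    return ranked[:n_high], ranked[n_high:n_high + n_mid]

# Tiny example with the band sizes shrunk to fit.
tokens = ["language"] * 3 + ["corpus"] * 2 + ["the"] * 10 + ["crawler"]
high, mid = frequency_bands(tokens, n_high=1, n_mid=2)
print(high, mid)
```

Note that "the" is dropped despite being the most frequent token, because it is shorter than five letters.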

Cleaning the text: Again, a dull but major issue throughout this kind of work. Remove HTML markup, navigation bars, boilerplate. Boilerplate is identified as high ratio of tags to text at the beginning and end of a page (difficulties in the case of pages with boilerplate in the middle).

Eliminate word lists, identified as pages with too few function words. (Queries on 3 random mid-frequency words tend to turn them up.)
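The word-list filter amounts to a ratio test. A minimal sketch: the function names, the tiny English function-word set, and the 10% threshold are all assumptions for illustration, not values from the paper.

```python
def function_word_ratio(text, function_words):
    """Fraction of white-space tokens that are function words."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(t in function_words for t in tokens) / len(tokens)

def looks_like_word_list(text, function_words, threshold=0.1):
    """Flag pages whose function-word ratio falls below the threshold."""
    return function_word_ratio(text, function_words) < threshold

FUNCTION_WORDS = {"the", "a", "of", "on", "and", "to", "in"}
prose = "the cat sat on the mat and the dog slept"
word_list = "aardvark abacus abandon abbey abbot"
print(looks_like_word_list(prose, FUNCTION_WORDS))
print(looks_like_word_list(word_list, FUNCTION_WORDS))
```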

Eliminate duplicates and near duplicates, using the Perl Text::DeDuper module. Uses the shingle method, as discussed earlier in the semester.
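As a reminder of the shingle method, here is a minimal sketch in Python. It illustrates the idea only and is not the Text::DeDuper API; the shingle length of 3 words and the use of plain Jaccard similarity (rather than a sampled sketch) are assumptions.

```python
def shingles(text, k=3):
    """Set of all k-word sequences (shingles) in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b):
    """Size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox jumps over the lazy cat"
similarity = jaccard(shingles(doc1), shingles(doc2))
print(similarity)
```

Two pages are treated as near duplicates when the similarity exceeds some chosen threshold.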

Results

Language    URLs     Filtered  Dups Removed  MB    m Words  Wiki Corpus
Dutch       97,584   22,424    19,708        739   108.6    30.0
Hindi       71,613   20,051    13,321        424   30.6     2.5
Indonesian  78,402   28,987    27,051        708   102.0    8.5
Norwegian   258,009  66,299    62,691        628   94.9     19.1
Swedish     168,511  31,683    28,842        719   114.0    9.3
Telugu      37,864   6,178     5,131         107   3.4      0.2
Thai        120,324  23,320    20,998        1200  81.8     6.2
Vietnamese  106,076  27,728    19,646        1200  149.0    9.5

As compared to the Wiki corpus, the Web corpus is less "informational" and more "interactional" (personal pronouns "I" and "you" are much more frequent).

Google for the Linguist on a Budget

András Kornai and Péter Halácsy, LREC Web as Corpus Workshop Proceedings, 2008.

Nice brief account of a (comparatively) light-weight crawler and query-server. However, what is most interesting is the contrast in objectives with standard web search engines (i.e. those being used by people seeking information rather than linguistic phenomena):

System achieves 330 GBytes/day.

Collect a large corpus of documents in a language, for linguistic study, subject to the following constraints:

Collecting a parallel corpus (bitext)

The Web as a parallel corpus Philip Resnik and Noah A. Smith

Task: Collect pairs of web pages that are translations of one another in different languages.

Parent and sibling pages: "A parent page is one that contains hypertext links to different-language versions of a document."
Altavista query: (anchor:"english" OR anchor:"anglais") AND (anchor:"french" OR anchor:"francais"). Extract those in which these anchors are within 10 lines of one another.

"A sibling page is a page in one language that itself contains a link to a version of the same page in another language."
Request pages in French that match the expression anchor:"english" OR anchor:"anglais".

Candidate pages at promising sites: Having found a promising site --- e.g. from the search for parents and siblings or from known sites like government sites in Canada --- crawl the site. Then match URLs using manually constructed substitution rules. E.g. if you have an English-Chinese site and you have a URL
http://mysite.com/english/index.html then look for http://mysite.com/chinese/index.html.
If you have a URL http://mysite.com/index_en.html look for site http://mysite.com/index_ch.html. And so on.
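The URL matching step can be sketched with a small table of substitution rules. The two rules below are just the two examples above written as regular expressions; the function name is hypothetical, and a real system would carry a much longer hand-built rule list.

```python
import re

# (pattern, replacement) pairs for an English-Chinese site.
RULES = [
    (r"/english/", "/chinese/"),
    (r"_en\.html$", "_ch.html"),
]

def candidate_translations(url, rules=RULES):
    """Apply each substitution rule; return the rewritten URLs that differ."""
    candidates = []
    for pattern, replacement in rules:
        rewritten, n_subs = re.subn(pattern, replacement, url)
        if n_subs:
            candidates.append(rewritten)
    return candidates

print(candidate_translations("http://mysite.com/english/index.html"))
print(candidate_translations("http://mysite.com/index_en.html"))
```

Each candidate is then checked against the set of URLs actually found when crawling the site.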

Structural filtering: The presumption is that (a) when the text on the page is translated, the HTML structure will be left more or less the same; (b) the translations are of reasonably equal lengths.

Given a candidate pair of pages, linearize the HTML (matching strings is easier than matching trees) and use dynamic programming to find the optimal alignment. Characterize the quality of alignment in terms:

Results:
English/French: Precision=100%, Recall=68.6%. English/Chinese: Precision=98%, Recall=61%. English/Spanish: Precision=100%, Recall=60%
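The structural filtering idea (linearize the HTML, then align by dynamic programming) can be sketched as below. This is a simplified stand-in, not the system from the paper: it reduces every text chunk to a single TEXT token (so it ignores chunk lengths), aligns with a longest-common-subsequence computation, and reports one similarity score of my own choosing. All names are hypothetical.

```python
from html.parser import HTMLParser

class Linearizer(HTMLParser):
    """Flatten a page into a sequence of tag tokens and TEXT markers."""
    def __init__(self):
        super().__init__()
        self.tokens = []
    def handle_starttag(self, tag, attrs):
        self.tokens.append("<" + tag + ">")
    def handle_endtag(self, tag):
        self.tokens.append("</" + tag + ">")
    def handle_data(self, data):
        if data.strip():
            self.tokens.append("TEXT")

def linearize(html):
    parser = Linearizer()
    parser.feed(html)
    parser.close()
    return parser.tokens

def lcs_length(a, b):
    """Dynamic programming for the longest common subsequence length."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def structural_similarity(html_a, html_b):
    """Share of tokens that take part in the best alignment (0.0 to 1.0)."""
    a, b = linearize(html_a), linearize(html_b)
    if not a and not b:
        return 1.0
    return 2 * lcs_length(a, b) / (len(a) + len(b))

english = "<html><body><h1>Welcome</h1><p>Hello world</p></body></html>"
french = "<html><body><h1>Bienvenue</h1><p>Bonjour le monde</p></body></html>"
other = "<html><body><ul><li>x</li></ul></body></html>"
print(structural_similarity(english, french))
print(structural_similarity(english, other))
```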

Comparable Corpora

Focused web crawling in the acquisition of comparable corpora Tuomas Talvensaari et al., Information Retrieval 11:4, 2008.

Comparable corpus: Pairs of texts from different languages with similar vocabulary. E.g. News stories about the same event from different newspapers.

Weaker than parallel corpus, but attainable in specialized domains, which parallel corpora generally aren't.

PanDictionary

Panlingual Lexical Translation via Probabilistic Inference Mausam et al. Artificial Intelligence Journal 174, 2010.

Lexical translation with application to image search on the web Oren Etzioni et al.

Cross-Language Information Retrieval

Issue a query in one language to get results in another. There is a substantial literature on the subject. I don't understand why it is important.

The Linguist's Playground

The Web in Theoretical Linguistics Research: Two Case Studies using the Linguist's Search Engine Philip Resnik et al. (Offline temporarily as of 12/3/07).

Toy example: Look for sentences where "titrates" is used intransitively.

User inputs sample sentence "John titrates the solution".
Program generates and displays parse tree.
User edits parse tree. Program interprets it as the query (S NP (VP (VBZ titrates+VERB) (! NP)))
(I.e. "Look for any clause (S) with an NP followed by a VP where the verb is any form of "titrates" and where there is no NP as object.")

LSE has an existing corpus of several million parsed sentences.
Alternatively, the user can request that it:
search the web under "titrate",
extract all sentences containing "titrate",
parse and index these sentences,
return sentences satisfying the specification.

There are actually some: "You titrate with sulfuric acid using gloves and safety glasses" etc.
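To make the query concrete, here is a toy matcher over parse trees written as nested tuples. It is not the LSE query language or implementation: the tree encoding, the function name, and the use of an explicit set of verb forms in place of "any form of the verb" are all assumptions.

```python
def intransitive_use(clause, verb_forms):
    """True if the clause is an S with an NP followed by a VP whose verb
    is one of verb_forms and which has no NP as a direct child."""
    if clause[0] != "S":
        return False
    children = [c for c in clause[1:] if isinstance(c, tuple)]
    labels = [c[0] for c in children]
    for i, child in enumerate(children):
        if child[0] != "VP" or "NP" not in labels[:i]:
            continue
        kids = [k for k in child[1:] if isinstance(k, tuple)]
        has_verb = any(k[0].startswith("VB") and isinstance(k[1], str)
                       and k[1].lower() in verb_forms for k in kids)
        has_object = any(k[0] == "NP" for k in kids)
        if has_verb and not has_object:
            return True
    return False

FORMS = {"titrate", "titrates", "titrated", "titrating"}
transitive = ("S", ("NP", ("NNP", "John")),
              ("VP", ("VBZ", "titrates"),
               ("NP", ("DT", "the"), ("NN", "solution"))))
intransitive = ("S", ("NP", ("PRP", "You")),
                ("VP", ("VBP", "titrate"),
                 ("PP", ("IN", "with"),
                  ("NP", ("JJ", "sulfuric"), ("NN", "acid")))))
print(intransitive_use(transitive, FORMS))
print(intransitive_use(intransitive, FORMS))
```

The NP inside the prepositional phrase does not count as an object, since it is not a direct child of the VP.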

Linguistically interesting example 1: Comparative correlative with "then"

A "comparative correlative" is the construction like "The more pizza Romeo eats, the fatter he gets". This is thought to be problematic for Chomsky's theory of Universal Grammar --- don't ask me.

However the argument loses weight (I think --- I may be backwards on this) if examples of the construction can be found where the second part is introduced with "then". Some examples are indeed found on the web:

"The darker the coffee bean, then the less caffeine."
"The more playing time in the past, then the less regression to the mean needed"

Psycho-linguistically interesting example 2: For purposes of designing an experiment, it was desired to collect natural examples of sentences like "It [was/is/seemed ...] [clear/obvious/necessary ...] to [him / other NPs]".
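A crude flat-text approximation of this pattern is a regular expression. The sketch below uses only the verbs and adjectives named above (the elided alternatives are not filled in) and matches a single word where the pattern calls for an NP, which is exactly the limitation that searching over parse trees avoids.

```python
import re

PATTERN = re.compile(
    r"\bIt\s+(was|is|seemed)\s+(clear|obvious|necessary)\s+to\s+(\w+)"
)

match = PATTERN.search("It was obvious to him that the plan would fail.")
print(match.groups() if match else None)
```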

Googleology is bad science Adam Kilgarriff, ACL, 2006.