The Multi-Lingual Web

Ethnologue: Languages of the World, an online reference work cataloguing the world's languages.

Estimating presence

Google (2007) indexes 42 languages. AllTheWeb (2003) indexed 48.

Two estimates
1. Google search restricted to each language for pages not containing the nonsense string "xcvb" (12/2007).

(Note: Google page counts are a very unreliable measure.

There seems to be evidence that the counts returned by Google vary substantially over time due to changes made to the index and the database of the search engine and depending on which Google server is accessed. Also, it has been observed that the Boolean operators supported by Google return unexpected results. For example, the Boolean query "Chirac OR Sarkozy" returns a lower page count than the simple query "Chirac", contrary to the logic of the OR operator. The negation operator has a similar unexpected effect, for example, the query "applesauce -aosdnao" returns more hits than "applesauce" on its own.
-- Web-based Models for Natural Language Processing, Mirella Lapata and Frank Keller

I verified both of these on Google 12/2/07. "Sarkozy" reports 5,240,000 pages, "Sarkozy OR Chirac" reports 4,480,000 pages.
"Applesauce" reports 287,000 pages, "applesauce -aosdnao" reports 4,370,000 pages.)

2. Estimate by Mas, based on AllTheWeb (2003), cited in Estimating Linguistic Diversity on the Internet by Peter Gerrand.

All counts in thousands of pages

Language Google count Google % Mas count Mas %
Afrikaans --- --- 116 0.005
Albanian --- --- 53 0.003
Arabic 6,390 0.25 2,470 0.12
Armenian 1,770 0.072 --- ---
Basque --- --- 155 0.007
Belarusian 2,720 0.10 536 0.025
Bulgarian 2,820 0.11 1,120 0.053
Catalan 785 0.031 2,930 0.14
Chinese 353,000 14.5 65,700 3.1
Croatian 2,520 0.09 1,670 0.08
Czech 31,700 1.2 15,600 0.73
Danish 7,470 0.29 12,100 0.57
Dutch 23,600 0.97 41,100 1.9
English 1,430,000 58.6 1,280,000 60.4
Esperanto 357 0.014 --- ---
Estonian 875 0.035 1,460 0.069
Faroese --- --- 66 0.003
Filipino 382 0.015 --- ---
Finnish 5,610 0.22 5,680 0.27
French 65,500 2.68 99,700 4.7
Frisian --- --- 63 0.003
Galician --- --- 274 0.013
German 124,000 5.1 182,000 8.6
Greek 2,990 0.11 2,370 0.11
Hebrew 3,220 0.12 4,790 0.23
Hungarian 16,700 0.69 8,540 0.40
Icelandic 518 0.020 1,390 0.066
Indonesian 2,620 0.10 1,040 0.049
Italian 23,100 0.95 41,800 1.98
Japanese 96,400 4.0 69,700 3.3
Korean 16,300 0.67 64,600 3.0
Latin --- --- 137 0.006
Latvian 854 0.035 560 0.026
Lithuanian 1,860 0.077 1,080 0.053
Malay --- --- 348 0.015
Norwegian 4,610 0.18 8,120 0.383
Persian 1,110 0.044 --- ---
Polish 64,600 2.6 22,200 1.0
Portuguese 5,860 0.25 37,700 1.78
Romanian 6,740 0.27 2,000 0.097
Russian 47,600 1.9 42,300 2.0
Serbian 1,030 0.041 43 0.002
Slovak 4,380 0.17 5,080 0.24
Slovenian 983 0.040 1,690 0.08
Spanish 37,900 1.6 65,800 3.1
Swahili --- --- 14 0.001
Swedish 9,430 0.38 14,900 0.7
Thai 3,610 0.14 3,120 0.15
Turkish 23,500 0.97 4,700 0.22
Ukrainian 990 0.040 1,010 0.048
Vietnamese 2,610 0.10 390 0.018
Welsh --- --- 93 0.004
Total 2,437,000 2,118,500

Note that the Google total is nothing like the number of pages Google presumably actually indexes (at least 8 billion).

Quite a number of languages drop in the total number of pages retrieved between the 2003 (Mas) and 2007 (Google) counts. Particularly startling are Korean, which drops by a factor of 4; Norwegian, which drops by a factor of 2; German, which drops from 182 million to 124 million pages; and French, which drops from 99.7 million to 65.5 million. Presumably none of these drops are real, but one wonders what, if anything, they correspond to, and what the real percentages are.

Some languages increase startlingly; e.g. Serbian increases by a factor of more than 20, and Vietnamese and Turkish increase by factors of roughly 7 and 5 respectively. It is hard to know whether this is real or artifactual.

Of the 16 languages with the largest speaker populations (World Almanac), 5 are not on this list:
Hindi (180,000,000 speakers --- Crubadan records 3 documents),
Bengali (170,000,000 speakers --- 65 documents),
Javanese (75,000,000 speakers --- 64 documents),
Telugu (69,000,000 speakers --- 5 documents),
Marathi (68,000,000 speakers --- 24 documents).

Collecting Corpora for Languages

The Crubadan Project: Corpus building for under-resourced languages Kevin Scannell

Collected corpora for 416 living languages from the Web.

Crubadan 2.0: Status Page Kevin Scannell

Only a very small number (perhaps thirty) of the world's 6000+ living languages enjoy the benefits of modern language technologies such as speech recognition and machine translation. A slightly larger number (less than 100) have managed to assemble the basic resources needed as a foundation for advanced end-user technologies: monolingual and bilingual corpora, machine-readable dictionaries, thesauri, part-of-speech taggers, morphological analyzers, parsers etc.

Metadata
Name of language in English, the ISO 639-3 code (the standard three-letter identifier for the language), and "polluting" languages (e.g. Spanish is a polluter of Basque; English is a polluter by default).
Additional metadata is obtained by screen-scraping the Ethnologue site.

Character set
There is presumably a Unicode standard, but for many languages there are large numbers of non-standard encodings (legacy or otherwise) -- see the paper. It is generally necessary to consult with a native speaker.

Training text:

Collect documents in the language from the Web. Construct a search-engine query with random words from the lexicon OR'd together, then AND'd with one stop word.

This often gives high precision in retrieving documents in the target language, e.g. 98% for Irish. Sometimes it is hard to find a good stop word that is not also common in other languages.
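
A minimal sketch of this query-construction step in Python (the word list, the Irish words, and the stop word "agus" below are illustrative assumptions, not taken from the paper):

    import random

    def build_query(lexicon, stop_word, n_words=6, seed=None):
        # A few random lexicon words OR'd together, then AND'd with one stop word.
        rng = random.Random(seed)
        words = rng.sample(lexicon, min(n_words, len(lexicon)))
        disjunction = " OR ".join(f'"{w}"' for w in words)
        return f'({disjunction}) AND "{stop_word}"'

    # Hypothetical Irish example:
    lexicon = ["teanga", "leabhar", "scoil", "amhrán", "oíche", "litir", "focal"]
    print(build_query(lexicon, stop_word="agus", seed=1))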

Download the documents and convert them to plain UTF-8 text.

Apply a language recognizer: compute the cosine of the angle between the character-trigram frequency vector of the document and that of each candidate language (a rough sketch follows the list below).

If doc is in target language, then use as seed for crawler.
If doc is in nearby language, use as seed for crawler for that language.
Otherwise, discard.
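
A rough sketch of the recognizer and the decision above (it assumes per-language trigram "profiles" built from seed text; the 0.3 threshold is an arbitrary illustrative value):

    from collections import Counter
    from math import sqrt

    def trigram_vector(text):
        # Character-trigram frequency vector of a text.
        text = " " + text.lower() + " "
        return Counter(text[i:i + 3] for i in range(len(text) - 2))

    def cosine(u, v):
        # Cosine of the angle between two sparse frequency vectors.
        dot = sum(u[t] * v[t] for t in u if t in v)
        norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
        return dot / norm if norm else 0.0

    def classify(doc_text, profiles, threshold=0.3):
        # profiles: language name -> trigram vector of seed text in that language.
        # Returns the best-matching language, or None (discard the document).
        doc_vec = trigram_vector(doc_text)
        best_lang = max(profiles, key=lambda lang: cosine(doc_vec, profiles[lang]))
        return best_lang if cosine(doc_vec, profiles[best_lang]) >= threshold else None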

Expand the lexicon with words from the newly retrieved documents, to generate further queries.

Continue until either (a) enough documents have been retrieved or (b) there are no new URLs and no new queries.

Issues

Applications

Non-English queries

How do search engines handle non-English queries? A case study Judit Bar-Ilan and Tatyana Gutman (2003)

Examined query handling for Russian, French, Hungarian, and Hebrew -- both the interfaces of general-purpose search engines and language-specific search engines.

Character set. Different alphabets (Russian, Hebrew). Diacritical marks and accents in French and Hungarian are sometimes omitted. Chinese, Japanese, etc. present very complex, entirely language-specific problems.

Inflection. E.g. a Russian verb may have up to 250 different forms.

Prefixes. E.g. "l'" in French, "ha" in Hebrew. Hebrew allows up to 3 prefixes in a row, e.g. "v'-she-ha-universita" = "and that the university".

Vowels. Printed Hebrew generally omits them. Therefore there are many more written homographs (e.g. midbar vs. m'daber).

Confusion in mapping uppercase letters to lowercase terminal letter forms. (Web retrieval systems and the Greek language: do they have an understanding?, Fotis Lazarinis)

Results
Russian: Language specific search engines and AllTheWeb worked well, much better than Google or AltaVista.
Hungarian: Language specific search engines do significantly better than any of the general search engines.
French and Hebrew: Both language-specific and general search engines are lousy.

In general:
Lots of inconsistent and inexplicable results. Descriptions in "help" are unreliable.

Conclusion: An unsatisfactory state of affairs. Underdeveloped technology, understudied issue.

Cross-Language Information Retrieval

Technical issues of cross-language retrieval: a review Kazuaki Kishida

Matching strategies

Translation techniques

Disambiguation of words in query using the other words in query

Language processing issues:
Same as for foreign-language queries, except that now you have to worry about the fit between the languages.

Language fit issues
Not mentioned by Kishida:

Pseudo-relevance feedback

Brief background on relevance feedback

[Standard] relevance feedback: The user issues a query, gets documents back, and marks some of them as relevant (or is observed clicking through to some of them). Words that are common in the relevant documents are added to the query, and the modified query is reissued.

In a Boolean model, OR'ing them improves recall and AND'ing them improves precision. In a vector model adding these tends to improve both recall (because a new relevant document may match words in the marked documents which were not in the query, such as synonyms) and precision (because words associated with unintended meanings of the query words, or unintended classes of documents will tend to be disfavored).

[Ordinary] pseudo-relevance feedback or blind relevance feedback: There is no actual feedback from the user; rather, the common words in all the retrieved documents are added to the query and the query is reissued.

This can also often improve retrieval. Suppose that 50% of the documents retrieved are relevant to the intended subject and 50% are irrelevant, but deal with a variety of different subjects. (This is particularly apt to happen with multi-word queries.) Then the words that are common in many of the retrieved documents will be those that are actually relevant to the intended subject, and again, in a vector model, both recall and precision will go up in the modified query.
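
A toy sketch of blind relevance feedback (simple document-frequency counting over whitespace-tokenized text; a real system would use the retrieval model's own term weights):

    from collections import Counter

    def expand_query(query_terms, retrieved_docs, n_new_terms=5, stopwords=frozenset()):
        # Add the terms that appear in the most retrieved documents to the query.
        doc_freq = Counter()
        for doc in retrieved_docs:
            doc_freq.update(set(doc.lower().split()) - stopwords - set(query_terms))
        new_terms = [term for term, _ in doc_freq.most_common(n_new_terms)]
        return list(query_terms) + new_terms

    # The expanded query is then reissued to the same engine.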

End of background.

In the CLIR context there are two forms of pseudo-relevance feedback: pre-translation (i.e. in the query language) and post-translation (i.e. in the target language).

Pre-translation: Issue the query to a search engine in the query language (note that this assumes that a search engine and a corpus exist for the query language, which is not necessarily the case in CLIR), collect common query-language terms from the retrieved documents, add them to the query, and then translate.

Post-translation: Translate query into target language, proceed with pseudo-relevance feedback in target language as usual.
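
Schematically, the two variants differ only in where the feedback loop sits; in the sketch below, search, translate, and expand are caller-supplied functions (hypothetical names, not an API from Kishida's review):

    def pre_translation_feedback(query_terms, search, translate, expand, src, tgt):
        # Expand in the source language (requires an engine and corpus for it),
        # then translate the expanded query into the target language.
        docs = search(query_terms, src)
        return translate(expand(query_terms, docs), src, tgt)

    def post_translation_feedback(query_terms, search, translate, expand, src, tgt):
        # Translate first, then expand in the target language as usual.
        translated = translate(query_terms, src, tgt)
        docs = search(translated, tgt)
        return expand(translated, docs)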

Pivot languages. If there are no translation resources from A to B but there are resources from A to C and from C to B, then translate query from A to C and then from C to B. Problem: Ambiguity and other translation difficulties multiply.
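
A toy illustration of how candidates multiply under pivoting, using two made-up term dictionaries (Dutch "bank" pivoted through English into Spanish):

    def pivot_translate(term, dict_a_to_c, dict_c_to_b):
        # Translate a term from language A to language B through pivot language C.
        candidates = set()
        for c_term in dict_a_to_c.get(term, []):
            candidates.update(dict_c_to_b.get(c_term, []))
        return candidates

    # Made-up entries: the single Dutch word "bank" fans out through English
    # into several Spanish candidates, only one of which is likely intended.
    a_to_c = {"bank": ["bank", "bench", "sofa"]}
    c_to_b = {"bank": ["banco", "orilla"], "bench": ["banco"], "sofa": ["sofá"]}
    print(pivot_translate("bank", a_to_c, c_to_b))   # {'banco', 'orilla', 'sofá'}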

Multi-language retrieval: Given query Q, retrieve relevant documents in any of languages L1 ... Lk. Do separate queries in each language and combine. The combination problem is analogous to the one in meta-search, except that here the differing information (or assumptions) concerns the comparative features of the corpora and the searching in the different languages, rather than the features of different search engines.
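
One possible combination scheme, sketched below: normalize scores within each language's result list and apply per-language weights that encode those differing assumptions (the scheme is an assumption of this sketch, not a method prescribed by Kishida):

    def merge_result_lists(ranked_lists, weights=None):
        # ranked_lists: language -> list of (doc_id, score) pairs.
        # weights: optional per-language weights reflecting assumptions about
        # corpus coverage or search quality in each language.
        weights = weights or {lang: 1.0 for lang in ranked_lists}
        merged = []
        for lang, results in ranked_lists.items():
            if not results:
                continue
            top = max(score for _, score in results) or 1.0
            for doc_id, score in results:
                merged.append((weights[lang] * score / top, lang, doc_id))
        return sorted(merged, reverse=True)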

One final very awkward question. The answer to the question "Why have the major search engines shown so little interest in CLIR?" is another question, "Why does a user want documents in a language he can't read?"

Collecting a parallel corpus (bitext)

The Web as a parallel corpus Philip Resnik and Noah A. Smith

Task: Collect pairs of web pages that are translations of one another in different languages.

Parent and sibling pages "A parent page is one that contains hypertext links to different-language versions of a document."
AltaVista query: (anchor:"english" OR anchor:"anglais") AND (anchor:"french" OR anchor:"francais"). Extract those in which these anchors are within 10 lines of one another.

"A sibling page is a page in one language that itself contains a link to a version of the same page in another language.
Request pages in French that match the expression "anchor:"english" OR anchor:"anglais".

Candidate pages at promising sites Having found a promising site --- e.g. from the search for parents and siblings, or from known sites such as Canadian government sites --- crawl the site. Then match URLs using manually constructed substitution rules. E.g. if you have an English-Chinese site and you have a URL
http://mysite.com/english/index.html then look for http://mysite.com/chinese/index.html.
If you have a URL http://mysite.com/index_en.html look for site http://mysite.com/index_ch.html. And so on.
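
A sketch of such substitution rules as regular-expression rewrites (the two rules shown simply mirror the examples above; a real list would be built by hand per site or language pair):

    import re

    SUBSTITUTION_RULES = [
        (r"/english/", "/chinese/"),
        (r"_en\.html$", "_ch.html"),
    ]

    def candidate_translations(url):
        # Generate candidate URLs for the translated version of a page.
        candidates = set()
        for pattern, replacement in SUBSTITUTION_RULES:
            new_url = re.sub(pattern, replacement, url)
            if new_url != url:
                candidates.add(new_url)
        return candidates

    print(candidate_translations("http://mysite.com/english/index.html"))
    # {'http://mysite.com/chinese/index.html'}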

Structural filtering The presumption is that (a) when the text on the page is translated, the HTML structure will be left more or less the same, and (b) the translations are of reasonably equal lengths.

Given a candidate pair of pages, linearize the HTML (matching strings is easier than matching trees) and use dynamic programming to find the optimal alignment. Characterize the quality of the alignment in terms of:
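
A rough sketch of the idea (the paper uses a dynamic-programming alignment of the linearized pages and its own quality measures; here difflib's SequenceMatcher stands in as a simple structural-similarity score):

    import re
    from difflib import SequenceMatcher

    def linearize(html):
        # Flatten HTML into a sequence of tag tokens and bucketed text lengths,
        # so that pages in different languages can still be compared on structure.
        tokens = []
        for tag, text in re.findall(r"<(/?\w+)[^>]*>|([^<]+)", html):
            if tag:
                tokens.append(tag.lower())
            elif text.strip():
                tokens.append(f"TEXT~{len(text.strip()) // 20}")
        return tokens

    def structural_similarity(html_a, html_b):
        # 1.0 means identical linearized structure; values near 0 suggest
        # the two pages are not mutual translations.
        return SequenceMatcher(None, linearize(html_a), linearize(html_b)).ratio()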

Results:
English/French: Precision=100%, Recall=68.6%. English/Chinese: Precision=98%, Recall=61%. English/Spanish: Precision=100%, Recall=60%

The Linguist's Playground

The Web in Theoretical Linguistics Research: Two Case Studies using the Linguist's Search Engine Philip Resnik et al. (Offline temporarily as of 12/3/07).

Toy example: Look for sentences where "titrates" is used intransitively.

User inputs sample sentence "John titrates the solution".
Program generates and displays parse tree.
User edits the parse tree. The program interprets the edited tree as the query (S NP (VP (VBZ titrates+VERB) (! NP)))
(I.e. "Look for any clause (S) with an NP followed by a VP where the verb is any form of "titrate" and where there is no NP as object.")

LSE has an existing corpus of several million parsed sentences.
Alternatively, the user can request that it:
search the web under "titrate",
extract all sentences containing "titrate",
parse and index these sentences,
return sentences satisfying the specification.

There are actually some: "You titrate with sulfuric acid using gloves and safety glasses" etc.

Linguistically interesting example 1: Comparative correlative with "then"

A "comparative correlative" is the construction like "The more pizza Romeo eats, the fatter he gets". This is thought to be problematic for Chomsky's theory of Universal Grammar --- don't ask me.

However, the argument loses force (I think --- I may have this backwards) if examples of the construction can be found where the second part is introduced with "then". Some examples are indeed found on the web:

"The darker the coffee bean, then the less caffeine."
"The more playing time in the past, then the less regression to the mean needed"

Psycho-linguistically interesting example 2: For purposes of designing an experiment, it was desired to collect natural examples of sentences like "It [was/is/seemed ...] [clear/obvious/necessary ...] to [him / other NPs]".