Review of Special Issue of Artificial Intelligence Journal on AI and Wikipedia, Vol. 194, No. 1. Edited by E. Hovy, R. Navigli, and S.P. Ponzetto. Review #CR140920, Computing Reviews, Feb. 8, 2013.

Wikipedia is one of the noblest manifestations of the spirit of the Information Age. It is also one of the most astonishing. If you had predicted in 1985 that in twenty years a significant fraction of the world's population would have at their fingertips access to a free, high-quality, multilingual encyclopedia, of a size dwarfing the Britannica, built largely by volunteer labor, with no institutional support and no advertisements, this would have seemed much more far-fetched than the existence of self-driving cars or voice-activated personal assistants.

Wikipedia has also become a major resource for artificial intelligence (AI) research of all kinds, because it combines a number of very useful features.

  1. It is an immense, high quality, multi-topic corpus.
  2. The articles contain explicit presentations of basic information (in contrast, for example, to corpora of news articles, which almost always assume this information).
  3. It is semi-structured. First, the corpus is divided into articles, each of which deals with one specific concept. Second, there exist a number of different features, including hyperlinks, infoboxes, and category pages, that present information in a form that is more nearly standardized, and therefore much easier for automated systems to interpret, than free text. This feature is the focus of the collection under review.
  4. It is multilingual; there are Wikipedias in almost 300 different languages. In many languages that are underrepresented on the Web in general, Wikipedia is one of the largest online high-quality corpora.
  5. It was created, and is maintained, by a vast number of users with minimal knowledge of computer technology, in contrast to hand-crafted ontologies and knowledge bases such as CYC [2], which are built slowly by expensive experts.

Taking advantage of these features, AI researchers have used Wikipedia as a data resource for a wide range of applications including semantic relatedness, disambiguation, co-reference resolution, metonymy resolution, query expansion, multilingual retrieval, question answering, entity ranking, text categorization, and ontology and knowledge-base construction. There are now standard AI tasks that are defined in terms of Wikipedia, such as wikification (associating terms in a text with the corresponding Wikipedia article) and the automated construction of infoboxes. A couple of knowledge bases that have been automatically built from Wikipedia, particularly YAGO [4] and DBPedia [1], are now themselves widely used tools.

Medelyan et al. [3] is an exceptionally comprehensive and well-written survey of work in this area up to 2008. Its bibliography includes about 150 research papers. Readers looking for an introduction to the area should certainly start there.

The papers in the collection under review discuss a variety of more recent projects. As in most such collections, the papers are uneven in quality and readability; it can be difficult for a non-specialist, such as this reviewer, to extract a big-picture sense of what has been accomplished and what the overarching issues are. (Regrettably, some of the major research groups in this area are not represented, including Etzioni's group at U. Washington.)

Three papers in the collection seemed to me particularly fine. The overview, "Collaboratively built semi-structured content and Artificial Intelligence: The story so far" by the editors, is an excellent complement and update to Medelyan et al. [3], including both summaries of the papers in this collection and a general survey of work in the area. It presents a series of "take-home messages", high-level conclusions that provide guidance for future research. "YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia" by J. Hoffart et al. shows how the knowledge base YAGO has been extended with spatial and temporal information derived from infoboxes. Particularly noteworthy is an appendix containing a collection of questions posed in English; the translation of each question into the query language of YAGO; and the quality of the result obtained. "An open-source toolkit for mining Wikipedia" by D. Milne and I.H. Witten describes the toolkit Wikipedia Miner, which appears to be a very valuable resource.

Overall the collection is of very great value, both to researchers in the area and to readers with a general interest in AI. The editors and authors can be congratulated on an important contribution to the literature on this promising, highly active area of research.


[1] S. Auer et al., "DBPedia: A nucleus for a web of open data," International Semantic Web Conference, 2007, 722-735.

[2] D. Lenat, M. Prakash, and M. Shepherd, "CYC: Using common sense knowledge to overcome brittleness and knowledge acquisition bottlenecks," AI Magazine, 6(4), 1985, 65-85.

[3] O. Medelyan et al., "Mining meaning from Wikipedia," International Journal of Human-Computer Studies, 67 (2009), 716-754.

[4] F.M. Suchanek, G. Kasneci, and G. Weikum, "YAGO: A core of semantic knowledge," WWW 2007, 697-706.