Review of Language Technology for Cultural Heritage: Selected Papers from the LaTeCH Workshop Series, eds. Caroline Sporleder, Antal van den Bosch, and Kalliopi Zervanou, Springer 2011. Review #139550 in Computing Reviews, Nov. 2, 2011.
What is the value of computer technology to the humanities scholar studying texts? First and foremost, especially since the advent of the Web, access to an immense and ever-increasing quantity of digitized primary and secondary material. Second, search over text and metadata, and browsing, both within collections and across the Web. Without getting up from her desk, a scholar can find articles that would earlier have taken weeks of library work to locate, or examine manuscripts in museums thousands of miles away.
Third, there is computational analysis over document collections. The range of purposes for which this is useful is incomparably narrower than the first two functionalities, though where it applies, it can be a very powerful tool. The central limitation is that, in the current state of the art and for the foreseeable future, reliable information can be obtained only about formal aspects of a text, not about its meaning. Thus, one can use natural language processing (NLP) tools to extract valuable statistical information about linguistic usages, especially if there is a corpus of documents whose linguistic structure is tagged. One can attribute authorship of a text using statistical stylistic criteria. And so on. (Historians can do statistical study of historical data, but that is more akin to a social science analysis.) But these kinds of analysis have little value for most scholars of literature, history, philosophy, religion, etc., and almost no value for non-scholarly common readers of the texts.
Thus, by far the most important accomplishments in applying computers to humanistic studies of text have been in access and search. The applications of cutting-edge computer science research to these purposes are limited. I am not at all saying that the goal of digitizing a document collection and constructing a high-quality web site for it is now attainable with off-the-shelf solutions; manifestly, it is not. However, reaching this goal is generally less a question of advanced computer technology than of thoughtful design involving the cooperation of domain specialists, librarian specialists, and computer people. Great web sites, like JSTOR or the Cambridge Shahnama site, do not, as far as I can tell, rely on especially powerful computer techniques; they are just superbly well designed.
As regards word-based search, the problem is that the individual researcher cannot compete with Google and Bing; it is almost always more effective to use an advanced Google search restricted to a site than to use the search engine provided by the site. (It may be possible to surpass Google for search in languages other than English; mixed results are reported CITATIONS.) The one aspect of developing web sites for document collections that is really fruitful for technical computer science research is using optical character recognition (OCR) and handwriting recognition to convert the documents to digital form.
Overall, my feeling is that, if your objective is the electronic dissemination, preservation, and enhancement of texts as "cultural heritage", then your budget is best spent by, first, putting a lot of content online, in searchable format, with reliably accurate metadata; and second, creating high-quality web sites for important collections.
For this reason, most of the papers in the collection under review are either unconvincing or not about cultural heritage. The volume begins with a rather Delphic foreword by Willard McCarty, and a clear and helpful introduction by the editors. There are two papers on preprocessing: one on using OCR on a collection of 19th century mountaineering yearbooks, and one on automatic alignment of images of manuscripts with their transcripts. These are legitimately text-focussed projects; they make the texts more available, or more usable.
The remaining papers deal with applications of NLP tools to text collections. The best of these are explicitly addressed to the interests of language specialists and linguists; the texts are primarily raw materials for linguistic analysis. For example, Borin and Forsberg are working on a diachronic lexical resource for 800 years of Swedish. Bamman and Crane are creating a treebank (a corpus of texts with linguistic tagging) for classical Greek and Latin literature.
The least convincing papers are those that hope to use NLP tools to gain insight into content. For example, Reiter et al. propose to apply NLP tools to a corpus of documents dealing with religious rituals, both descriptive (by outsiders) and prescriptive (by priests). There is no reason to suppose that this is a particularly meaningful or interesting collection from the linguistic point of view, and even less reason to suppose that applying NLP tools to these texts will reveal anything important about the nature of rituals.
Finally, let me cry shame on the publisher (Springer): the last three papers are printed in 8 point type, while the rest are printed in 10 point type. I have rarely seen anything so sloppy in a printed book.
Acknowledgements: Thanks to Deena Engel and Emily Morton-Owens for helpful suggestions.