UpCASE (Upload, Correct, Annotate, Search and Export) is an open source web application 1 that provides researchers and lay people a wide range of options for viewing, evaluating, editing and enriching text collections. It is conceived as a multifunction web application where users can upload text documents in different formats, including scanned texts whose characters are automatically recognized by Optical Character Recognition (OCR). While being able to search the uploaded text, users can correct, annotate and also export it into different formats.
UpCASE is the result of years of work with Romansh texts and lexical resources. Romansh 2 is the smallest of the four national languages in Switzerland with approximately 50.000 native speakers (Furer, 2005). Despite several actions by official organizations, e.g. the ratification of the European Charter for Regional and Minority Languages, 3 Romansh is on a continuous retreat and is by now considered an endangered language. The main motivation of our work was to create a suite of tools to support the Romansh language community in building and maintaining language resources for Romansh. However, UpCASE is not restricted to be used within this particular context. 4 Our interest here is to show the relevance of using tools like UpCASE for small languages in general, since in the European Union alone there are more than 100 languages, many of them having little or no support by official institutions. 5
We present UpCASE together with a specific historical text collection, namely the Romansh Chrestomathy (RC) compiled by Caspar Decurtins (Decurtins, 1888-1919). The RC comprises texts from four centuries reflecting the different idioms of Romansh 6 and is recognized as the most important historical text collection of the Romansh language (cf. Egloff and Mathieu, 1986: 7). It contains approximately 7500 pages covering a wide range of different topics, text types and genres, and therefore is an excellent basis for the compilation of a text corpus. All in all, the RC can be seen as a monument for language, speakers, and culture of Romansh in Switzerland, and as such constitutes an exception for small linguistic and cultural communities (Rolshoven, 2012).
The RC text corpus was created in two successive projects funded by the German Research Fund (DFG). In the first project, the RC was digitized and its characters recognized by OCR. The main objective was to correct the OCR output to provide a digital full text version of the RC. Due to the characteristics of the RC as a multilingual historical text collection of a small language, with varying orthographical standards and almost no digital lexical resources available, the correction of OCR errors could not be solved in a fully automatic manner (Rolshoven, 2012). Instead, we implemented a web-based editing tool allowing native speakers to participate in the task of OCR correction. 7 In a follow-up project the corrected texts were enriched with part-of-speech (POS) tags. As in the treatment of OCR errors, a fully automatic POS-tagging procedure could not be applied to the RC (Neuefeind, 2013). First we compiled a lexical resource by digitizing lexica and generating inflected word forms. On this basis, we approached the linguistic annotation with a semi-automatic procedure, combining lexical lookup (resulting in mostly ambiguous tags) with manual correction and supervision, thus adapting the collaborative methodology from the DRC-project (Mondaca and Atanassov, 2016).
UpCASE brings together the experiences of both projects, combining the key features of collaborative corpus construction, enrichment and maintenance in a single web application. While existing tools mostly focus on a particular use case like collaborative correction (e.g. Wikisource 8) or collaborative annotation (e.g. WebAnno 9), we connect these functionalities in a full workflow from raw to enriched text. UpCASE is based on several state-of-the-art web technologies such as Spring WebMVC, Spring Data, JAXB, and JQuery. The data is persisted in a document-oriented NoSQL database (MongoDB), whose records are structurally similar to JSON objects. The use of JSON enables a straightforward communication with other resources. Using predefined REST (Representational State Transfer) interfaces, distributed language systems may be used for data enrichment, without the need of complex data conversion. 10
Using robust and scalable software on server side, and lightweight, clean and interactive components on client side, UpCASE offers different views in order to improve its usability. There are options to treat the collection as a whole, e.g. for searching, statistics and exporting, or to modify the data at hand, e.g. to edit, annotate or enrich. After importing text documents (or scanned images of texts which are OCR’ed), the text is indexed with Lucene and made accessible through an editable directory tree together with a full text search access. The stats view offers some basic statistical information about the text collection. In the export view, the user can choose different formats, e.g. plaintext or XML, to export the whole collection or parts of it. At the document level, each token is represented by a clickable widget, which opens a modal window containing different views – depending on user rights – associated with specific functions, e.g. editing, correction or annotation. The edit view allows the user to modify the text, e.g. to correct errors produced in the OCR process. The view contains both the editable word form and the relevant part of the scanned image with its highlighted position. The annotation view allows the user to create annotations like POS-tags on the fly, thus allowing complex searches on the search view.
Our presentation gives an overview of UpCASE and its basic functions, focusing on features for corpus maintenance, extension and enrichment. While in the first place we present a Romansh language resource, the concepts and features of this use case can be transposed to other text collections and languages. Thus UpCASE can be seen as an approach to technologically and methodologically support the preservation of the cultural heritage of regional and minority languages.