This paper presents initial findings of the research project Making Sense of Illustrated Handwritten Archives and demonstrates the recognition capabilities of the MONK artificial intelligence system developed since the early 2000s at the institute of Artificial Intelligence and Cognitive Engineering (ALICE) at Groningen University (Van der Zant et al., 2008; Van Oosten and Schomaker, 2014a, Van Oosten and Schomaker, 2014b). In a period of four years (Q1 2016 – Q1 2020), our research project aims to produce an innovative and user-friendly online environment that combines both image and textual recognition, and allows an integrated study of fragmented historical heritage collections. Next to a short demonstration of MONK, we use this paper to outline how handwriting and image recognition helps to increase the value of illustrated handwritten collections by enriching and linking information that is now inaccessible and disconnected. By doing so it advances the state of the art in automated extraction, classification, and networking of knowledge from heterogeneous manuscript collections.
The MONK system uses shape-based feature vector methods that have very few assumptions concerning the content or style of the material. It avoids the traditional OCR approach (optical character recognition) which assumes that individual characters are essentially legible. That assumption only holds for a tiny fraction of handwritten material and a limited number of scripts. The only assumptions MONK makes are that pictorial and textual segments are separated by white spaces; and that the layout, of underlining, etc. in a specific document, is consistent throughout the document. In MONK, classification methods are used that allow for a fast bootstrap from single example instances (nearest-neighbor search) (Gast et al., 2013). With larger numbers of labeled examples, models can be computed, varying from nearest-centroids to support-vector machines and neural networks in a continuous learning process (Krizhevsky et al., 2012; Liu et al, 2015; Guo, in press). A challenging topic from the technical point of view is the relation between existing semantic knowledge (ontologies) and the statistically inferred semantics using Google’s word2vec and current deep-learning neural networks. Can the underlying structure and style in a collection of a common and realistic size be detected by such algorithms? Can the proposed enrichment system profit from generally available text corpora? The processing power required by the proposed architecture is substantial. For this project, algorithmic optimization of the image processing and recognition system is necessary in order to create the necessary speed and flexibility of the system for use by non-technical end users. In order to tackle this challenge the consortium will make use of the combined knowledge and expertise of ALICE in Groningen, and the Leiden Institute of Advanced Computer Science (LIACS), where multiple supercomputers and high performance computing experts are present.
Because of its visual approach, MONK can handle the diversity of material that we encounter in our use case and in historical collections in general: text, drawings, and images. MONK also does not require a language model nor fully transcribed samples to quickly assess the contents of an archive page. The human-in-the-loop approach of MONK is currently ‘label’ oriented, but will be enhanced by providing the user and the system with ontologies for bootstrapping the learning process. The system will understand handwritten corpora to such an extent that the visual and textual content on individual pages is categorized, determined and networked to other pages in the archive and external sources. To construct training examples for MONK, biologists and historians of science will manually label documents to the machine learning software by means of a human in the loop approach. In addition, a crowdsourcing approach will be used to further expand this corpus of examples. Our consortium will here build on the expertise of ALICE and Naturalis Biodiversity Center, the Leiden-based National Museum of Natural History. Eventually, the computer-assisted recognition of words and visual information on a page will thus allow users to search, filter and group arbitrary archive items and enables connections with external databases. Last but not least, MONK lays the groundwork for full transcription of any handwritten-illustrated archival collection.
The central use case of our research project is the collection of the Natuurkundige Commissie voor Nederlandsch-Indië (hereinafter NC). It is one of the top-collections of Naturalis. From 1820 to 1850, the NC charted the natural and economic state of the Dutch East Indies and returned a wealth of scientific data and specimens which are now stored in archives in the Netherlands and Indonesia. The collection comprises thousands of handwritten notes and drawings and tens of thousands biological and geological specimens. While these archival items have all been digitized, the individual pages in the notebooks, diaries and reports are not catalogued nor labeled separately. Many of the field notes combine different textual and visual elements on one page. Our short paper presentation is based on the processing of an initial set of understudied handwritten field notes which we carried out in early 2016. By doing so, we will demonstrate the efficiency of the MONK system and our approach.
Owing to the different ‘hands’ and languages used in the documents, links across handwritten field records and notes, drawings and specimens cannot be made in an efficient way. Our corpus contains material from at least seventeen different writers and the used languages range from German (Kurrentschrift) to Latin, French, and Dutch. The labels of related historic specimens only provide very general information on collection localities and collectors. Hence, the typical use case of a scholar wishing to retrieve information on a certain species, person, drawing, or collecting locality is limited. Owing to its sheer dimension and its weak structure, it is impractical to disclose and network this archive manually. Its current inaccessibility hampers research into Southeast Asian natural history and the history of (scientific) knowledge production. Knowledge extracted from the documents will be structured and served as Linked Open Data. This will allow interlinking of content and also enable interoperability with other cultural heritage resources, for example, the physical specimens obtained during expeditions, or other historically significant data collections from the same area.
The multi-layered character of the material makes it a perfect use case for developing a technologically advanced and usability engineered digital environment for interpreting and connecting illustrated-handwritten collections. In our consortium data scientists from the Universities in Leiden and Groningen work closely together historians of science from the University of Twente and taxonomy experts from Naturalis. Fueled by an investment from BRILL publishers in a national funding scheme, this project will not only result in the disclosure of the NC archive, but will also enable the integrated study of underexplored scientific manuscript collections and archives in general.