Preparing sources for historical research usually requires making many heterogeneous collections digitally accessible and linking them into a multi-faceted, multi-layered resource that supports both distant reading and close reading. In their model of the lifecycle of historical information, introduced in 2004, the Dutch DH experts Boonstra, Breure and Doorn emphasize three points that e-science experts and researchers alike should keep in mind to keep historical information systems alive and useful: durability, usability and modeling (Boonstra et al., 2006: 22).
Timbuctoo, developed at the Huygens ING, is a system that makes it possible not only to model and store heterogeneous data but also to enrich and link the data incrementally. It also offers facilities to document the provenance of all data as well as every step in editing, extracting and linking the data. These features are vital for historical research, in which researchers need to be able to exercise 'source criticism' and go back to the original source or data at all times (Ockeloen et al., 2013).
We will demonstrate the solution Timbuctoo offers with the Migrant, Mobilities and Connection project as a use case, because it involves multiple, complex links on several levels to datasets from a myriad of cultural heritage institutions (archives, libraries and museums).
The main focus of the Migrant project is on the life courses of the migrants. Starting from the limited core data obtained by connecting (digitized) Dutch emigrant cards with Australian immigration files, the life courses will be elaborated using in-depth analysis of these and other collections. It is important to note that, for the purpose of the Migrant project, life courses not only comprise dates of birth, marriage, migration, education and employment but also extend to the interactions of migrants with all sorts of institutions in the Netherlands and Australia and their representatives. The database therefore enables us to analyse and compare the evolution of a multitude of social networks (Arthur et al. (submitted); Van Faassen, 2014a, 2014b).
The Migrant, Mobilities and Connection project focuses on post-World War II migration from the Netherlands to Australia. Like all migrants, these 180,000 people have left many traces in different cultural heritage collections, ranging from (supra)national government archives to the photo and memorabilia collections of the migrant families themselves, and anything in between. These collections are dispersed over different countries. Many of the collections are available in digital form or will be digitized in the future, but like all historical collections they contain partly structured and partly unstructured information that needs to be made accessible for further analysis. In elaborating the data we will use a variety of methods, ranging from computer-assisted extraction and linking of a large collection of life events to hand editing of handwritten registration cards and personal migrant files. From an analytic perspective, all these collections and edited data can be seen as different layers that need to be accommodated by the data store in which they are kept (Hoekstra and Nijenhuis, 2012).
Timbuctoo is a data repository system for humanities research. Its aim is to link together datasets containing structured information about people, places and organisations, without actually merging them, in order to facilitate scientific analysis and discussion.
To accomplish this, it defines a number of primitive types that describe entities that all researchers agree on, such as the aforementioned persons, places and organisations, as well as works, languages, concepts and events. Each research project that makes use of the repository can extend the primitive types with extra fields. On top of that, the repository can store multiple viewpoints on the same entity. In this way, researchers become aware of the different or sometimes even conflicting assumptions about entities, fueling scholarly debates in a conceptual way. Timbuctoo also supports versioning and provenance. To make clear on which information results are based, every change made to the data should record who made the change, when it was made and for what reason. The user interface, analysis and visualization are completely separated from the storage of the data. All services are coupled using REST (Fielding, 2000) APIs. The software is freely available under an open source license (GPL 3) and is published on GitHub.
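The data model described above can be illustrated with a minimal sketch. It is not Timbuctoo's actual implementation or API: the class names (Change, Entity), the project names and all field values are hypothetical and invented for illustration. The sketch assumes that each project keeps its own append-only list of versions for an entity, and that every version carries the who, when and why of the change.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Change:
    """Provenance for one edit: who made it, when, and for what reason."""
    author: str
    timestamp: datetime
    reason: str


@dataclass
class Entity:
    """An instance of a primitive type that holds one viewpoint per project.

    Hypothetical illustration, not a Timbuctoo class.
    """
    entity_id: str
    primitive_type: str  # e.g. "person", "place", "organisation"
    # project name -> list of (Change, fields) versions, oldest first
    variants: dict = field(default_factory=dict)

    def record(self, project, fields, author, reason):
        """Append a new version for one project; older versions are kept."""
        change = Change(author, datetime.now(timezone.utc), reason)
        self.variants.setdefault(project, []).append((change, dict(fields)))

    def current(self, project):
        """Return the latest fields as seen by one project."""
        return self.variants[project][-1][1]

    def viewpoints(self, key):
        """Return what each project currently asserts for a single field."""
        return {
            project: versions[-1][1][key]
            for project, versions in self.variants.items()
            if key in versions[-1][1]
        }


# Invented example values: two projects disagree on a year of birth.
person = Entity("p1", "person")
person.record("dutch_cards", {"family_name": "Jansen", "birth_year": 1925},
              author="editor_a", reason="transcribed emigrant card")
person.record("australian_files",
              {"family_name": "Jansen", "birth_year": 1926, "occupation": "carpenter"},
              author="parser_b", reason="extracted from immigration file")
person.record("dutch_cards", {"family_name": "Janssen", "birth_year": 1925},
              author="editor_a", reason="corrected spelling after checking facsimile")
```

In this sketch the two viewpoints on the year of birth stay side by side instead of one overwriting the other, and the earlier spelling of the family name remains retrievable from the version list together with the reason it was changed.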
During the project, data will be added, edited and analysed continuously. As indicated above, at the beginning of the project the data consists of migration information contained in cards, files and information from governance agencies. Apart from the core information already available in a simple database, the cards and files contain much more information that must be digitized before it can be used for analysis. In the course of the project, many other materials from archives, libraries and other collections of different cultural heritage institutions will be added to the Timbuctoo database. Some of the information will be structured, but most is contained in typed or handwritten files and in images. The aim of information extraction is to derive structured information from this unstructured material.
For the elaboration of this wealth of materials we will use an eclectic mix of editing and information extraction methods. Previously, hand editing was the only option for these types of materials, but in light of the amount of material that will be collected, all computer-assisted information extraction that is possible will contribute to the database and help analysis.
To automate this process, the data needs to be stored in such a way that a context can be built. The computer can search for patterns and suggest links to data already present in the network, or calculate statistics to point out interesting or unusual things in the dataset. To begin with, an algorithm is needed to link the Dutch and Australian records together. Note that the system should not actually merge the records. Data about persons can be linked on the basis of, for example, family name, year and place of birth, and indeed all other types of structured information available, such as migration date, migration scheme, the ship on which they travelled, or other, quite different data depending on the source and the context.
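A minimal sketch of such a linking step is given here. It is an illustration under stated assumptions, not the project's algorithm: the field names, the weights (0.5, 0.3, 0.2), the threshold of 0.8 and all example records are invented, and name similarity is approximated with the standard library's difflib.SequenceMatcher. The function only produces scored link suggestions between record identifiers and leaves both records untouched.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass(frozen=True)
class Record:
    """A person record from one source (hypothetical field selection)."""
    record_id: str
    family_name: str
    birth_year: int
    birth_place: str


@dataclass(frozen=True)
class LinkSuggestion:
    """A proposed link between two records; the records are not merged."""
    dutch_id: str
    australian_id: str
    score: float


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def score(dutch: Record, australian: Record) -> float:
    """Weighted agreement on name, birth year and birth place.

    The weights are illustrative assumptions, not tuned values.
    """
    total = 0.5 * similarity(dutch.family_name, australian.family_name)
    total += 0.3 * (1.0 if dutch.birth_year == australian.birth_year else 0.0)
    total += 0.2 * similarity(dutch.birth_place, australian.birth_place)
    return total


def suggest_links(dutch_records, australian_records, threshold=0.8):
    """Compare every pair and return suggestions at or above the threshold."""
    suggestions = []
    for d in dutch_records:
        for a in australian_records:
            s = score(d, a)
            if s >= threshold:
                suggestions.append(LinkSuggestion(d.record_id, a.record_id, s))
    return sorted(suggestions, key=lambda link: link.score, reverse=True)


# Invented example records.
dutch = [Record("nl-1", "Jansen", 1925, "Utrecht")]
australian = [
    Record("au-1", "Janssen", 1925, "Utrecht"),
    Record("au-2", "de Vries", 1931, "Groningen"),
]
links = suggest_links(dutch, australian)
```

With these weights a pair that disagrees on the year of birth can never reach the threshold, so only the spelling variant "Janssen" is suggested as a match for "Jansen". Additional fields such as migration date or ship could be added to the score in the same way. Comparing every pair is only workable for small sets; a larger collection would need some form of pre-selection.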
Other examples of computer-assisted data extraction include the recognition of certain keywords in facsimiles or transcriptions. Researchers can add or edit information through the user interface, either manually or using algorithms. Automatic information extraction should suggest relations between different records or entities, but never actually enforce those changes on the researcher.
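The principle that extraction suggests but never enforces can be sketched as follows. The Suggestion class, the status values and the example transcription are hypothetical and invented for illustration. The sketch assumes that keyword matches are stored as pending suggestions that only change state when a researcher reviews them.

```python
import re
from dataclasses import dataclass


@dataclass
class Suggestion:
    """A machine-generated proposal that awaits a researcher's decision."""
    record_id: str
    keyword: str
    status: str = "pending"  # "pending", "accepted" or "rejected"


def spot_keywords(record_id, transcription, keywords):
    """Return a pending suggestion for every keyword found as a whole word."""
    found = []
    for keyword in keywords:
        pattern = r"\b" + re.escape(keyword) + r"\b"
        if re.search(pattern, transcription, flags=re.IGNORECASE):
            found.append(Suggestion(record_id, keyword))
    return found


def review(suggestion, accept):
    """Record the researcher's decision; nothing is applied before this call."""
    suggestion.status = "accepted" if accept else "rejected"
    return suggestion


# Invented example transcription and keyword list.
text = "Occupation: Carpenter. Travelled with wife and two children."
suggestions = spot_keywords("card-1", text, ["carpenter", "farmer"])
pending_before_review = [s.status for s in suggestions]
review(suggestions[0], accept=True)
```

The decision stays with the researcher: a suggestion remains "pending" until it is explicitly accepted or rejected.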
We set out to build Timbuctoo to address the large variety of heterogeneous data sources that our institute has produced in the course of a hundred years of classical and some twenty-five years of digital source editing and publishing. The use case of the Migrant, Mobilities and Connection project, with data about thousands of personal migrant stories scattered all over the world and its myriad of policy files on national and supranational levels recorded in different datasets, demonstrates the different features of Timbuctoo.
First, Timbuctoo is used as a repository where researchers as well as automatic tools such as parsers store all the heterogeneous data and the relations between them. Since all the data is versioned and provenance information is recorded, it is always clear where the data originates. Second, Timbuctoo serves as a data source for researchers and parsers. Named entity recognition tools can use all the available names of places, organisations and persons as training data. Researchers can run queries on certain properties and perform statistical analysis on the results, either to find outliers or to confirm or refute a hypothesis on a larger scale and in more varied ways than previously possible. Finally, being a graph database, Timbuctoo is a research tool that enables researchers to infer indirect relations from the numerous direct relations in the repository. This makes it possible to perform complex queries and conduct network analysis and visualization. These three features combined enable researchers to discover unexpected phenomena that can lead to new research and methodological questions.
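How indirect relations follow from direct ones can be shown with a small sketch. It does not use Timbuctoo's query facilities: the graph is a plain Python dictionary, the node labels are invented, and a breadth-first search stands in for whatever traversal the graph database itself provides.

```python
from collections import deque


def add_relation(graph, a, b):
    """Store an undirected direct relation between two entities."""
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the chain of entities or None."""
    if start not in graph or goal not in graph:
        return None
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in sorted(graph[node]):  # sorted for a stable order
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


# Invented example: two migrants who share a ship, one linked to an agency.
graph = {}
add_relation(graph, "person:A", "ship:X")
add_relation(graph, "person:B", "ship:X")
add_relation(graph, "person:B", "organisation:Y")
add_relation(graph, "person:C", "place:Z")

path = shortest_path(graph, "person:A", "organisation:Y")
unconnected = shortest_path(graph, "person:A", "person:C")
```

No direct relation between person A and organisation Y is stored, yet the path through the shared ship and person B makes the indirect connection explicit. The same structure is the starting point for network analysis and visualization.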