In this paper we present a technique that enables the historical study of ideas rather than words. It aims to assist humanities scholars in overcoming the limitations of traditional keyword searching by making use of context-specific dictionaries. The technique is the result of a successful collaboration between the History Department of Utrecht University (UU) and the Research Department of the Koninklijke Bibliotheek, National Library of the Netherlands (KB), carried out by the authors of this paper during Huijnen's period as Researcher-in-residence at the KB in 2015.
The aim of this collaborative project was twofold: first, to create a method for dictionary extraction from a representative text corpus, based on existing methods and algorithms; second, to find a way of executing dictionary searches in the KB's digitized newspaper archive and of visualizing the results. Both components of the project were tested and evaluated by means of a case study on the impact of American scientific management theories in the Dutch public sphere during the first half of the 20th century. Using the approach described here, we were able to discover and analyze shifts in the way the modernization of Dutch business and economy was discussed during this period. We could not have achieved the same results by means of traditional historical scholarship alone.
Within historical scholarship, newspapers have traditionally been popular sources for studying public mentalities and collective cultures. At the same time, they are notoriously time-consuming and complex to analyze. The recent digitization of newspapers, and the use of computers to access the growing mass of digital corpora of historical news media, are altering the historian's heuristic process in fundamental ways.
The large digitization project the Dutch National Library currently runs illustrates this. To date, the KB has made publicly available over 80 million historical newspaper articles from the last four centuries. Researchers (as well as the wider public) can perform full-text searches across the entire repository of articles through the KB's own online search interface, Delpher (http://www.delpher.nl/kranten). Instead of manually skimming through a selected number of editions or volumes, this functionality allows for searching particular (strings of) keywords within the entire corpus. As basic as it may seem, full-text searching completely overturns the way in which historians are used to approaching newspapers. Instead of the successive top-down selections historians traditionally made in order to gradually isolate potentially interesting material, keyword searching treats the corpus as a single bag of words and therefore enables researchers to dive immediately into the texts that meet their search criteria (Nicholson, 2013).
At the same time, keyword searching has serious shortcomings for use in (cultural) historical research. Historians commonly work with texts, but are rarely interested in language per se. Rather, they use written or spoken sources (be it correspondence, literature, diaries, or news media) to gain access to past cultures, ideas, or mentalities. The things historians are most interested in are often not made explicit (e.g. the Enlightenment attitude, generational conflicts) and are difficult to reduce to singular keywords (modernity, secularization). Doing historical research with keyword searching is like painting a canvas with felt-tip pens: it loses every bit of subtlety.
To address this problem, we have successfully developed a technique of dictionary extraction and searching. The use of dictionaries can bring greater subtlety and diversity into digital historical scholarship. The more elaborate these dictionaries are, the more they overcome the contingency that comes with the use of singular keywords in search strategies. Several research projects that have incorporated highly domain- and time-specific word lists ('dictionaries') have already shown this. Text classification algorithms, for example, have helped find the most obvious indicator words for articles about strikes in the Dutch newspaper corpus (Van den Hoven et al., 2010). Implicit dictionaries based upon the topic modeling functionality of the MALLET package (http://mallet.cs.umass.edu) have assisted in finding Darwinian motives in Danish literature (Tangherlini and Leonard, 2013). Topic modeling was also used to build a neoliberalism dictionary to study Colin Crouch's post-democracy thesis in German historical newspapers (Wiedemann et al., 2013; Wiedemann and Niekler, 2014).
From the wide variety of techniques scholars have developed to build and use dictionaries, this project found most inspiration in the topic modeling-based method of the ePol Projekt (Wiedemann and Niekler, 2014). However, rather than aiming to build an optimal infrastructure for dictionary extraction of our own based on existing techniques, our project centered on practical usability. We sought to develop a (set of) tool(s) for working with dictionaries, tailored both to the computational expertise that can be expected of professional historians (and humanities scholars in general) and to their specific needs. In addition, one of the aims of the KB's Researcher-in-residence program is that the resulting tools and techniques be usable by the wider public searching the National Library's databases of historical newspapers, periodicals, and books. Our code is fully open source and can be found on GitHub (https://github.com/jlonij/keyword_generator). The ways in which we have tried to meet the specific demands this posed can, in our view, serve as an example for any Digital Humanities project that aims not at building highly specialized tools for individual projects, but at combining scholarly standards with the goal of generic usability.
We have accounted for the targeted user groups in the development of our dictionary extraction and search techniques in a number of ways. On the one hand, for example, we prioritized agility and flexibility over the deployment of exhaustive computational means. Our algorithm is able to extract a dictionary of flexible length from a given source input of text files within minutes. Because the technique is intended for exploratory use, it is essential to encourage iteration and experimentation; requiring too many preprocessing steps or demanding too much time would be counterproductive.
On the other hand, meeting the demands of tool criticism was crucial in every step of this project. We therefore avoided the risk of black-boxing wherever we could, while at the same time granting the expert user as much control as possible. Through command parameters, users decide on the segmentation of the source corpus, the number of topics to be generated, the number of words to be contained per topic, and the number of dictionary words required. Moreover, users may flexibly choose between Gensim's (https://radimrehurek.com/gensim) and MALLET's implementations of LDA, as well as a straightforward tf-idf implementation. When using one of the topic modeling packages, users are, just before the dictionary is generated, given the option of excluding any number of (irrelevant) topics from the equation.
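To give a sense of the simplest of these options, the following is a minimal sketch of a straightforward tf-idf dictionary extraction: terms of a target text segment are scored against a background of other segments, and the top-scoring terms form the dictionary. The toy segments, the smoothing scheme, and the function name are our illustrative assumptions, not the project's actual implementation.

```python
# Hedged sketch: tf-idf-based dictionary extraction from segmented text.
# All data and parameter values below are illustrative assumptions.
import math
from collections import Counter

# Toy "segmented source corpus": each item stands in for one text segment.
segments = [
    "scientific management efficiency factory work time study efficiency".split(),
    "efficiency work supervision time wages factory".split(),
    "parliament election vote debate government".split(),
]

def tfidf_dictionary(target, background, num_words):
    """Return the num_words terms of `target` with the highest tf-idf score."""
    tf = Counter(target)                      # term frequency in the target
    n_docs = len(background)
    def idf(term):
        df = sum(1 for doc in background if term in doc)
        return math.log((1 + n_docs) / (1 + df)) + 1  # smoothed idf
    scores = {term: count * idf(term) for term, count in tf.items()}
    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked][:num_words]

# Extract a dictionary of user-chosen length from the first segment.
keywords = tfidf_dictionary(segments[0], segments, num_words=5)
print(keywords)  # → ['efficiency', 'management', 'scientific', 'study', 'factory']
```

Terms that occur often in the target but rarely elsewhere rise to the top, which is why corpus-specific words like 'efficiency' outrank common background vocabulary.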
An evaluation in terms of generic precision and recall for any of the variables is, in our view, at odds with the principle of flexibility. Instead, we evaluated and improved the dictionary extraction by comparing automatically generated dictionaries with ones built manually on the basis of domain knowledge. Comparing the results of searches with different dictionaries in the KB's digitized newspaper archive served as an additional evaluation method: dictionaries could be compared in terms of how they ranked a set of key articles about a particular topic, since the archive's Solr (http://lucene.apache.org/solr) search engine scores the results of an OR-query (the search string in which we expressed the dictionaries) on the basis of, amongst other things, the number of times query words appear in an article. The case study used to test, evaluate, and apply the tools and techniques under development was the impact of American scientific management ideas in the Dutch public media before WWII.
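Expressing a dictionary as an OR-query can be sketched as follows. The field name "text", the quoting of multi-word terms, and the helper name are assumptions for illustration; the actual Delpher/Solr schema and query syntax used by the project may differ.

```python
# Hedged sketch: turning a dictionary (word list) into a single Solr
# OR-query string. Field name and escaping rules are assumptions.
dictionary_words = ["efficiency", "scientific management", "time study"]

def to_solr_or_query(words, field="text"):
    """Join dictionary words into one OR-query; quote multi-word phrases."""
    terms = ['"%s"' % w if " " in w else w for w in words]
    return "%s:(%s)" % (field, " OR ".join(terms))

query = to_solr_or_query(dictionary_words)
print(query)  # → text:(efficiency OR "scientific management" OR "time study")
```

Because Solr's default scoring rewards documents matching more (and rarer) query terms, articles saturated with dictionary vocabulary rank higher, which is what makes the ranking comparison described above possible.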
On the basis of this case study, we would like to show in our presentation how our implementation of dictionary extraction, search, and visualization can assist the scholarly historical study of digital corpora in general. By visualizing the search results from different dictionaries, we are able to show shifting discourses in historical news media. Plotting the number of articles containing a user-specified number of words from any given dictionary, we can present trends in discourse-specific vocabulary usage over time. Whereas existing historiography, for example, suggests a continuing use of scientific management vocabulary in the Netherlands since its introduction in the 1910s, our project presents a more differentiated picture. Dictionary searches in the KB's newspaper corpus show how the use of words in public media connected to the sphere of scientific management (based on context-specific literature) waned after WWII and made room for a new vocabulary belonging to a new era.
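The underlying trend computation can be sketched as follows: per year, count the articles containing at least a user-specified number of distinct dictionary words. The toy articles, the dictionary, the threshold, and the function name are our illustrative assumptions; the project plots such yearly counts over time.

```python
# Hedged sketch: yearly counts of articles matching at least `min_hits`
# distinct dictionary words. All data below are toy assumptions.
from collections import Counter

dictionary_words = {"efficiency", "supervision", "time", "work"}
articles = [
    (1920, "the efficiency of work and supervision in the factory"),
    (1921, "time and work under scientific supervision"),
    (1950, "new ideas about the economy and government planning"),
]

def yearly_counts(articles, words, min_hits=2):
    """Count, per year, the articles with at least min_hits dictionary words."""
    counts = Counter()
    for year, text in articles:
        hits = words.intersection(text.split())  # distinct dictionary words found
        if len(hits) >= min_hits:
            counts[year] += 1
    return dict(counts)

trend = yearly_counts(articles, dictionary_words)
print(trend)  # → {1920: 1, 1921: 1}
```

Plotting such a series for each dictionary, one curve per discourse, is what makes waning and emerging vocabularies visible over time.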
At the same time, this case study illustrates how digital techniques like ours bring about conceptual innovations in the study of history. After all, our case study shows that (combinations of) ordinary words (in this instance, for example, 'time', 'work', or 'supervision') are better suited to tracing discursive discontinuities than the 'big' words (like 'taylorism' or 'neoliberalism') on which historians have traditionally focused.