Linguistic DNA is a collaboration by linguists, historians, and digital humanities specialists at the Universities of Sheffield, Glasgow and Sussex. Our aim is to explore the emergence and development of semantic concepts as they are realised in historic textual corpora through a combination of computational processing and data visualisation techniques. This poster outlines the project’s goals and methodology, describes the progress made towards those goals, and offers interim results. Prior to the era of big data, semantic research relied on intuitive selection of concepts worthy of study and drew its evidence largely from canonical texts. The advent of large machine-readable textual collections opens the door to new methodologies for research in conceptual history, revolutionising our ability to extract information from such data sets. The Linguistic DNA project explores the use of these techniques for historical semantics, beginning with annotated corpora yet without a predetermined set of concepts to study. The intention is that, through text processing and data visualisation, concepts should emerge ‘bottom-up’ out of collections that extend far beyond the canonical texts.
The main sources of data for analysis are the Early English Books Online collection (henceforth EEBO) 1 and Eighteenth Century Collections Online (ECCO). 2 EEBO has been manually transcribed to high accuracy levels by the Text Creation Partnership (TCP), whilst ECCO is partly manually transcribed and partly OCR’d. Both are to be annotated with lemma and part of speech information, although our process takes different inputs, and begins with cleaned text. These collections together consist of English-language material printed between the 15th and 18th centuries. Initial stages of investigation involve the development of a software tool or suite of tools to query the data and provide input for visualisation. The success of the visualisations is then evaluated by the project’s team of research associates, who investigate the patterns which emerge, seek verification of these patterns by returning to the textual source material, and use the resulting insights as input for iterative improvement of the querying and visualisation processes.
In the first year of the project, development of the query software has begun with assessment of the text for potentially challenging features, such as pre-standardisation spelling, inconsistent transcription practices, and atypical syntax. Also essential has been identification of the pre-processing required before the texts are analysed, and investigation of existing software packages that might be adapted and extended to meet our research goals. S-Space and BlackLab are examples of tools which might form part of a new pipeline, the components and algorithms of which will be developed by iterative experimentation. The processor will take account of different statistical measures, starting with Pointwise Mutual Information, and will collect data for a range of proximity windows to assess semantic relevance through distributional semantics techniques. The processor outputs groups of words with strong patterns of association, which are then investigated as candidate concepts. In later stages the project will also use versions of the textual data annotated with sense codes based on the Historical Thesaurus of English. 3 This facilitates disambiguation of the senses of homographs, as well as offering another means of assessing relationships between words. To maximise the ‘bottom-up’ approach to data analysis, the LDNA processor initially indexes and runs queries on every word in the corpus. This avoids presupposing concepts or key terms.
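As an illustration only, the following minimal Python sketch shows one way Pointwise Mutual Information might be computed for every word pair over several proximity windows. It is not the project's processor: the function names `cooccurrence_pmi` and `pmi_by_window`, the choice to count unordered pairs, and the normalisation of pair probabilities by the total number of co-occurring pairs are all assumptions made for this example, and the token list is invented toy data.

```python
import math
from collections import Counter


def cooccurrence_pmi(tokens, window):
    """Return {(w1, w2): PMI} for unordered word pairs that co-occur
    within `window` tokens of each other (hypothetical helper)."""
    word_counts = Counter(tokens)
    pair_counts = Counter()
    for i, word in enumerate(tokens):
        # Look ahead only, so each co-occurrence is counted once.
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            pair_counts[tuple(sorted((word, tokens[j])))] += 1

    n_tokens = len(tokens)
    n_pairs = sum(pair_counts.values())
    scores = {}
    for (a, b), count in pair_counts.items():
        # Assumed normalisation: joint probability over all observed pairs,
        # marginal probabilities over all tokens.
        p_ab = count / n_pairs
        p_a = word_counts[a] / n_tokens
        p_b = word_counts[b] / n_tokens
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


def pmi_by_window(tokens, windows):
    """Compute PMI scores separately for each proximity window size."""
    return {w: cooccurrence_pmi(tokens, w) for w in windows}


if __name__ == "__main__":
    # Toy lemmatised input; real input would be cleaned corpus text.
    sample = "the grace of god and the law of god".split()
    for size, scores in pmi_by_window(sample, [1, 3, 5]).items():
        strongest = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:3]
        print(size, strongest)
```

Because no word list is supplied, every word in the input is scored against every neighbour, which mirrors the ‘bottom-up’ intention described above. Comparing the rankings across window sizes gives a rough sense of how the choice of window affects which associations stand out.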
Further evaluation will be conducted through the lens of three ‘research themes’. Research Theme 1, led by Professor Susan Fitzmaurice at the University of Sheffield, will contextualise the emergence and development of concepts within the historical situations which have instigated and shaped them. Research Theme 2, led by Dr Justyna Robinson at the University of Sussex, will investigate where the boundaries of concepts lie and the families of words which delineate and cross these boundaries. Research Theme 3, led by Dr Marc Alexander at the University of Glasgow, will explore moments of rapid change in the lexical items used to instantiate concepts. The research themes begin their work once initial processor development has taken place, and case studies with preliminary findings will be included on the poster. The project runs until 2018.
The Linguistic DNA project is funded by the Arts and Humanities Research Council (project AH/M00614X/1).
Kay, Christian, Jane Roberts, Michael Samuels, Irené Wotherspoon, and Marc Alexander (eds.). 2015. The Historical Thesaurus of English, version 4.2. Glasgow: University of Glasgow. http://historicalthesaurus.arts.gla.ac.uk/.