A Tool for NLP-Preprocessing in Literary Text Analysis

For literary scholars with considerable programming skills, the possibilities for widening the spectrum of research questions by adopting new computational methodology seem almost unlimited. Researchers with little or no such skills, however, have to rely on user-friendly tools. Simple word counts are still among the most common, and admittedly often very useful, features in computational text analysis. More complex features for analyzing the style or content of a literary text usually require linguistic annotations. For example, a researcher might want to investigate style in terms of syntactic preferences by applying stylometric analysis to part-of-speech tag n-grams, to run topic modelling on specific word types only, or to characterize the way an author describes figures by extracting all the adjectives that refer to a named entity. All these examples require the scholar to first extract linguistic information from the text and then use that information to define complex features.

Computational linguists have developed several tools for the various tasks of natural language processing (NLP) that can automatically analyze a digital text and annotate it with such information. In the present spectrum of solutions for NLP tasks, there is a gap between tools for rather simple tasks and full programming frameworks that require significant programming skills. One end of the spectrum is represented by WebLicht, 1 a web service that allows users to upload and process single files very comfortably. At the other end are GATE, 2 NLTK, 3 BookNLP 4 and the Darmstadt Knowledge Processing Repository (DKPro). 5

DKPro provides a programming framework in which many such NLP tools can be combined into an analysis pipeline. The pipelining approach is especially useful, often even necessary, when one NLP tool needs the annotations provided by another NLP tool in advance for extracting more complex linguistic features. DKPro thus provides access to tools like sentence splitters, tokenizers, part-of-speech taggers, named-entity recognizers, lemmatizers, morphological analyzers and parsers in many languages. 6

While DKPro makes NLP significantly easier by integrating many NLP tools into a single framework, using it still requires substantial knowledge of technologies like UIMA, Java and Maven. To further lower the skill threshold for literary scholars to use complex NLP output in computational text analysis, DARIAH-DE (the German branch of the European project Digital Research Infrastructure for the Arts and Humanities, funded by the German Federal Ministry of Education and Research) developed the DARIAH-DKPro-Wrapper (DDW). 7 The DDW bundles a pipeline with a set of commonly used NLP components into a Java program that is executed with a single command. Like DKPro in general, the wrapper provides transparent access to a whole set of different NLP tools, which are downloaded as needed behind the scenes. Command line options and configuration files give users a considerable degree of control over the pipeline and its components, providing partial access to DKPro functionality without requiring any programming knowledge. The DDW also solves the problem of the tools' differing input and output formats by offering unified access. The DDW therefore positions itself between the two ends of the aforementioned spectrum: it runs locally, allows for the processing of multiple files and can be configured to a considerable extent to one’s own needs. Whereas the user of classical DKPro is a UIMA programmer, the DDW can be used by anybody who can copy a command into the command line. Nonetheless, the DDW in some cases offers more features than other, more advanced solutions, as DKPro supports more tools and languages. It also integrates Stanford NLP and supports the highly efficient TreeTagger. A list of components available for both DKPro and the DDW can be found on the DKPro project page.

Furthermore, the DDW stores its output in a tab-separated plain text format inspired by CoNLL2009. 8 The format provides information on paragraph id, sentence id, token id, token, lemma, POS, chunk, morphology, named entity, parsing information and more. This format can be comfortably accessed in common scripting languages for further analysis, e.g. it can be read directly as a dataframe object in R or Python Pandas; it can even be opened in a common spreadsheet editor like Microsoft Excel.
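As a rough illustration of how such a file could feed the kinds of features mentioned earlier, the following minimal Python sketch reads a small tab-separated sample and derives part-of-speech bigram counts and a list of adjective lemmas. The column names (ParagraphId, SentenceId, TokenId, Token, Lemma, POS), the sample rows and the helper functions are assumptions made for this example only; the actual DDW header and tag set may differ, and a real file typically contains more columns. Only the Python standard library is used here; reading the same file into Pandas with a tab separator would presumably work along similar lines.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in the spirit of the DDW output format.
# Column names and tags are assumptions, not the documented header.
SAMPLE = (
    "ParagraphId\tSentenceId\tTokenId\tToken\tLemma\tPOS\n"
    "0\t0\t0\tThe\tthe\tDT\n"
    "0\t0\t1\told\told\tJJ\n"
    "0\t0\t2\thouse\thouse\tNN\n"
    "0\t0\t3\tstood\tstand\tVBD\n"
    "0\t0\t4\t.\t.\t.\n"
    "0\t1\t5\tA\ta\tDT\n"
    "0\t1\t6\tyoung\tyoung\tJJ\n"
    "0\t1\t7\twoman\twoman\tNN\n"
    "0\t1\t8\tsmiled\tsmile\tVBD\n"
    "0\t1\t9\t.\t.\t.\n"
)


def read_tokens(handle):
    """Read a tab-separated annotation file into a list of dicts, one per token."""
    # QUOTE_NONE keeps quotation-mark tokens from being treated as CSV quotes.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def pos_ngrams(rows, n=2):
    """Count POS-tag n-grams without crossing sentence boundaries."""
    sentences = {}
    for row in rows:
        key = (row["ParagraphId"], row["SentenceId"])
        sentences.setdefault(key, []).append(row["POS"])
    counts = Counter()
    for tags in sentences.values():
        for i in range(len(tags) - n + 1):
            counts[tuple(tags[i:i + n])] += 1
    return counts


def lemmas_with_pos(rows, prefix):
    """Return the lemmas of all tokens whose POS tag starts with the given prefix."""
    return [row["Lemma"] for row in rows if row["POS"].startswith(prefix)]


rows = read_tokens(io.StringIO(SAMPLE))
bigrams = pos_ngrams(rows, n=2)
adjectives = lemmas_with_pos(rows, "JJ")
```

For a file on disk, the same functions could be applied to a handle obtained with open(path, encoding="utf-8", newline=""), where path is a placeholder for the location of the output file.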

Scripts connecting the output format to popular text analysis tools like the R package stylo 9 are currently under development. DARIAH-DE has also prepared tutorials that explain how to use the wrapper and demonstrate the use of the output format in research in three use cases. 10

This poster will present the DDW and its file format as a new and comfortable means of providing linguistic annotations, thus significantly lowering the threshold for using complex NLP-based features in computational literary analysis.

Notes
1.
2.
3.
4.
5.
6. Not every kind of tool is available in all languages; it depends on the native support of the tools, not on the framework provided by DKPro.
7.
8.
9. Eder, Maciej, Mike Kestemont, and Jan Rybicki. "Stylometry with R: a suite of tools." Digital Humanities 2013: Conference Abstracts. 2013. For the software, see: https://sites.google.com/site/computationalstylistics/stylo
10.