DH 2016 Abstracts

Introduction to Natural Language Processing

Brief Description

The application of computational tools to textual data is a growing area of inquiry in the humanities. From the “Culturomics” culled from the 30-million-document Google Books collection to the painstakingly detailed analysis of the text of Shakespeare’s plays to ascertain their ‘true’ creator, a wide range of techniques and methods has been developed and employed. Text analysis in the humanities has also garnered an impressive level of interest in the mainstream media. For example, a study analyzing the relationship between a professor’s gender and their teaching reviews and an overview of Franco Moretti’s ‘distant reading’ both recently appeared in the New York Times. The Atlantic featured a historical critique of the language used in the period drama ‘Mad Men’, in which textual analysis of the script revealed departures from the standard American English spoken in the 1960s. Most of this work, however, relies on techniques such as n-grams and bag-of-words models. Recent developments in computational linguistics, which attempt to mimic the complex process by which humans parse and interpret language, are finding increased use within the humanities.
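For readers unfamiliar with the two baseline techniques named above, the following is a minimal Python sketch of a bag-of-words count and word bigrams. It is an illustration only and not material from the workshop: the sample sentences, the regex-based `tokenize` helper, and the function names are all assumptions introduced here.

```python
import re
from collections import Counter


def tokenize(text):
    """Crude tokenizer (an assumption for this sketch): lowercase runs of letters/apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(tokens):
    """Count how often each token occurs, ignoring word order entirely."""
    return Counter(tokens)


def ngrams(tokens, n=2):
    """Count each run of n adjacent tokens (n=2 gives bigrams)."""
    return Counter(zip(*(tokens[i:] for i in range(n))))


sample = "The dog chased the cat. The cat chased the mouse."
tokens = tokenize(sample)

bow = bag_of_words(tokens)
bigrams = ngrams(tokens, 2)

print(bow.most_common(3))
print(bigrams.most_common(3))

# Word order is discarded: these two sentences get identical bag-of-words counts.
print(bag_of_words(tokenize("dog bites man")) == bag_of_words(tokenize("man bites dog")))
```

The final comparison shows the limitation that motivates richer linguistic annotation: a bag-of-words model cannot tell who did what to whom, and bigrams recover only very local word order.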

This workshop will introduce the basic components of modern natural language processing. Techniques include tokenization, lemmatization, part-of-speech tagging, and coreference detection. These will be introduced by way of examples on small snippets of text before being applied to a larger collection of short stories. Applications to stylometric analysis, document clustering, and topic detection will be briefly mentioned along the way. Our focus will be on a high-level, conceptual understanding of these techniques and the potential benefits of using them over models commonly employed for text analysis within humanities research. We will also introduce open-source software that is available for a wide range of programming languages (e.g., Java, R, Python, Ruby, Perl) and that can parse an increasingly large number of natural languages (e.g., English, French, Spanish, Chinese, German, Turkish, Arabic). The workshop is based on a chapter from the instructors’ book Humanities Data in R: Exploring Networks, Geospatial Data, Images, and Text (Springer, 2015).
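As a rough idea of what the first steps look like on a small snippet of text, here is a minimal sketch of sentence splitting and tokenization using only the Python standard library. The rules, the sample text, and the function names are assumptions made for illustration; the toolkits covered in the workshop use far more robust, typically trained, components.

```python
import re


def split_sentences(text):
    """Naive rule (an assumption): a sentence ends at '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Split into words (keeping internal apostrophes and hyphens) and single punctuation marks."""
    return re.findall(r"\w+(?:['-]\w+)*|[^\w\s]", sentence)


text = "Dr. Watson arrived late. Holmes wasn't surprised!"

for sentence in split_sentences(text):
    print(tokenize(sentence))
```

The naive rule wrongly treats the period in "Dr." as a sentence boundary, so the snippet is split into three pieces instead of two. Handling abbreviations, quotations, and similar cases is one reason dedicated NLP software is preferable to hand-written patterns.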

Instructors

Taylor Arnold is currently a lecturer in the Department of Statistics at Yale and a senior scientist at AT&T Labs. His research focuses on the analysis of large, complex datasets and the resulting computational challenges. A particular area of focus is the sparse representation of highly structured objects such as text corpora and digital images. He is the technical co-director of the NEH-funded project Photogrammar. Together with Lauren Tilton, he is the co-author of the text Humanities Data in R: Exploring Networks, Geospatial Data, Images, and Text.

Lauren Tilton is a doctoral candidate in American Studies at Yale University. She is the Co-Director of Photogrammar (photogrammar.yale.edu) and Participatory Media (http://participatorymediaproject.org/). Her research interests include 20th-century U.S. history and visual culture as well as digital and public humanities. She is the co-author, with Taylor Arnold, of Humanities Data in R: Exploring Networks, Geospatial Data, Images, and Text. She will be joining the faculty at the University of Richmond as a Visiting Assistant Professor of Digital Humanities in Fall 2016.

Target Audience and Size

This workshop is accessible to participants from all backgrounds.

Brief Outline of Workshop

Introduction to NLP
Tokenization and Sentence Splitting
Lemmatization
Part of Speech Tagging
Dependencies
Named Entity Recognition
Coreference Resolution
Overview and Comparison of Current Software
- Stanford CoreNLP
- Apache OpenNLP
- spaCy.io