The HathiTrust Research Center (HTRC) aims to facilitate large-scale computational text analysis of the contents of the HathiTrust Digital Library (HTDL) through data services and analytical tools. We conducted a study of current and potential users of the HTRC to investigate how scholars integrate text analysis into their research. Our study aims to inform the development of HTRC services and also to generate deeper insights into scholarly research practices with large-scale digitized text corpora.
Studies on the use of digital content by humanities scholars, ranging from humanities cyberinfrastructure (ACLS, 2006) and patterns in scholarly practices (Brockman et al., 2001; Palmer and Neumann, 2002; Green and Courtney, 2015), to discipline-specific studies (Zorich, 2012; Babeu, 2011; Rutner and Schonfeld, 2011), reveal that scholars acquire and analyze digital content in multi-faceted ways. Several investigations particularly examine scholarly uses of digital tools (Frischer et al., 2006; Toms and O’Brien, 2008; Gibbs and Owens, 2012). Computational text analysis dates from the beginnings of humanities computing (Hindley, 2013), and the resources of the ARTFL Project (Argamon et al., 2009; Horton et al., 2009), MONK (Unsworth, 2011), Wordseer (Muralidharan and Hearst, 2013), Voyant and TaPOR (Rockwell et al., 2010), and Lexos (LeBlanc et al., 2013), among others, inform the current work of the HTRC to provide a secure computational and data environment for researchers to conduct analyses of content from the HathiTrust Digital Library.
Our study builds on an earlier user needs assessment conducted for the HTRC and its Mellon Foundation-funded Workset Creation for Scholarly Analysis project. That earlier study analyzed interviews and focus groups in order to identify capabilities needed in large text corpora to facilitate scholarly research use (Fenlon et al., 2014). These desired capabilities included the ability to create and manipulate collections as reusable datasets and research products, the ability to work at different units of analysis, and access to highly enriched metadata (Green et al., 2014; Fenlon et al., 2014).
Our present study extends that earlier investigation by examining the text analysis research practices of current and potential users of the HTRC.
Our study’s primary goals are:
While the findings of this study will specifically inform the development of services to meet the needs of HTRC users, they also contribute broader insights into how to develop similar digital resources and research services for computational text analysis.
We conducted fifteen semi-structured interviews with students, faculty, researchers, administrators, and librarians who pursue work that includes text analysis or who are familiar with text analysis methods. Some participants were recruited at professional conferences for digital humanities and libraries, while others were active in HTRC user group forums. Several of the interviewees had previously interacted with the HTRC, and most had experience with the HTDL. The participants came from various disciplines, including English, Anthropology, History, and Computer Science, and ranged from newcomers to digital humanities to long-time researchers.
We performed an initial analysis of the interview data through open coding and will continue detailed qualitative analysis using ATLAS.ti. Data was independently coded by the authors to ensure inter-coder reliability. While we are still actively analyzing the interview data, we have identified several preliminary themes, which we discuss here. These themes include strategies for obtaining and managing data, research workflows and results, collaborations, and teaching.
Several respondents characterized text analysis research as being time-intensive in spite of the speed of computational tools. One interviewee noted, ‘It’s funny, often people think, “Oh we have it digitized, now it’s useful.” Scholars realize that you have a lot more work to do after that. And that can often slow projects down terribly.’
The interviewees indicated that gathering, managing, and manipulating text data comprised a considerable portion of their work. An interviewee explained, ‘I think the biggest challenge is data, getting good data to work with. I think people underestimate the problems and difficulties in doing that.’
Interviewees also expressed a desire for improved ways to identify and extract the content they needed, especially when navigating large-scale collections to find the volumes, pages, or passages relevant to a research project. As one interviewee remarked, ‘Even if you had somehow structured your texts, I would be saying, “What was left out? How do I bring it back in?”’
Several interviewees described the potential of text analysis to challenge previously held understandings of text, as differences between human and computational readings emerged. One respondent noted, ‘There are many cases in which the computer is at least as good—if not better—a reader than humans are. That’s very difficult for people to accept... sometimes the computer gets it right and it bears looking at that difference. So we kind of want to get that new ground truth on this kind of work.’
Many researchers highlighted the importance of interpretive work in understanding how the tools interact with the text, and characterized the interactions as dynamic. One respondent observed, ‘I yearn for workflows where the scholar could actually set their own tokenization rules.... It would be a way that we could create less language-specific [rules] or control the language specificity of the algorithm. I think that is the real need.’ Several respondents highlighted the importance of tools that fit flexibly into various stages of the research process and that are accessible to users of different skill levels. Interviewees also suggested enhancements specific to the HTRC, including expanded visualization capabilities, improved generation of statistics about text corpora, and better handling of languages other than English.
Interviewees repeatedly cited collaboration and research support, both virtual and in-person, as important. Many interviewees worked with digital humanities initiatives and reported that their local resources ranged from limited technical support to well-resourced research centers. For some interviewees, online support communities, such as Digital Humanities Questions and Answers or Stack Overflow, were also significant.
Interdisciplinary collaborations between departments and across institutions emerged as the most prominent kind of partnership, but interviewees also noted the challenges that such collaborations pose. As one interviewee explained, ‘Collaborations between institutions: much more difficult. There’s money, there’s institutional blockages, and then anything over half a dozen people, it gets complicated very quickly. And so the people dynamics get very complicated.’ Some respondents noted that these collaborations affected their research practices and acquisition of research resources.
Interviewees reported that their collaborations with libraries ranged from non-existent to critical partnerships. Many saw the library as a key space because ‘the library is actually the one functioning interdisciplinary space on a university campus.’ Collaborations with the HTRC and digital repositories for working with data also were important to respondents.
Interviewees mentioned their active efforts and intentions to incorporate computational text analysis into their teaching. Some remarked on institutional constraints that make it difficult to incorporate computational tools into curricula. As one respondent explained: ‘I once imagined teaching a class in which students learn to script and actually run analyses against data, but I was told, basically, that that class isn’t a humanities class anymore—that belongs in computer science.’
Some stated that the courses they currently teach may not require or allow for the incorporation of computational analysis. Others noted that a humanities student can realistically master only a limited number of technical or scientific skills within a short period of time. As one interviewee put it, ‘you can only get people to learn so much about the math; as much as they can learn, they should; at the same time, it’s hard.’
Although student demand for learning about computational text analysis was, overall, reported to be increasing, some interviewees noted that they are constrained not only by limited resources but also by uncertainty as to how to carry out such activities. One interviewee reported prevailing sentiments that the digital humanities ‘doesn’t even fit anywhere,’ leading to the question of whether ‘there should be a whole separate department that’s digital humanities,’ or whether training should be offered within existing curricula.
The immediate aims of this study are to generate an updated framework of user requirements that will guide the development of the HTRC’s educational programming and research support services, and to inform forthcoming Mellon Foundation-funded development of the HTRC Data Capsule. Our preliminary findings also provide insights into scholars’ needs as they increasingly incorporate text analysis into research and teaching. In addition, these findings reveal how digital scholarship centers, information professionals, and providers of digitized content can best support scholarship as digital humanities resources evolve.
We thank Megan Senseney, Angela Courtney, Nicholae Cline, and Leanne Mobley for their collaboration in this study.