While there have been numerous efforts to frame the history of the Digital Humanities, no study has concretely characterized the extent to which Digital Humanities research is data driven (Gold and Klein, 2012; Schreibman, Siemens, and Unsworth, 2004; Nyhan, Flinn, and Welsh, 2015; Terras, Nyhan, and Vanhoutte, 2013). Debates related to this topic periodically crop up along the Hack/Yack divide, as recurrent waves of scholars reflect on the varied histories, projects, and positions that comprise the Digital Humanities (Nowviskie, n.d.; Ramsay, n.d.; Cecire, n.d.; Alvarado, 2012). While these debates will likely continue, it is clear that current theoretical and historical contextualization stands to benefit from a more granular evaluation. Such an evaluation holds potential to shed light on data-driven research practices across disciplines and academic ranks, the distribution of this output by institution type and geographical location, and relative research data accessibility. It can also illuminate the scope of data resources used to further Digital Humanities research, which in turn holds potential to inform library efforts to augment Digital Humanities support with a more nuanced focus on the acquisition, preparation, and provision of data that is more readily usable by Digital Humanists (Bryson et al., 2011; Sustaining the Digital Humanities, n.d.; Rockenbach, 2013). In order to realize these benefits, the present study focuses on Digital Humanities praxis that is expressly data driven and computationally contingent. The study of this praxis is achieved through analysis of nearly 500 articles drawn from seven years of Oxford University Press' Digital Scholarship in the Humanities (formerly Literary and Linguistic Computing), seven years of Digital Humanities Quarterly (the full run of the journal), and the full run of the Journal of Digital Humanities.
In order to evaluate data praxis, it was necessary to come to a working definition of “data” scoped to the level of concrete usage patterns in the Digital Humanities. The conclusion that a particular article utilized source “data” was based on whether or not the material under analysis played a role in supporting research claims predicated on the affordances of the digital object itself. A close reading of a digital version of Jane Eyre therefore would not meet the criterion of data driven, but topic modeling Jane Eyre would, as this is a form of analysis that is possible only given the digital instantiation of the object under study. Articles that were understood more as reports on data-oriented research than as active analysis were typically excluded. Assessments, historiographies, and other meta-analyses of computational research represented elsewhere are not treated as data driven for the purposes of this study, as the work in question can move forward without leveraging the affordances of a digital object. Even where these types of articles are held not to contain source data under a process of direct computational analysis or representation, they are still considered against a rubric of research data production. Research data is understood to encompass any non-rhetorical, primarily structural data generated as an output that is used to validate research findings (Federal Register Notice Re OMB Circular A-110, n.d.). This might include tabular data, computer code, or survey responses.
Research data production is evaluated in order to come to an understanding of how Digital Humanists provide, or do not provide, access to the generated data they use to support their arguments. The authors sought to focus on this aspect of research given a growing movement by scholars, operating primarily outside of the Humanities, to make their data and code accessible to support reproducibility and transparency (Stodden, Leisch, and Peng, 2014). If research data was produced in a given article, the authors proceeded to evaluate whether or not it was accessible. Research data is considered accessible only if the data in question is made available in a format that is machine processable. Therefore, a table of research data or an image of a line graph included in an article as a JPG is not accessible, because the format renders the data intractable. Furthermore, a subset of a larger set of data, used mainly to illustrate an aspect of an argument rather than to provide access to the unmediated source dataset, is held to be inaccessible. Collectively, researchers, librarians, and publishers can use this portion of the study to inform assessment of the extent to which current research and publication practices are in line with how the field aims to articulate the integrity of its research claims.
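The accessibility rubric described above can be read as a small decision rule. The following Python sketch is illustrative only and is not the instrument used in the study: the names `ResearchData`, `is_accessible`, and `MACHINE_PROCESSABLE` are hypothetical, and the list of machine-processable formats is an assumption, since the study names only the JPG case explicitly.

```python
from dataclasses import dataclass

# Assumption: an illustrative set of machine-processable formats.
# The study does not enumerate one; it only rules out images such as JPG.
MACHINE_PROCESSABLE = {"csv", "tsv", "json", "xml", "txt"}


@dataclass
class ResearchData:
    """Hypothetical record of the research data attached to one article."""
    produced: bool         # did the article generate research data at all?
    file_format: str       # e.g. "csv" or "jpg", without a leading dot
    is_full_dataset: bool  # False when only an illustrative subset is shared


def is_accessible(rd: ResearchData) -> bool:
    """Apply the rubric: full dataset, shared in a machine-processable format."""
    if not rd.produced:
        return False  # nothing to access
    if not rd.is_full_dataset:
        return False  # an illustrative subset is held to be inaccessible
    return rd.file_format.lower() in MACHINE_PROCESSABLE
```

Under these assumptions, a complete dataset shared as CSV counts as accessible, while the same data shown as a JPG figure, or shared only as an excerpt, does not.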
On the whole, article-level analysis is supported by capturing up to 48 descriptive elements for each article in the target corpus. In aggregate this dataset captures the number of data sources used in a given article, the provider of the data, type of provider, data collection name, content type, format, extent, size, publication before or after 1923, whether research data is produced, whether research data is accessible, the method of access provided if it is accessible, research data URL, research data type, and a range of demographic data that allows disciplinary characterization of data-driven practices by scholars and students, the types of institutions they work in, and where in the world they work. Collectively this data will enable the Digital Humanities community to gain a concrete sense of what proportion of Digital Humanities scholarship, as represented in a core set of journals, is data driven. This study indicates that current research and publication practices provide insufficient access to research data. Because the evaluation of a data-driven article’s argument requires access to research data, this scarcity seems especially troubling. Additional pragmatic gains to be had from this study include ready access to all data sources utilized over the past seven years in core Digital Humanities journals. Ready access to this data holds potential to increase awareness of data for Digital Humanities research and pedagogy, in addition to informing library acquisition, preparation, and provision of data that can be used to support the Digital Humanities. Through its concrete focus on data praxis, this study provides newly comprehensive insight into data-driven practices across the Digital Humanities.
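As a rough illustration of how the descriptive elements listed above might be organized, the following Python sketch models a subset of them. It is a sketch under stated assumptions, not the study's actual schema: the class and field names (`DataSource`, `ArticleRecord`, `share_data_driven`) are hypothetical, demographic elements are omitted, and treating an article as data driven when it lists at least one data source is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DataSource:
    """One data source used in an article (hypothetical field names)."""
    provider: str
    provider_type: str
    collection_name: str
    content_type: str
    data_format: str
    extent: str
    size: str
    pre_1923: Optional[bool] = None  # None when the publication date is unknown


@dataclass
class ArticleRecord:
    """A subset of the per-article descriptive elements."""
    sources: List[DataSource] = field(default_factory=list)
    research_data_produced: bool = False
    research_data_accessible: bool = False
    access_method: Optional[str] = None
    research_data_url: Optional[str] = None
    research_data_type: Optional[str] = None

    @property
    def source_count(self) -> int:
        # Number of data sources used in the article
        return len(self.sources)


def share_data_driven(records: List[ArticleRecord]) -> float:
    """Proportion of articles with at least one data source (assumed criterion)."""
    if not records:
        return 0.0
    return sum(1 for r in records if r.source_count > 0) / len(records)
```

Keeping sources as a nested list lets the number of data sources per article be derived rather than stored separately, which avoids the two values drifting apart.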