Van Zundert, J., Andrews, T. (2016). Qu’est-ce qu’un texte numérique? A New Rationale for the Digital Representation of Text. In Digital Humanities 2016: Conference Abstracts. Jagiellonian University & Pedagogical University, Kraków, pp. 395–397.
Qu’est-ce qu’un texte numérique? A New Rationale for the Digital Representation of Text

Over half a century ago Briet (1951) famously asked “Qu’est-ce que la documentation?” She proposed a definition much more fluid than the scraps of writing on paper we usually associate with the term: any object, even an uninscribed stone, becomes a document as soon as it is used to communicate some fact (e.g. in a museum’s geological collection). In a similar vein we argue in this paper for a renewed consideration of the nature of text in the digital realm, or rather of the nature that has been imposed upon it by both information-technological and methodological constraints.

Text is by its nature both discrete and continuous. The glyphs of a writing system are combined into words, which are arranged into phrases, sentences, paragraphs, quotations, and so on; these discrete formations can be arbitrarily complex, but they are communicated through the atom of the glyph. On the other hand, many elements of the meaning of a text—e.g. narratological elements such as the theme of isolation in Do Androids Dream of Electric Sheep?, intertextual references and referents (Miola, 2000), the concept of the writerly text (Barthes, 1975), or even the visual presentation of individual glyphs—defy discrete boundaries. Yet the technologies we have used overwhelmingly for the large-scale production of text—from the printing press to signal transmissions and thence to digital models—treat text as a code of discrete characters; moreover, once text moves from “print” to “signal” (Petzold, 2000) it becomes specifically a single stream of these discrete characters. This already presents problems to scholars who wish to remediate texts that were never machine-encoded, such as manuscripts, into the digital domain for humanistic enquiry. To date, scholars who encounter these problems have generally considered them an acceptable trade-off for the benefits of access, exchange, and computational tractability that the digital medium offers.

If we want our digital model of a text to extend beyond the semiotic registration of a single stream of characters, however, problems emerge. McGann (2004) summarizes this well: “Print and manuscript technology represent efforts to mark natural language so that it can be preserved and transmitted. It is a technology that constrains the shapeshiftings of language, which is itself a special-purpose system for coding human communication. Exactly the same can be said of electronic encoding systems. In each case constraints are installed in order to facilitate operations that would otherwise be difficult or impossible.” Everyone who has ever worked on the remediation of a physically inscribed text into the digital medium has at some point implicitly conceptualized it as a single stream of discrete characters, simply because this is how we have always known text, ultimately, to be represented digitally. Transcription is thus exactly the sort of constraint to which McGann refers: there is a single stream of text, and only once it exists can it be subdivided and classified as having certain descriptive, structural, or logical characteristics, or even relationships to other portions of text.
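To make the constraint concrete, here is a minimal sketch in Python of this “single stream first, annotation afterwards” model; the text and the span labels are invented for the purpose of illustration:

```python
# A transcription exists first as one string of discrete characters;
# every structural or descriptive claim is layered on afterwards as a
# (start, end, label) span defined against that stream's offsets.
stream = "In principio creavit Deus caelum et terram."

name_at = stream.index("Deus")
annotations = [
    (0, len(stream), "sentence"),
    (name_at, name_at + len("Deus"), "persName"),
]

# Every annotation presupposes the stream: change one character and
# all subsequent offsets shift.
for start, end, label in annotations:
    print(f"{label!r} covers {stream[start:end]!r}")
```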

The model of text as a single stream of discrete characters also informs the concept of XML markup.1 As Pierazzo (2011) notes, “The editor will first transcribe a primary source, thereby creating a transcription; this transcription will be corrected, proofread, annotated, and then prepared for publication.” The predominant model for the process of annotation to which Pierazzo refers is TEI-XML, that is, embedded markup within the hierarchical structure required by XML. Conceptually, a tree structure expressed in XML can be freely reordered, and so is not bound to a single valid serialization. With text encoding, however, the ordering constraint remains: the single stream must be preserved, however the text is marked up and subdivided into branches. Standoff markup schemes similarly assert a single stream—each scheme relies on the stability of the range notation it uses to address the underlying text stream—but the markup itself is reorderable. This reorderable property of standoff markup takes us a small step closer to our goal: to recast and represent digital text as something more than a single stream of discrete tokens.
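The contrast can be sketched in a few lines of Python (the examples are invented): embedded markup interleaves structure with the character stream in one fixed serialization, while standoff markup leaves the stream untouched and points into it by offset, so that the annotation records themselves carry no inherent order:

```python
# Embedded markup: structure and character stream share one
# serialization, and with it one fixed ordering.
embedded = "<seg>Call me <persName>Ishmael</persName>.</seg>"

# Standoff markup: the stream stays a plain string; annotations are
# (start, end, tag) records pointing into it. The set of records has
# no inherent order, but each one depends on the stream's stability.
stream = "Call me Ishmael."
name_at = stream.index("Ishmael")
standoff = {
    (0, len(stream), "seg"),
    (name_at, name_at + len("Ishmael"), "persName"),
}

for start, end, tag in sorted(standoff):
    print(f"<{tag}> spans {stream[start:end]!r}")
```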

Standoff markup essentially applies a graph model. This is the feature of standoff that makes it so valuable to its proponents—in a graph, unlike a tree, overlapping ranges of markup are easily handled. Some of these standoff markup elements might even include alternative readings of a given text range, or a reference to a different text that elucidates the contextual meaning of the range in question. From there it is only a small step to the conception of a text that is itself more complex than a single stream. Such conceptions already emerge in the different implementations of variant graphs for collations of “multi-version documents” (Schmidt and Colomb, 2009; see also Dekker et al., 2015; Andrews and Macé, 2013; Jänicke et al., 2014). While so far the variant graph has been used only for word-by-word comparisons of text versions, the concept can be extended much further. We will demonstrate this, building on prior work in this direction (e.g. Marcoux, Sperberg-McQueen, and Huitfeldt, 2013). We will show in our paper, for instance, how a graph-based digital text can model collections of stories preserved in manuscripts without giving primacy to any one ordering of the story sequence; that is, it can contain both the collated text of each story and the variety of story orderings within the tradition. Likewise, the same narrative element, inserted into multiple stories, can be represented as such.
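As a minimal illustration of the principle (a sketch of ours, assuming the networkx library; the two witnesses and their texts are invented), a variant graph holds both versions of a sentence at once, and neither reading has primacy until a witness’s path is followed:

```python
import networkx as nx

# Nodes are readings; edges record which witnesses attest the
# transition. Text shared between witnesses converges on shared nodes.
G = nx.MultiDiGraph()
G.add_edge("START", "the", witnesses={"A", "B"})
G.add_edge("the", "quick", witnesses={"A"})
G.add_edge("the", "swift", witnesses={"B"})
G.add_edge("quick", "fox", witnesses={"A"})
G.add_edge("swift", "fox", witnesses={"B"})
G.add_edge("fox", "END", witnesses={"A", "B"})

def reading(G, witness):
    """Serialize one witness by following only the edges it attests."""
    node, words = "START", []
    while node != "END":
        for _, nxt, data in G.out_edges(node, data=True):
            if witness in data["witnesses"]:
                node = nxt
                break
        else:
            raise ValueError(f"no continuation for witness {witness!r}")
        if node != "END":
            words.append(node)
    return " ".join(words)

print(reading(G, "A"))  # the quick fox
print(reading(G, "B"))  # the swift fox
```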

The representation of information as entities with relationships between them—in other words: as a graph—is not a new idea. RDF works exactly this way, and the theoretical ideas behind hypertext are not dissimilar (cf. Nelson, 1993). The ease with which complex graphs can be modelled in software, however, has massively increased in the last five years. Moreover, it is now reasonably straightforward to create an initial graph representation from a TEI-encoded XML file, and nearly as straightforward to produce multiple “flavors” of TEI XML from a single graph—for example, a document-oriented representation alongside a logical-text-oriented representation, such as those created in the Faust edition project (Brüning et al., 2012).
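The first of these directions can be sketched as follows (our own toy fragment and property names, assuming networkx and the standard-library XML parser): the containment hierarchy of the TEI file becomes labelled nodes and edges, while the reading order survives as explicit “next” edges between token nodes rather than as a by-product of the serialization:

```python
import itertools
import xml.etree.ElementTree as ET
import networkx as nx

tei = "<p>He saw <persName>Petrus</persName> there.</p>"

G = nx.DiGraph()
ids = itertools.count()
tokens = []  # token node ids in document order

def add_tokens(text, parent):
    for form in (text or "").split():
        tid = next(ids)
        G.add_node(tid, label="token", form=form)
        G.add_edge(parent, tid, rel="contains")
        tokens.append(tid)

def ingest(elem, parent=None):
    nid = next(ids)
    G.add_node(nid, label=elem.tag)
    if parent is not None:
        G.add_edge(parent, nid, rel="contains")
    add_tokens(elem.text, nid)
    for child in elem:
        ingest(child, nid)
        add_tokens(child.tail, nid)  # text after a child belongs to the parent

ingest(ET.fromstring(tei))

# The reading order is now data in the graph, not an artefact of the file.
for a, b in zip(tokens, tokens[1:]):
    G.add_edge(a, b, rel="next")

print(" ".join(G.nodes[t]["form"] for t in tokens))  # He saw Petrus there.
```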

Graph representation does not relieve us of the constraint that digital text must be formed from discrete atomic units (characters) and discrete compound digital units. What it does allow us to do, however, is to digitally model the shades of meaning, ambiguity, and uncertainty of a text—the aspects of its meaning or interpretation that are not by their nature discrete at all. So far, uncertainty of interpretation has been all but explicitly avoided in the scholarly digital space. McGann (2004) observes: “In the case of a system like the Text Encoding Initiative (TEI), the system is designed to ‘disambiguate’ entirely the materials to be encoded.” The 1700+ pages of the TEI Guidelines bear witness to the difficulty of this task; element names and their intended usages are defined as precisely as possible, with illustrative examples.

The use of tags and attributes in a TEI-encoded file is itself an act of interpretation—it is precisely in the labelling of the elements of an encoding, and of the relationships between those elements, that much of the scholar’s interpretation of the encoded text lies. This is equally true of a graph encoding. The prolixity of the Guidelines is in fact a result of the attempt to constrain the scholar’s interpretation of a given tag to a disambiguated, and thereby discrete, set of meanings. Rather than shrinking from uncertainty of meaning, or resolving it in an occasionally misleading or (paradoxically) unhelpful attempt to ease computational analysis, how much better it would be if the uncertainty could be retained in the model, in a manner that is nonetheless machine-parseable. Interpretation remains inherent in the labels that are chosen for the properties in a graph model; these may be taken from the TEI, or they may reflect another interest of the scholar. Either way, a graph-based text model by its nature includes computationally tractable information about how these properties relate to each other.
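A small sketch of this last point (again assuming networkx; the “my:” vocabulary below is invented): if the labels themselves are nodes, the relation between a scholar’s bespoke terms and the TEI’s categories is itself queryable data rather than a convention buried in a schema:

```python
import networkx as nx

# One scholar tags with TEI element names, another with a bespoke
# vocabulary; an explicit edge records how the two vocabularies relate.
V = nx.DiGraph()
V.add_edge("my:saint-name", "tei:persName", rel="narrower_than")
V.add_edge("tei:persName", "tei:name", rel="narrower_than")

# Annotations made with the bespoke term remain retrievable through
# the broader TEI category.
print(nx.has_path(V, "my:saint-name", "tei:name"))  # True
```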

In representing “fuzziness” or ambiguity of interpretation, we can perhaps follow the lead of those who have struggled with similar problems in the representation of time (e.g. Grossner and Meeks, 2014). If text can be represented as a graph, mostly sequential but with scope for fluidity, then interpretative elements—say, picking out the theme of isolation within the text—can be represented as spanning a continuous or disjoint range of text sequences, beginning not before word A but certainly by word B, and continuing at least as far as word C but certainly not beyond word D. And so on. If scholars disagree on the correct identification of a person in a historical text, both hypotheses can be represented without innate primacy being given to one over the other, because a graph does not enforce a single ordering of its elements.
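Such a fuzzy span might be modelled as follows (a sketch under our own invented property names and token positions): the annotation records outer and inner bounds rather than a single crisp range, and membership becomes a tri-state judgment rather than a yes-or-no:

```python
tokens = "he wandered alone across the empty moor toward nothing".split()

# An interpretative span with fuzzy edges: it certainly holds from
# "alone" through "moor", may begin as early as "wandered", and may
# extend as far as "nothing".
theme = {
    "label": "theme:isolation",
    "starts_not_before": 1,  # word A
    "starts_by": 2,          # word B
    "ends_not_before": 6,    # word C
    "ends_by": 8,            # word D
}

def membership(i, span):
    """Tri-state membership: 'core', 'penumbra', or 'outside'."""
    if span["starts_by"] <= i <= span["ends_not_before"]:
        return "core"
    if span["starts_not_before"] <= i <= span["ends_by"]:
        return "penumbra"
    return "outside"

for i, tok in enumerate(tokens):
    print(f"{tok:10s} {membership(i, theme)}")
```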

The text model we are advocating, then, amounts to a form of knowledge representation; in representing “the text”, we are actually representing a collection of our knowledge, understanding, and interpretation of that text, in a form that can be analyzed and processed by a machine. The more we can encode texts as computable knowledge, the greater the power computational methods will have in textual scholarship of all forms.

Bibliography
  1. Andrews, T.L., Macé, C., 2013. “Beyond the tree of texts: Building an empirical model of scribal variation through graph analysis of texts and stemmata.” Literary and Linguistic Computing 28 (4): 504–521.
  2. Barthes, R., 1975. S/Z. Trans. Richard Miller. London: Cape.
  3. Briet, S., 1951. Qu’est-ce que la documentation? Paris: ÉDIT. Translated as Briet, S., What is Documentation?: English Translation of the Classic French Text. Lanham, MD: Scarecrow Press. Available at: http://ella.slis.indiana.edu/~roday/briet.htm [Accessed 1 November 2015].
  4. Brüning, G., et al., 2012. “On the dual nature of written texts and its implications for the encoding of genetic manuscripts.” Presented at Digital Humanities 2012, Hamburg.
  5. Dekker, R.H., van Hulle, D., Middell, G., Neyt, V., van Zundert, J., 2015. “Computer-supported collation of modern manuscripts: CollateX and the Beckett Digital Manuscript Project.” Digital Scholarship in the Humanities 30 (3): 452–470.
  6. Goldfarb, C.F., 1996. The Roots of SGML – A Personal Recollection. Available at: http://www.sgmlsource.com/history/roots.htm [Accessed 1 November 2015].
  7. Grossner, K. and Meeks, E., 2014. “Topotime: Representing Historical Temporality.” Presented at Digital Humanities 2014, Lausanne.
  8. Jänicke, S., Geßner, A., Büchler, M., Scheuermann, G., 2014. “5 Design Rules for Visualizing Text Variant Graphs.” Presented at Digital Humanities 2014, Lausanne.
  9. Marcoux, Y., Sperberg-McQueen, C.M., Huitfeldt, C., 2013. “Modeling overlapping structures: Graphs and serializability.” In Proceedings of Balisage: The Markup Conference 2013, Balisage Series on Markup Technologies. Montréal, Canada. Available at: http://www.balisage.net/Proceedings/vol10/html/Marcoux01/BalisageVol10-Marcoux01.html [Accessed 1 November 2015].
  10. McGann, J., 2004. Marking Texts of Many Dimensions. In S. Schreibman, R. Siemens, & J. Unsworth, eds. A Companion to Digital Humanities. Oxford: Blackwell. Available as chapter 16 at: http://www.digitalhumanities.org/companion/ [Accessed 1 November 2015].
  11. Miola, R.S., 2000. Seven Types of Intertextuality. In M. Marrapodi, ed. Shakespeare, Italy, and Intertextuality. Rome: Bulzoni Editore, pp. 23–38.
  12. Nelson, T.H., 1993. Literary Machines: The report on, and of, Project Xanadu concerning word processing, electronic publishing, hypertext, thinkertoys, tomorrow’s intellectual revolution, and certain other topics including knowledge, education and freedom. First published 1981. Mindful Press.
  13. Petzold, C., 2000. Code: The Hidden Language of Computer Hardware and Software. Redmond: Microsoft Press.
  14. Pierazzo, E., 2011. “A rationale of digital documentary editions”. Literary and Linguistic Computing 26 (4): 463–77.
  15. Renear, A., 2004. Text Encoding. In S. Schreibman, R. Siemens, & J. Unsworth, eds. A Companion to Digital Humanities. Oxford: Blackwell. Available as chapter 17 at: http://www.digitalhumanities.org/companion/ [Accessed 1 November 2015].
  16. Schmidt, D., Colomb, R., 2009. “A data structure for representing multi-version texts online.” International Journal of Human-Computer Studies 67: 497–514.
Notes
1. While this is not, as far as we are aware, stated explicitly anywhere, the underlying assumption is plain in, e.g., Goldfarb’s discussion of the roots of SGML (1996) or Renear’s definition of text encoding (2004).