DH 2016 Abstracts

Comparing Digital Scholarly Editions

1. Citing and comparing texts

Explicit and tacit assumptions of traditional text criticism have been questioned for decades,¹ but the creation of digital scholarly editions has provoked discussion ranging from practical questions of method to theoretical debate about the nature of an edition in an electronic environment.² In this paper, we address the question of what it means to compare two scholarly editions, and demonstrate applications of our approach by comparing manuscripts of the Iliad.

One characteristic of a scholarly edition in any medium is that it makes a text canonically citable. Canonical citation is an essential prerequisite for comparison: in order to compare units more meaningful than streams of characters, we must be able to identify and align passages in different versions of a text. We use the technology-independent CTS URN notation³ to identify books and lines of the Iliad. CTS URNs could be applied to any citation scheme, but logical schemes (such as those typically used to cite biblical passages by chapter and verse, or legal documents by numbered section and subsection) are superior to arbitrary schemes based on physical artifacts (such as citing Plato by Stephanus page, or referring to the physical page of a specific edition of the works of Jane Austen), since they ensure that our comparison is organized in chunks of text meaningful for scholarly analysis.
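As a rough illustration of how a canonical citation can be taken apart for alignment, the following Python sketch splits a CTS URN into its colon-delimited components and extracts the passage reference. The example URN (including the version label `msA`) and the names `CtsUrn`, `parse_cts_urn` and `passage_key` are assumptions made for illustration; the CTS URN documentation remains the authority on the full syntax, which this sketch does not cover.

```python
from typing import NamedTuple


class CtsUrn(NamedTuple):
    namespace: str
    work: str      # dot-separated text group, work and optional version
    passage: str   # dot-separated citation, e.g. book.line


def parse_cts_urn(urn: str) -> CtsUrn:
    """Split a URN of the form urn:cts:NAMESPACE:WORK:PASSAGE."""
    parts = urn.split(":")
    if len(parts) != 5 or parts[0] != "urn" or parts[1] != "cts":
        raise ValueError(f"not a CTS URN with a passage component: {urn}")
    return CtsUrn(namespace=parts[2], work=parts[3], passage=parts[4])


def passage_key(urn: str) -> str:
    """Return the version-independent passage reference used for alignment."""
    return parse_cts_urn(urn).passage


# Assumed example value: a line of Iliad 8 in one manuscript version.
example = "urn:cts:greekLit:tlg0012.tlg001.msA:8.466"
```

Because the passage component is independent of the version, two versions of the same work can be aligned on `passage_key` alone.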

2. A model for text comparison

Traditional Homeric scholarship offers a useful model for comparing aligned citable texts. Homerists use the terms “vertical difference” and “horizontal difference” to describe two kinds of variety: “vertical difference” refers to entire lines that are present in one text but absent in the other, or that occur in a different sequence in the two texts. “Horizontal difference” refers to lexical differences within a single line. We can generalize the two dimensions of this approach, and understand our comparison of texts as determining the structural and lexical variation between texts.

Structural differences are simply differences in citation structure. For texts cited by CTS URNs, then, we can reduce the determination of structural differences to a comparison of ordered lists of each document’s CTS URNs.
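A minimal sketch of this reduction, assuming each document has already been reduced to an ordered list of passage references: Python's standard `difflib.SequenceMatcher` is used here as a generic list-comparison tool and is not the library described in this paper. The function name `structural_diff` and the sample data are hypothetical.

```python
from difflib import SequenceMatcher


def structural_diff(refs_a, refs_b):
    """Report runs of citation units where list A and list B diverge.

    Each entry is (operation, units from A, units from B), where the
    operation is 'delete', 'insert' or 'replace' relative to A.
    """
    matcher = SequenceMatcher(a=refs_a, b=refs_b, autojunk=False)
    report = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            report.append((tag, refs_a[i1:i2], refs_b[j1:j2]))
    return report


# Hypothetical data: text A contains three lines that text B lacks.
text_a = ["8.465", "8.466", "8.467", "8.468", "8.469"]
text_b = ["8.465", "8.469"]
```

No reading of the text itself is consulted at this stage: the comparison sees only the citation values.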

Lexical differences are differences in the readings within a single citation unit (a line of the Iliad, a subsection of a legal text, etc.). The crucial question is: what produces a “reading”? Simply comparing streams of characters, or assuming that a stream of characters can be parsed into tokens by some criterion such as splitting on white space or punctuation, is dangerously underconceptualized. Instead, we recognize that any analysis that tokenizes a citable unit of text for a specified purpose produces an ordered list of tokens, and that we can compare these lists in order to determine lexical differences between two texts. The lexical type of the token will depend on the goal of the comparison. As we subsequently illustrate, for example, we could analyze literal textual tokens, orthographically normalized tokens, or even morphologically or metrically analyzed tokens.
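The idea that a tokenization is simply a function from a citable unit to an ordered list of tokens can be sketched as follows. Both functions are illustrative assumptions: in particular, `normalized_tokens` crudely strips all diacritics and punctuation, which is only a stand-in for a properly specified orthographic normalization.

```python
import unicodedata


def literal_tokens(text):
    """Literal tokens: split on white space, keep everything else as written."""
    return text.split()


def normalized_tokens(text):
    """Crudely normalized tokens: no diacritics, no punctuation, case-folded."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = []
    for ch in decomposed:
        category = unicodedata.category(ch)
        if category.startswith("M") or category.startswith("P"):
            continue  # drop combining marks (accents, breathings) and punctuation
        kept.append(ch)
    return "".join(kept).casefold().split()


# Illustrative strings that differ only in punctuation.
with_comma = "ἡνία σιγαλόεντα,"
without_comma = "ἡνία σιγαλόεντα"
```

Under `literal_tokens` the two sample strings differ in their final token; under `normalized_tokens` they yield the same list. Which result counts as a "difference" depends entirely on the tokenization chosen.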

3. Implementing the model

When we determine structural (“vertical”) variation by comparing ordered lists of URNs and lexical (“horizontal”) variation by comparing ordered lists of tokens for each citable unit of text, we are performing exactly the same operation: comparison of ordered lists. This is one of the most studied and best understood problems in computer science, and a typical exercise in first-year programming courses. We have implemented the standard algorithm for Longest Common Subsequence (LCS)⁴ in a library freely available in source code or binary .jar for JVM languages. In addition to solving the LCS, the library determines what items appear in one list but not the other, and what items appear in both lists but in a different order.
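The following Python sketch shows the kind of result described here, using the textbook dynamic-programming solution to the LCS problem. It is not the JVM library itself: the function names and the shape of the returned dictionary are assumptions, and the "relocated" calculation assumes that items are unique within each list (true of URNs, not necessarily of tokens).

```python
def lcs(a, b):
    """Return one longest common subsequence of two lists."""
    rows, cols = len(a), len(b)
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            if a[i] == b[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])
    # Walk back from the bottom-right corner to recover the subsequence.
    result = []
    i, j = rows, cols
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            result.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return result[::-1]


def compare(a, b):
    """Summarize how two ordered lists of unique items differ."""
    common = lcs(a, b)
    return {
        "lcs": common,
        "only_a": [x for x in a if x not in b],
        "only_b": [x for x in b if x not in a],
        # Present in both lists but outside the LCS: occurs in a different order.
        "relocated": [x for x in a if x in b and x not in common],
    }
```

The same `compare` function can be applied unchanged to lists of passage references and to lists of tokens within a single passage.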

4. Applying the model to the Iliad

We illustrate the possibilities of this approach by repeatedly comparing incompletely published manuscripts of the Iliad, focusing especially on Iliad 8 in two manuscripts, one in the Biblioteca Marciana in Venice and one in the Escorial monastery near Madrid. All of our comparisons find the same structural differences. (The run of lines Iliad 8.466-8.468 is present in some manuscripts, for example, but absent from others.) The lexical comparisons, on the other hand, vary depending on the features we analyze.

We begin with a simple tokenization of the literal diplomatic text split on white space. Inventorying the tokens in each manuscript is essentially the collation phase of a traditional edition, but when we fully account for differences in punctuation, accent, abbreviation and spelling, the vast quantity of differences between manuscripts informs us about aspects of Byzantine orthographic practice that are normally suppressed in critical editions.

We next tokenize the same text to a normalized orthography, eliminating punctuation and adapting both accents and spelling to modern conventions. This comparison most closely approaches what we find in a typical critical edition (except that its listing of tokens present, absent or relocated in different manuscripts is comprehensive rather than selective). Viewed from this perspective, with orthographic differences removed, we find much greater agreement in the text of the Venice and Escorial manuscripts, although we still find passages like 8.137 where the reins of Nestor’s chariot are either “shining” (σιγαλόεντα) or “red-purple” (φοινικόεντα).

This comparison also reports differences in passages like Iliad 9.3, where a few manuscripts have βεβλήατο against the majority with βεβολήατο. The “differences” are actually equivalent forms of the same verb (an epic pluperfect of βάλλω). Depending on our interests, we might prefer to view these two literal variants as identical. We next tokenize the text not to representations of the specific form found in the text, but instead to the lexical entity (“dictionary form”) from which the word derives. In this tokenization, the same lexical entity is given for each of the two variant forms, and the lines are, by this reading, equivalent.
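A minimal sketch of this lexical tokenization, assuming a hand-made lookup table in place of a real morphological parser. The table `LEMMA_TABLE` and the function `lemma_tokens` are hypothetical; the two entries are the forms discussed above.

```python
import unicodedata

# Hypothetical lookup from surface form to dictionary form. A real system
# would consult a morphological parser rather than a hand-made table.
LEMMA_TABLE = {
    "βεβλήατο": "βάλλω",
    "βεβολήατο": "βάλλω",
}


def _nfc(text):
    """Normalize to NFC so that equivalent Unicode encodings compare equal."""
    return unicodedata.normalize("NFC", text)


_LOOKUP = {_nfc(form): _nfc(lemma) for form, lemma in LEMMA_TABLE.items()}


def lemma_tokens(tokens):
    """Map each token to its lexical entity; unknown forms pass through."""
    return [_LOOKUP.get(_nfc(token), _nfc(token)) for token in tokens]
```

With this tokenization, `lemma_tokens(["βεβλήατο"])` and `lemma_tokens(["βεβολήατο"])` produce equal lists, so a list comparison reports no difference between the two readings.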

Since the formulaic variation illustrated by different readings for the same passage is metrically conditioned, we might also want to tokenize the text to metrical units. Like the preceding tokenization to an abstract lexical entity, this is often considered beyond the scope of a critical edition, but we do not need to make any procedural distinction in our digital comparison. Reading the same line 9.3 metrically, for example, we tokenize the dactylic hexameter into six metrical feet. In the majority manuscripts with βεβολήατο, we “read” the text with a dactyl in the third foot,

—⏖ —— —⏖ —⏖ —⏖ —×

while in the minority manuscripts with βεβλήατο we read a spondee

—⏖ —— —— —⏖ —⏖ —×
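The two scansions above can be compared as ordered lists of feet in exactly the same way as any other token list. In this sketch the scansions are transcribed by hand into letter codes (D for a dactyl, S for a spondee, X for the final foot); the codes and function names are assumptions for illustration, and no automatic scansion is attempted.

```python
FOOT_NAMES = {"D": "dactyl", "S": "spondee", "X": "final foot"}


def metrical_tokens(scansion):
    """Turn a string of foot codes into an ordered list of foot names."""
    return [FOOT_NAMES[code] for code in scansion]


def differing_feet(scansion_a, scansion_b):
    """Return (foot number, foot in A, foot in B) wherever two lines differ."""
    feet_a = metrical_tokens(scansion_a)
    feet_b = metrical_tokens(scansion_b)
    if len(feet_a) != 6 or len(feet_b) != 6:
        raise ValueError("a dactylic hexameter has six feet")
    return [
        (number, foot_a, foot_b)
        for number, (foot_a, foot_b) in enumerate(zip(feet_a, feet_b), start=1)
        if foot_a != foot_b
    ]


majority = "DSDDDX"  # reading with βεβολήατο, transcribed from the scansion above
minority = "DSSDDX"  # reading with βεβλήατο
```

Calling `differing_feet(majority, minority)` isolates the third foot as the only point of metrical divergence between the two readings.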

Each of these comparisons captures a distinct feature of the text. In every case, the analyses are keyed to the CTS URN of the text they analyze, so we can readily combine and compare the results of distinct analyses. In the Iliadic examples, we could equally easily identify passages that are metrically identical with either different vocabulary items or different forms of the same vocabulary item; or, as in 9.3, metrically distinct passages with identical vocabulary in different forms.
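Because every analysis is keyed to the same passage reference, results can be combined with ordinary dictionary operations. The sketch below is one hypothetical way to do so: the data structure, the function names and the sample entries for 9.3 are assumptions, and only the variant word is listed in the literal and lemma analyses.

```python
# Hypothetical analyses of two versions, keyed by passage reference.
version_a = {
    "9.3": {"literal": ["βεβολήατο"], "lemma": ["βάλλω"], "meter": list("DSDDDX")},
}
version_b = {
    "9.3": {"literal": ["βεβλήατο"], "lemma": ["βάλλω"], "meter": list("DSSDDX")},
}


def agreement_profile(a, b):
    """For every passage in both versions, record which analyses agree."""
    profile = {}
    for passage in a.keys() & b.keys():
        shared = a[passage].keys() & b[passage].keys()
        profile[passage] = {
            name: a[passage][name] == b[passage][name] for name in shared
        }
    return profile


def select(profile, **wanted):
    """Return passages whose agreement flags match the requested values."""
    return sorted(
        passage
        for passage, flags in profile.items()
        if all(flags.get(name) == value for name, value in wanted.items())
    )
```

For example, `select(agreement_profile(version_a, version_b), lemma=True, meter=False)` picks out passages with identical vocabulary but a different metrical shape.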

5. Rethinking digital editions

We recognize, as others have before, that many assumptions in the traditional practice of critical editing are self-contradictory and unnecessary in a digital environment. Digital editors are not constrained to eliminate the evidence of manuscripts judged not valuable (eliminatio codicum descriptorum); they do not have to select only significant variants (selectio) based on the evaluation of a limited set of crucial passages (examinatio locorum criticorum); they do not have to present material supporting their editorial choices in a critical apparatus that is flawed both by its circular logic of selectively publishing evidence and by its notational deficiency (a deficiency that has been clearly recognized only when the apparatus is computationally processed).⁵ One of the most significant consequences of working with scholarly editions in a digital environment is the potential to automate systematic and comprehensive comparisons of various classes of features across a set of full diplomatic editions.

The comparisons of Iliadic manuscripts presented here further show that the analysis underlying a traditional critical edition is functionally no different from any other kind of analytical comparison: critical editing is one approach to analyzing a comparable set of texts. Our model of textual comparison compels us to specify unambiguously the process that generates our sequence of lexical tokens. This permits us to apply completely generic tools for comparing ordered lists, and to construct increasingly complex cascades of aligned analyses.

Bibliography

Boschetti, F. (2007). Methods to extend Greek and Latin corpora with variants and conjectures: Mapping critical apparatuses onto reference text. In Proceedings of the Corpus Linguistics Conference. CL2007. Birmingham. http://ucrel.lancs.ac.uk/publications/CL2007/paper/150_Paper.pdf (accessed March 4, 2016).
Pasquali, G. (1934). Storia della tradizione e critica del testo. Florence.
Pierazzo, E. (Ed.). (2015). Digital Scholarly Editing: Theories, Models and Methods. Ashgate.
Sutherland, K. and Deegan, M. (Eds.). (2009). Text Editing, Print and the Digital World. Ashgate.

Notes

¹ As early as 1934, Pasquali pointed out that the necessary assumption of the “Lachmannian method”, that each copy derives from a single archetype, is demonstrably wrong in many instances (Pasquali, 1934).

² See for example the collection of essays edited by Sutherland and Deegan (Sutherland and Deegan, 2009), or the recent broad survey assembled by Pierazzo (Pierazzo, 2015).

³ On CTS URNs, see http://cite-architecture.github.io/ctsurn/.

⁴ https://en.wikipedia.org/wiki/Longest_common_subsequence_problem

⁵ An important but largely unrecognized implication of Federico Boschetti’s work parsing a critical apparatus of Aeschylus is that more than 10% of the entries in the critical apparatus were not expressed clearly enough to be correctly mapped onto the section of the main text they annotate. This was not due to lack of diligence by the editors: rather, it reflects the notational ambiguity of the traditional apparatus (Boschetti, 2007).