This paper presents the results of the work our research team has carried out to develop an OWL 2 ontology that formally defines the semantics of the Text Encoding Initiative markup language. The preliminary steps of this research project were presented at the TEI Conference in 2014 and 2015 (Ciotti and Tomasi, 2014; Ciotti et al., 2015). We believe that our work has now reached a satisfactory level of development, both on the theoretical side and in the practical implementation.
The reasons to have a formal and machine-readable semantics for TEI are manifold. In the first place, we can set forth a list of pragmatic and technical benefits that have already been pointed out in many previous works dedicated to this topic, dating back to the mid-90s (Di Iorio, Peroni and Vitali, 2009; Ciotti and Tomasi, 2014). Here is a brief summary of those arguments:
The advantages envisioned in this list are not specific to the TEI, or are aimed at facilitating the relationships between different markup languages; some of the issues, however, have special relevance for the TEI and for its usage within its reference community.
Take for instance the query issue: we all know that there are many ways of expressing one and the same textual feature in TEI markup, so that it is very difficult to query heterogeneous TEI corpora and text archives. Having a set of ontological definitions of the conceptual level behind markup, that is, a set of shared formal definitions of the textual features to which any single encoding project could bind idiosyncratic markup usage, could help solve this problem. The same argument could be made for a far more adequate management of interoperability of TEI text collections between different repositories or applications.
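The binding idea can be illustrated with a minimal Python sketch. It is not part of the ontology described in this paper: the concept identifier, the two project names, their sample fragments and the `find_by_concept` helper are all hypothetical, and a plain dictionary stands in for the ontological definitions. It only shows how two different encodings of the same feature can be retrieved through one shared concept.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical shared concept, standing in for a class of the ontology.
ABBREVIATION = "concept:Abbreviation"

# Hypothetical per-project bindings:
# (element name, required attributes, shared concept).
BINDINGS = {
    "projectA": [("abbr", {}, ABBREVIATION)],
    "projectB": [("seg", {"type": "abbreviation"}, ABBREVIATION)],
}

# Hypothetical sample fragments, one per project.
DOCS = {
    "projectA": f'<p xmlns="{TEI_NS}">Sent by <abbr>Dr.</abbr> Rossi</p>',
    "projectB": (
        f'<p xmlns="{TEI_NS}">Seen in <seg type="abbreviation">Mr.</seg> '
        'Bianchi and <seg type="quote">ciao</seg></p>'
    ),
}


def find_by_concept(concept):
    """Return (project, text) pairs for every element bound to the concept."""
    hits = []
    for project, xml_text in DOCS.items():
        root = ET.fromstring(xml_text)
        for name, attrs, bound in BINDINGS[project]:
            if bound != concept:
                continue
            # Match on the namespaced element name, then on the attributes.
            for el in root.iter(f"{{{TEI_NS}}}{name}"):
                if all(el.get(k) == v for k, v in attrs.items()):
                    hits.append((project, el.text))
    return hits
```

Calling `find_by_concept(ABBREVIATION)` collects the abbreviation from both fragments, although one project uses `<abbr>` and the other a typed `<seg>`; the `<seg type="quote">` element is ignored because it is not bound to that concept.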
But we believe there is also a deeper theoretical and foundational advantage in the idea of an ontological semantic model for TEI. It is commonly acknowledged that the very core of the application of digital methods in humanities research is the notion of model/modeling. Regrettably, the pair of terms “model/modeling” is understood in many different ways in the community. We think that, as long as we use Turing-machine-like devices for computation, the only workable notion of modeling is a formal one: the models we should be interested in are formal models, where formalization is to be understood as a series of semiotic processes that generate an algorithmically computable representation of one or more phenomena or objects.
It is widely recognized that the TEI is not only a markup facility but first and foremost a conceptual model of textuality. In fact, the Guidelines (TEI Consortium, 2015, chap. 23) explicitly introduce the notion of a TEI Abstract Model. The problem is that the notion of an abstract model is used in many formal procedures, but the notion itself is not formally defined. This leads to many problems and circularities. We think that we need a formalized account of the quasi-formal notion of the TEI abstract model, if it is to be of any use other than as a sort of regulatory principle.
We do not advocate going back to a monist theory of textuality. Our suggestion to adopt contemporary Semantic Web formalisms to build this abstract conceptual model gives us the possibility of having a “foundation” of TEI in a well-defined data model that does not depend on the notion of a single “ordered hierarchy of content objects” (OHCO, DeRose et al., 1997), and that can accommodate, at least to some extent, the “pluralities” of textuality.
TEI as a whole is very complex, and its usage is governed by pragmatic and contextual requirements. We acknowledge that it is impossible to reduce this fuzzy cloud to a unique formal semantic definition. We can, however, identify a subset of shared assumptions, a common ground of notions about the meaning of TEI markup and the nature of document-like objects: we think that this subset can be the object of an ontological formalization. For various reasons we have adopted the TEI Simple customization (Cummings et al., 2014) as an acceptable approximation of this common ontology. This is not an opportunistic ad hoc choice, as it may seem. TEI Simple has in fact been defined by a group of domain experts who analyzed the actual usage of TEI markup in some large textual repositories and selected and organized a set of one hundred or so elements that can describe all the textual features represented by the markup in those documents. This fits well with the definition of a formal ontology development process.
The main design requirements for building our ontology have been the following:
In accordance with these overall principles we have decided to implement a complex architecture that uses some pre-existing meta-ontology frameworks to express the meaning of the TEI element set by way of the classes and properties they define. In particular we have adopted:
1) LA-EARMARK (Di Iorio, Peroni, Poggi, Vitali, 2011; Peroni, Gangemi, Vitali, 2011), a markup metalanguage that can express both the syntax and the semantics of markup as OWL assertions, and an ontology of markup that makes explicit the implicit assumptions of markup languages. LA-EARMARK is an extension of EARMARK with the Linguistic Act module of the Linguistic Meta-Model, which allows one to express and assess facts, constraints and rules about the markup structure as well as about the inherent semantics of the markup elements themselves.
2) Structural Pattern Ontology (Di Iorio, Peroni, Poggi, Vitali, 2014), whose goal is to identify a small number of patterns that are sufficient to express how the structure of digital documents can be segmented into atomic components.
The specification of markup semantics for the various TEI Simple elements is done by means of LA-EARMARK classes and properties. The general EARMARK class for any markup element is earmark:Element. The <abbr> element is defined as follows:
Prefix: earmark: <http://www.essepuntato.it/2008/12/earmark#>
Prefix: co: <http://purl.org/co/>
Prefix: tei: <http://www.tei-c.org/ns/1.0/>
Class: tei:abbr a
earmark:hasGeneralIdentifier "abbr" and
LA-EARMARK allows us to link particular classes of elements with the actual semantics they express. From our point of view there are at least two semantic levels that we explicitly define:
The TEI Semantics Ontology is the core component that gives the actual semantics of TEI elements. Its definition is based on a categorization of the elements of TEI Simple, which in turn derives from a refactoring of the TEI model classes.
The link between the classes describing kinds of elements and their related semantic characterization is made by means of the property “semiotics:expresses”. The association of semantics with markup elements can be contextualized according to a particular agent's point of view, in order to provide provenance data pointing to the entity that was responsible for the specification. This is possible by means of the Linguistic Act Ontology included in LA-EARMARK, which allows one to consider all these markup-to-semantics links as proper linguistic acts performed by some agent.
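A minimal Python sketch may help to picture the contextualization by agent. It does not reproduce the LA-EARMARK or Linguistic Act vocabularies: the `LinguisticAct` record, the `sem:` meaning names, the agent identifiers and the `expresses` helper are all hypothetical and are used here only to show how a markup-to-semantics link can carry its own provenance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinguisticAct:
    """One assertion, made by an agent, that an element class expresses a meaning."""
    element_class: str  # class of markup elements, e.g. "tei:abbr"
    meaning: str        # hypothetical semantic class name
    agent: str          # hypothetical identifier of the responsible agent


# Hypothetical assertions: two agents disagree on what <hi> expresses.
ACTS = [
    LinguisticAct("tei:abbr", "sem:Abbreviation", "agent:tei-simple-editors"),
    LinguisticAct("tei:hi", "sem:Highlighting", "agent:tei-simple-editors"),
    LinguisticAct("tei:hi", "sem:Emphasis", "agent:project-x"),
]


def expresses(element_class, agent=None):
    """Meanings expressed by an element class, optionally from one agent's view."""
    return sorted(
        act.meaning
        for act in ACTS
        if act.element_class == element_class
        and (agent is None or act.agent == agent)
    )
```

Without an agent, `expresses("tei:hi")` returns every meaning asserted for the element class; passing `agent="agent:project-x"` restricts the answer to the interpretation that this one agent is responsible for.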
The work we have done so far is limited to the Simple subset of TEI. We envision some further development:
We think that in the long term this ontological formalization could become the primary formalization of the TEI encoding scheme, independently of any serialization format. Today XML is still the best strategy for encoding digital texts in real-world projects, for many practical reasons. But there is no reason for the TEI to be strictly based on it, as it de facto is now. Technical issues should not determine the choice of a formalization language. In the end, we believe that our effort can make a substantial contribution to helping the TEI envision the shape of its own future.