DH 2016 Abstracts

Linked Ancient World Data: Relating the Past

Bringing together researchers from a wide variety of disciplinary backgrounds, this panel discusses the newly emerging ecosystem of Linked Ancient World Data projects and resources. One of the fastest growing areas of Digital Humanities, Linked Open Data (LOD) has the potential to transform traditional scholarship through its ability to promote the discovery of, and connections between, online documents of a highly varied nature (texts, maps, databases, images, etc.). Yet many barriers to its uptake and application, both technical and human, need to be addressed before this potential can be realised. This panel explores the issues relating to LOD, the semantic web and RDF technologies by focusing on three case studies drawn from the ancient world of literature, history and archaeology: the SNAP, Pelagios, and Integrating Digital Epigraphies (IDEs) projects.

Each of these three projects takes a different focus for its linking strategy: SNAP aims to connect documents through the people mentioned within them (prosopographies and onomastica); Pelagios through places (maps and gazetteers); and IDEs tackles different kinds of written material that survive from the ancient world (inscriptions and papyri). The projects are all united, however, in a concern for the use of, and access to, massive and diverse datasets that cannot be curated, aggregated or even archived in a single location. One major challenge to be addressed is the inherited scholarly infrastructure that tends to shoehorn multiple projects into a single institution’s repository and data model. These projects and their participants are also concerned with issues far beyond their primary subject area: the interoperability of bibliographical references, citations of ancient sources, encoding of date and time, events and actors, material objects and their curatorial history all contribute to the study and understanding of the ancient world (and mutatis mutandis of any other). All also recognise that there is no firm demarcation between the cultures of the Mediterranean in the classical period, nor between the worlds and cultures bordering them in time and space. Data from the Bronze Age and Mediaeval periods can be read profitably in and against this classical focus; our sources do not exist in a vacuum that can be entirely insulated from the ancient Near and Far East, sub-Saharan Africa or pre-Columbian America, for example. The three papers in this panel will all discuss how they are deploying the formats and technologies of Linked Open Data to address the massive and multidisciplinary interoperability that these historical challenges require.

Paper 1.

Networking Ancient Person-data: community building and user studies around the SNAP:DRGN project

Gabriel Bodard (University of London)

Tom Gheldof (KU Leuven)

Faith Lawrence (King’s College London)

Simona Stoyanova (King’s College London)

Charlotte Tupman (University of Exeter)

The Standards for Networking Ancient Prosopographies (SNAP: http://snapdrgn.net) project is using linked open data (LOD) to build a virtual authority list for ancient people through aggregation of common information from collaborating projects. A unified authority of ancient persons will serve as a convenient and powerful single resource for prosopographers, text editors and scholars to use for disambiguating person references by means of annotations that record the specific URI of a person identified by the SNAP graph.

The objective is neither to create a new universal dataset of historical persons, nor to ingest or supplant the many valuable prosopographical resources, both analogue and digital, created over the past many years. Rather, through the creation of a single entry point—and related identifier—coupled with a small subset of common fields made available both to human researchers and for automated processing, SNAP aims to facilitate interoperability and interchange, exploitation and discovery through common metadata, and the recording of both known and newly discovered relationships between person records. Users will be enabled and encouraged to (a) annotate their data with SNAP URIs to disambiguate person references, and (b) add structured commentary to the SNAP graph in the form of scholarly assertions, bibliography and apparatus.

This paper will outline our efforts to engage both the scholarly community and the wider public in the development of the SNAP model, and discuss the importance of user analysis and feedback for the design and functionality of the user interface and research tools. It is essential, both for the utility of the project and to encourage scholarly uptake of the person data and use of the virtual authority records, that we base our development on consultation with our anticipated user communities, both scholarly and beyond.

In the first phase of the project, SNAP began to address the core issue of linking together large datasets containing information about persons, names and person-like entities (families, associations, deities, anthropomorphic animals) managed in heterogeneous systems and formats. Ambiguous co-referencing is a ubiquitous issue within the linked data world; how does a researcher or analyst determine whether two records refer to the same person or are related in some other way? Even more trickily, what other related information referring to one record can be said equally to apply to both people? The SNAP dataset attempts to address this issue by retaining scholarly metadata around the assertion of co-references and relationships, so that ambiguity, disagreement and academic justification can be recorded alongside all statements that can potentially lead to inference of new relationships.
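The idea of keeping scholarly metadata alongside each co-reference can be illustrated with a minimal Python sketch. It is not SNAP's data model: the class, the field names, the relation labels and the example.org URIs are all hypothetical, chosen only to show how recorded disagreement can stop an identity from feeding further inference.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class CoReference:
    """One scholar's claim about how two person records relate (hypothetical model)."""
    subject: str             # URI of the first person record
    relation: str            # e.g. "same-as", "different-from", "father-of"
    obj: str                 # URI of the second person record
    asserted_by: str         # person or project making the claim
    certainty: str           # "certain", "probable" or "doubtful"
    justification: str = ""  # reasoning or bibliography behind the claim


def safe_identities(claims: List[CoReference]) -> Set[Tuple[str, str]]:
    """Pairs asserted as certainly identical and not rejected by anyone."""
    def pair(claim: CoReference) -> Tuple[str, str]:
        # Order the two URIs so that (a, b) and (b, a) compare equal.
        first, second = sorted((claim.subject, claim.obj))
        return (first, second)

    accepted = {pair(c) for c in claims
                if c.relation == "same-as" and c.certainty == "certain"}
    rejected = {pair(c) for c in claims if c.relation == "different-from"}
    return accepted - rejected


# Illustrative data only: the URIs and editors are invented.
claims = [
    CoReference("http://example.org/a/1", "same-as", "http://example.org/b/7",
                "Editor A", "certain", "same name, patronymic and deme"),
    CoReference("http://example.org/b/7", "different-from", "http://example.org/a/1",
                "Editor B", "probable", "attestations a generation apart"),
    CoReference("http://example.org/a/2", "same-as", "http://example.org/b/9",
                "Editor A", "certain"),
]

for first, second in sorted(safe_identities(claims)):
    print(first, "=", second)
```

Because every claim keeps its author and justification, the contested pair is still available for display and review; it is simply excluded from the set used to infer new relationships.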

SNAP models a simple structure using Web and LOD technologies to represent relationships between databases and to link from references in primary texts to authoritative lists of persons and names. The core of its source material was built around three large historical prosopographies and onomastica (databases of persons and names) from the ancient world: the Lexicon of Greek Personal Names, an Oxford-based corpus of some 300,000 persons mentioned in ancient Greek texts (http://www.lgpn.ox.ac.uk/); Trismegistos, a Leuven-run database of over half a million names and persons from Egyptian documents (http://www.trismegistos.org/); and Prosopographia Imperii Romani, a series of printed books listing senators and other elites from the first three centuries of the Roman Empire (http://pir.bbaw.de/). With several other more specialist databases of ancient and Mediaeval persons, museum and library catalogues, and digital editions contributing data or in the process of converting their data to the RDF format SNAP requires, the virtual authority will soon record well over a million person URIs. Due to the focus on historical datasets, we are able to address wider issues of dealing with person data without the ethical and privacy concerns raised by that gained from modern social networks. While still massive in scale, the amount of data under discussion is tractable, allowing for more academic coherence and review within the data, which, diverse as it is, is produced by a discipline with well-established working practices.

During this first phase, SNAP held a number of meetings and presentations to introduce the principles of and the preliminary work done by the project in its pilot period, and to hear from potential project partners about their datasets, practices and reactions to our proposals. These discussions led to a greater understanding of the nature of the prosopographical materials as well as helping to identify future partners. In addition, they provided the opportunity to advise participants on how to present the relevant subset of their data for SNAP import, so that further datasets can be ingested, by demonstrating the SNAP Cookbook (http://snapdrgn.net/cookbook/), which sets out details of several scenarios for the encoding, publication and linking of ancient person data in RDF, and connecting them to the SNAP graph.

The second phase of the SNAP project focuses further on the ingest, creation and linking of a much wider range of person data, using a range of methodologies including named entity recognition (NER), both hand and machine-assisted curation of person references from large corpora including inscriptions and mythological sources, and the ingest of data from existing projects via prosopographical tool kits such as the Berlin-Brandenburg Academy’s Personendaten Repositorium (http://pdr.bbaw.de/) and the Berkeley Prosopography Services (http://berkeleyprosopography.org/).

At the same time, we aim to enable a wider interchange of data and discovery of related materials as part of the larger Linked Ancient World Data community. In order to engage scholars and other interested parties in both creating and linking to SNAP identifiers, and to support and encourage the use of research tools and interfaces, we will hold a series of user analysis and engagement workshops, focusing on scholarly feedback on the SNAP web interface, API and widgets, the use of disambiguating annotation, and creation of and engagement with structured commentary. These workshops will help us assess and better understand the expectations and needs of our user communities with regards to both infrastructure and support, and sustained engagement with the project. User analysis is vital for developing an understanding of how users interact with the data and the current tools that are available for working with prosopographical data. We need to understand the goals and workflows of our user groups in order to meet their scholarly needs, and to create effective methods for attracting and maintaining engagement with the next phase of the project.

The issues being investigated by SNAP have implications beyond the bounds of this particular project: many of the issues such as variant name spellings, persons with changing or ambiguous names, uncertain identities and relationships, and tracking assertions about persons and the cascading inferences resulting from scholarly or editorial decisions, are precisely the same questions that concern both professional and amateur groups working on person-identification, including local historians, family historians, genealogists and graveyard conservationists. This has two major implications: first, as the work being done by SNAP is likely to reach far beyond its immediate subject area, this must be reflected in the way the project is conducted and disseminated to a variety of groups; and second, we must ensure that our user engagement workshops therefore include not only scholars but members of the public whose interests overlap with those of the SNAP project in these and other ways. This paper will seek audience discussion and advice about how best to ensure that the second phase of the project can meet such potentially diverse user needs.

Paper 2.

Early Geographic Documents and the Pelagios Commons

Leif Isaksen (Lancaster University)

Rainer Simon (AIT Austrian Institute of Technology)

Elton Barker (The Open University)

Pau de Soto Cañamares (Institute of Catalan Studies)

Pelagios is an international initiative concerned with the development of LOD methods, tools and services so as to better interconnect the vast and ever-growing range of historical resources online. Specifically, it uses the Open Annotation RDF ontology (http://www.openannotation.org/spec/core/) to associate place references within those resources to online gazetteers that offer URI-based identifiers for such places. The resulting graph is then exploited in a variety of ways to facilitate research, teaching and public engagement. The Pelagios 3 project expanded the scope of Pelagios dramatically from its original focus on classical antiquity, to encompass the early geographic documents of the pre-modern era, including early Christian, Islamic and Chinese traditions. It addressed three critical challenges for stimulating activity in these areas:

First, we developed user-friendly Web-based and Open Source software tools for the production and exploration of Pelagios LOD. Recogito (http://pelagios.org/recogito/) is a Web-based tool for the semi-automatic annotation of place references. It features several work areas, dedicated to different stages of the geo-annotation workflow: (i) a text annotation area to identify place names in digital texts or tabular documents (optionally aided by automatic Named Entity Recognition); (ii) an image annotation area to mark up and transcribe place names on high-resolution map or manuscript scans; (iii) a geo-resolution area, where identified (and transcribed) place names are mapped to a gazetteer, supported by an automated suggestion system. Recogito also provides basic features for managing documents and their metadata, as well as for viewing annotation results and usage statistics, and for bulk-downloading annotation data. Peripleo (http://pelagios.org/peripleo/map) is a spatio-temporal search engine for exploring the annotation data produced through Pelagios 3, as well as by the Pelagios community at large. Its user interface resembles that of Google Maps, and allows for free browsing as well as keyword and full-text search, while offering additional filtering options based on time, data source and object type.

Second, we carried out extensive annotation both in house and through independent contributors, so as to provide a ‘critical mass’ of annotated text and map documents that would attract contributions from other data curators. Over the course of the project 90 registered editors identified approximately 130,000 place references in 317 early geographic documents in 8 languages. About half of these were manually inspected for association with a gazetteer. Around 60 institutional or personal partners have contributed to Pelagios to date, with a similar number expressing interest in doing so. We believe this offers substantial evidence that LOD approaches do not of necessity impose high barriers to entry, and that on-ramps to semantic technologies can be offered at varying levels of complexity.

Third, we developed a mechanism for enabling different gazetteers (each serving their particular community) to be interoperable, allowing for interlinking between data from divergent traditions. This has been achieved through the development of the Pelagios Gazetteer Interconnection Format, which provides baseline requirements and optional additions for gazetteers to interoperate (https://github.com/pelagios/pelagios-cookbook/wiki/Pelagios-Gazetteer-Interconnection-Format). While such decentralized models for key infrastructure are both evolving and not without their challenges and risks, they offer significant potential for resolving conventional problems with enforcing universal standards across multiple domains and communities of practice.
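The core linking pattern described above, in which a place reference inside a resource is associated with a gazetteer URI by means of an Open Annotation, can be sketched in Python as a JSON-LD document. This is an illustration rather than output from Recogito or Peripleo: the function name, the annotation and source URIs (example.org) and the choice of gazetteer entry are assumptions; only the `oa:` namespace and the `oa:Annotation`, `oa:hasTarget` and `oa:hasBody` terms come from the Open Annotation vocabulary.

```python
import json

OA = "http://www.w3.org/ns/oa#"  # Open Annotation namespace


def place_annotation(annotation_uri: str, source_uri: str, gazetteer_uri: str) -> dict:
    """Build a JSON-LD annotation linking a passage to a gazetteer place."""
    return {
        "@context": {"oa": OA},
        "@id": annotation_uri,
        "@type": "oa:Annotation",
        # The target is the place reference in the annotated resource.
        "oa:hasTarget": {"@id": source_uri},
        # The body is the gazetteer URI that identifies the place.
        "oa:hasBody": {"@id": gazetteer_uri},
    }


# Illustrative values only: the first two URIs are invented, and the
# gazetteer URI stands in for whichever entry an editor would select.
example = place_annotation(
    "http://example.org/annotations/1",
    "http://example.org/texts/herodotus-1#passage-59",
    "http://pleiades.stoa.org/places/579885",
)
print(json.dumps(example, indent=2))
```

Because the annotation carries nothing but two URIs and a type, any gazetteer that publishes stable identifiers can serve as the body, which is what allows several gazetteers to coexist.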

Consequently, Pelagios has generated sustained and lively community interest, and has offered a pioneering model for other LOD initiatives which are semantically annotating different reference types from people to time periods, including PeriodO (http://perio.do/), SNAP (https://snapdrgn.net/), PastPlace (http://www.pastplace.org/) and al-Thurayya (http://maximromanov.github.io/projects/althurayya_02/). The success of the Pelagios approach has also attracted funding for academic research into early geographic documents through the Pelagios 4 project, which is working with specialists in historical geography to identify both the advantages and limitations of semantic annotation for comparative studies and visual and statistical analyses. Topics range from the significance of hazard depictions on medieval portolan charts to the reliability of textual sources as proxies for the missing sections of the only extant Roman world map. In addition to this academic research, the SEA CHANGE project trialled crowdsourcing workshops for the use of semantic annotation in Higher Education in collaboration with i3 Mainz and the University of Heidelberg (http://pelagios-project.blogspot.co.uk/2014/11/bringing-about-sea-change.html).

In parallel with these developments a community of practitioners has emerged with interests in a range of related activities: the annotation of curated or third-party content; the production of specialist gazetteers; the integration of place annotations with those of people, periods and things; and the visualization and analysis of graph-based data, to name but a few. Since its early stages Pelagios has made concerted efforts to consult and support such stakeholders, but as it has grown new opportunities and challenges have emerged. In particular, we have established that within a heritage context, one of LOD’s key advantages is its ability to relate independently maintained projects without requiring a single centralized authority. But what are the social ramifications of such an approach? In a world in which funding criteria, academic legitimacy, intellectual property, and even conference presentations presume the authority of individuals and institutions, can LOD communities ever scale effectively? In order to do so, individual stakeholders will need to shoulder responsibility for specific services within them. These may be the provision of content, real-time search aggregators, or dynamic real-time operations that offer visualization and analysis of heterogeneous material. For some, reliability will be more important than complexity or innovation, while others will pioneer new strategies at the cost of longevity or broad usership. Negotiating the relationships between these stakeholders (technically, socially and legally) is perhaps the greatest task ahead for those seeking to establish Linked Open Data as a principal mechanism for drawing together, if not necessarily synthesizing, information about the past.

In addition to reviewing the outputs of Pelagios 3 and Pelagios 4, this paper will report on early developments within Pelagios Commons, a new phase of Pelagios which focuses explicitly on increasing its technical and social decentralization. This phase extends beyond the current pre-Modern and literary scope, in order to embrace later periods, differing scales of geography (from intra-urban to multi-regional) and the conceptual challenges of dealing with arbitrary findspots and mythical, fictional and itinerant places. The paper will present our experiences in establishing Special Interest Groups, and the different challenges faced in devolving LOD architectures, as well as lessons learned from similar initiatives. In doing so we hope to foster discussion and critique from those planning or implementing related community-driven projects.

Acknowledgments: Pelagios 3 was funded by the Andrew W. Mellon Foundation. Pelagios 4 was funded by the AHRC. SEA CHANGE received an Open Humanities Award from DM2E and the Open Knowledge Foundation. We thank all our financial supporters and content contributors for the collective production of this work.

Paper 3.

Integrating Digital Epigraphies

Hugh Cayless (Duke University)

The Integrating Digital Epigraphies (IDEs) project aims to build on the lessons learned in the course of developing the Papyri.info project. The differences in the digital landscape between Greek Epigraphy and Papyrology are considerable. The main one is that, whereas many of the partner projects of Papyri.info were happy to permit that site to aggregate their data, IDEs will not be able to host partner data; it will instead collect citation and linking information, with the goal of improving the links between the different epigraphical sites. This paper will discuss the project, hosted at http://ides.io, and the part played in it by the Linking Ancient World Data Ontology, which may be found at http://lawd.info.

What is a citation? It is a sequence of characters in a text that refers to something the reader may wish to consult. A citation is obviously a pointer of some sort, but what is it a pointer to? It depends: a citation like Il 1.1 is a reference to book 1, line 1 of the Iliad, that is, it refers, notionally, to an ideal or composite Iliad, not to a particular expression. It is assumed that the first line of book 1 will be more or less the same in any edition. A citation like IG I³ 40 on the other hand, is to a particular edition: number 40 in the third edition of the first volume of Inscriptiones Graecae. It points therefore to a particular part of a larger work. S.C. de Bacch., to take a third example, is a citation of an actual inscription, not a publication of that inscription. Here, the citation refers to a physical object, and the reader is expected to be able to find a text of it if they wish to read the document.

Given all this, we must conclude that if a citation is a pointer, it is a very vague one. It may point to an abstract work, an actual (perhaps online) edition or part of it, or a real-world object with text written on it, but which the reader of the citation probably isn’t expected to go and read in person. Further complicating the situation, the referring strings that comprise citations are subject to different formatting conventions: IG I³ 40, and IG I[3].40 refer to the same edition, for example. Even more extreme variation is entirely possible, depending on the conventions used by the publications citing the source.
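The formatting variation mentioned above can be made concrete with a small Python sketch that maps the two spellings given in the text to a single matching key. It is not the IDEs parser: the function name and the two rewriting rules are assumptions, deliberately narrow, and would not cope with the more extreme variation the paragraph warns about.

```python
import re

# Superscript digits and their plain equivalents.
SUPERSCRIPTS = str.maketrans("⁰¹²³⁴⁵⁶⁷⁸⁹", "0123456789")


def normalize_citation(raw: str) -> str:
    """Return a matching key for a citation string (hypothetical rules)."""
    text = raw.strip()
    # Rule 1: write superscript edition numbers in brackets, so I³ becomes I[3].
    text = re.sub(
        r"[⁰¹²³⁴⁵⁶⁷⁸⁹]+",
        lambda m: "[" + m.group().translate(SUPERSCRIPTS) + "]",
        text,
    )
    # Rule 2: a dot directly after a bracketed edition number is a separator.
    text = re.sub(r"\]\s*\.\s*", "] ", text)
    # Collapse any remaining runs of whitespace.
    return re.sub(r"\s+", " ", text)


for variant in ("IG I³ 40", "IG I[3].40", "  IG  I³   40 "):
    print(repr(variant), "->", repr(normalize_citation(variant)))
```

The rules leave strings such as Il 1.1 or S.C. de Bacch. untouched, since their dots are part of the reference or the abbreviation rather than separators after an edition number.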

How do we model this situation then? Citations are strings that indicate resources or parts of resources, which may or may not be abstractions and may or may not have published editions, translations, etc. The LAWD Ontology approaches this by modeling the Citation as its own RDF Class, which may represent a written or conceptual work. Citations may have a value that is either a string, a URI, or both, and other properties may be attached to them as well.
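A citation that carries a string value, a URI value, or both can be sketched in Python as follows. The sketch is not an excerpt from the LAWD Ontology: the class IRI http://lawd.info/ontology/Citation is an assumed expansion of the Citation class named above, `rdf:value` is used here as a stand-in for whatever property LAWD actually defines, and the citation node URI is invented.

```python
import json
from dataclasses import dataclass
from typing import List, Optional, Tuple

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
LAWD_CITATION = "http://lawd.info/ontology/Citation"  # assumed IRI

Triple = Tuple[str, str, str]


@dataclass
class Citation:
    node: str                     # URI identifying the citation itself
    text: Optional[str] = None    # the referring string, e.g. "IG I³ 40"
    target: Optional[str] = None  # URI of the thing cited, if known

    def __post_init__(self) -> None:
        if self.text is None and self.target is None:
            raise ValueError("a citation needs a string value, a URI value, or both")

    def triples(self) -> List[Triple]:
        """Return (subject, predicate, object) terms; literals are quoted."""
        subject = f"<{self.node}>"
        out = [(subject, f"<{RDF}type>", f"<{LAWD_CITATION}>")]
        if self.text is not None:
            literal = json.dumps(self.text, ensure_ascii=False)  # quote and escape
            out.append((subject, f"<{RDF}value>", literal))
        if self.target is not None:
            out.append((subject, f"<{RDF}value>", f"<{self.target}>"))
        return out


citation = Citation(
    "http://example.org/citations/1",  # hypothetical node URI
    text="IG I³ 40",
    target="http://ides.io/browse/ides:t000003n",
)
for triple in citation.triples():
    print(" ".join(triple), ".")
```

Treating the citation as a node in its own right means that further properties, such as the publication in which the citation appears, can be attached to it without touching the work or object it points to.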

In IDEs, Citations are the main component of the project. IDEs works by identifying epigraphic citations from a variety of sources, including the Packard Humanities Institute (PHI) site, the Diccionario Griego-Español’s Claros project, and the Supplementum Epigraphicum Graecum (SEG). PHI contains the text of editions of Greek inscriptions, Claros collects pairs of citations where one source cites or updates another, and SEG publishes a kind of annotated bibliography of epigraphic publications. All of these manifest different citation practices, even though they deal with largely the same body of material. IDEs attempts to parse the citations from each project, match them up when that is possible, and present an interface and APIs that permit projects with epigraphic citations to retrieve related material easily. For example, if PHI wishes to find and display citations related to the inscription to which it assigns ID number 40 (IG I³ 40), it can query the URI http://ides.io/browse/ides:phi:40, which resolves to http://ides.io/browse/ides:t000003n (the IDEs ID or IDEst of the inscription itself), and from there retrieve machine-readable data in JSON, RDF, or JSON-LD formats that could be incorporated into its own page at http://epigraphy.packhum.org/text/40. It would be possible, for example, to link to related articles in SEG, to display additional bibliography, to link to the corresponding place entry in Pleiades, and, when we have incorporated data from JSTOR, to link to articles that mention IG I³ 40.

IDEs itself is a property graph database which uses RDF semantics without relying on an underlying RDF database implementation. This approach is intended to support commenting on and editing the relationships between the entities it tracks. RDF by itself has trouble attaching extra information to triples, requiring reification or the use of named graphs to do so. We hope that by imposing RDF semantics on a property graph structure, we can attach additional metadata and commentary to relations without sacrificing speed or expressive capability.
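The general idea of a property graph that can still be read as RDF triples can be sketched in a few lines of Python. This is not the IDEs implementation: the classes, the `ex:identifies` predicate and the `comment` and `editor` property names are hypothetical, and only the two identifiers are taken from the example in the text.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple


@dataclass
class Edge:
    """A relation between two nodes that can carry its own properties."""
    subject: str
    predicate: str
    obj: str
    properties: Dict[str, str] = field(default_factory=dict)


class PropertyGraph:
    def __init__(self) -> None:
        self._edges: List[Edge] = []

    def relate(self, subject: str, predicate: str, obj: str, **properties: str) -> Edge:
        """Add a relation, optionally with metadata attached to the edge."""
        edge = Edge(subject, predicate, obj, dict(properties))
        self._edges.append(edge)
        return edge

    def triples(self) -> Iterator[Tuple[str, str, str]]:
        """RDF-style view: every edge reads as one plain triple."""
        for edge in self._edges:
            yield (edge.subject, edge.predicate, edge.obj)

    def edges_about(self, node: str) -> List[Edge]:
        """All relations in which the node is subject or object."""
        return [e for e in self._edges if node in (e.subject, e.obj)]


graph = PropertyGraph()
link = graph.relate("ides:phi:40", "ex:identifies", "ides:t000003n",
                    comment="matched on the citation IG I³ 40")
link.properties["editor"] = "reviewer-1"  # commentary added later, in place

print(list(graph.triples()))
print(graph.edges_about("ides:t000003n")[0].properties)
```

In plain RDF the comment and editor would require a reified statement or a named graph; here they sit directly on the edge, while `triples()` still exposes the relation as an ordinary subject, predicate and object.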