DH 2016 Abstracts

Social Networks and Archival Context: People and Cultural Heritage

1. Overview

Social Networks and Archival Context (SNAC), initiated in 2010 as an R&D project, is now being transformed into an international cooperative. SNAC’s original research objective was to demonstrate that descriptions of people, embedded in the descriptions of historical records that document their lives, could be extracted and used to reveal the social networks within which their lives were lived and provide integrated access to geographically dispersed historical records. SNAC’s early success led to plans to establish a sustainable international cooperative to maintain and expand these descriptions of people. The long-term technological objective is a platform to support a continuously expanding, curated corpus of reliable biographical descriptions of people linked to, and providing contextual understanding of, the historical records that function as primary evidence for understanding their lives and work. The SNAC Cooperative will benefit librarians, archivists, and researchers. It will provide traditional historical researchers with integrated access to distributed historical records and the social contexts within which the records were created and used. It will provide prosopographic researchers with methods for reconciling and establishing reliable social networks, and will enable archivists and librarians to share descriptive data while also making descriptions more effective.

2. The Archival Description Source Data

Archival description source data encompasses both descriptions of historical records and authority data for corporate bodies, persons, and families documented in historical records. OCLC WorldCat, sixteen archival consortia (representing hundreds of individual repositories), over thirty repositories, and two digital humanities research projects contributed their source data to SNAC. The holdings of over 4000 repositories are represented:

190,000 finding aids, contributed by fifteen consortia and over thirty repositories in the U.S., the ArchivesHub in the U.K., and the Bibliothèque nationale de France (Catalogue Collectif de France (CCFr) and BnF archives et manuscrits)
2.25 million OCLC WorldCat archival descriptions
400,000 authority records contributed by NARA (93,051), the British Library (297,731), the Smithsonian Institution Archives (2,083), the New York State Archives (258), and the Archives nationales, France (2,350)
30,000 correspondent names from the Joseph Henry Papers Project, Smithsonian Institution Archives
2,332 correspondents from The Walt Whitman Archive
1,200 names associated with the Chaco Research Archive

3. Data Processing

During SNAC’s R&D phase (2010-2015), this source data was processed in three distinct steps.

Biographical and historical data was extracted or migrated from existing archival descriptions and assembled into standardized descriptions that identify and document organizations, persons, and families based on an international archival communication standard, Encoded Archival Context – Corporate Bodies, Persons, and Families (EAC-CPF). Each EAC-CPF identity description includes the description of the entity as such (names, life dates, biographical information, etc.), links to descriptions of the historical records from which the data was derived, and links to other identities found in the same source. These links provide the foundation for assembling a vast social-document network or graph.
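The shape of such an identity description, and the way its links yield a graph, can be sketched as follows. This is a minimal illustration only: the class, its field names, and the sample identifiers are hypothetical and do not reproduce the EAC-CPF schema or SNAC's actual data model.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class IdentityDescription:
    """Hypothetical stand-in for one identity description (not the EAC-CPF schema)."""
    identifier: str
    entity_type: str                      # e.g. "person", "corporateBody", "family"
    names: list[str]
    life_dates: str = ""
    biography: str = ""
    # Descriptions of historical records from which the data was derived.
    resource_links: list[str] = field(default_factory=list)
    # Other identities found in the same source.
    related_identities: list[str] = field(default_factory=list)


def build_graph(descriptions):
    """Return (identity-to-identity adjacency, identity-to-resource adjacency)."""
    social = defaultdict(set)
    documents = defaultdict(set)
    for d in descriptions:
        for other in d.related_identities:
            # Co-occurrence in a source is treated here as an undirected link.
            social[d.identifier].add(other)
            social[other].add(d.identifier)
        for resource in d.resource_links:
            documents[d.identifier].add(resource)
    return social, documents


# Invented sample data, for illustration only.
a = IdentityDescription("id-1", "person", ["Example, Ada"],
                        resource_links=["finding-aid-A"],
                        related_identities=["id-2"])
b = IdentityDescription("id-2", "corporateBody", ["Example Society"],
                        resource_links=["finding-aid-A", "finding-aid-B"])

social, documents = build_graph([a, b])
```

Here `social` holds the person-to-person (or person-to-organization) edges and `documents` holds the edges from identities to record descriptions; together they form the two layers of a social-document graph.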

The EAC-CPF identity descriptions were matched (identity reconciliation) against one another and against descriptions in the Virtual International Authority File, combining records that identify the same entity, to produce a set of unique EAC-CPF records.
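Once pairs of records have been judged to identify the same entity, combining them into a set of unique records amounts to grouping records into clusters. One common way to do this is a union-find (disjoint-set) structure, sketched below. This is an assumption about how such merging could be done, not a description of SNAC's implementation; the function name and record identifiers are hypothetical.

```python
def merge_matches(record_ids, matched_pairs):
    """Group record ids into clusters, given pairs judged to be the same entity."""
    parent = {r: r for r in record_ids}

    def find(x):
        # Follow parent pointers to the cluster root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for r in record_ids:
        clusters.setdefault(find(r), []).append(r)
    return list(clusters.values())


# Invented sample: r1 matches r2, and r2 matches r3, so all three merge.
records = ["r1", "r2", "r3", "r4", "r5"]
pairs = [("r1", "r2"), ("r2", "r3")]
clusters = merge_matches(records, pairs)
```

Matching is transitive under this approach: a single false-positive pair can fuse two otherwise distinct clusters, which is one reason the quality of the pairwise matching matters so much.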

We developed a prototype access system, based on Extensible Text Framework (XTF), open source software from the California Digital Library. It has three major functional components: 1) display of the EAC-CPF records; 2) sophisticated searching and exploration of the EAC-CPF records; and 3) exposing the data to enable third-parties to access and use it in other applications.

3.1. Extracted or Migrated Data

The first step resulted in 6,719,064 EAC-CPF records (EAC-CPF is an archival encoding standard hosted by the Society of American Archivists and developed in collaboration with the international archival community):

4,653,365 Persons
1,868,448 Corporate Bodies
197,251 Families

3.2. Merged Data

After identity resolution processing (matching and merging), we had:

3,741,262 EAC-CPF records
- 2,466,425 persons
- 1,077,588 corporate bodies
- 197,249 families ¹
7,966,737 links between the 3,741,262 persons, corporate bodies, and families
15,031,209 links to 2,079,504 unique resource descriptions
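The figures reported in sections 3.1 and 3.2 can be cross-checked with a few lines of arithmetic. The sketch below uses only the counts given in this abstract; the variable names are illustrative, and the derived ratios are simple averages, not statistics reported by the project.

```python
# Counts as reported in sections 3.1 (extracted) and 3.2 (merged).
extracted = {"persons": 4_653_365, "corporate_bodies": 1_868_448, "families": 197_251}
merged = {"persons": 2_466_425, "corporate_bodies": 1_077_588, "families": 197_249}

total_extracted = sum(extracted.values())
total_merged = sum(merged.values())

# The per-type counts add up to the reported totals.
assert total_extracted == 6_719_064
assert total_merged == 3_741_262

# Share of records eliminated by matching and merging, overall and per type.
overall_reduction = 1 - total_merged / total_extracted
reduction_by_type = {k: 1 - merged[k] / extracted[k] for k in extracted}

# Average links per merged record, counting each link once.
identity_links = 7_966_737
resource_links = 15_031_209
identity_links_per_record = identity_links / total_merged
resource_links_per_record = resource_links / total_merged

print(f"overall reduction: {overall_reduction:.1%}")
print(f"identity links per record: {identity_links_per_record:.2f}")
print(f"resource links per record: {resource_links_per_record:.2f}")
```

Merging removes roughly 44% of the extracted records overall. The family count changes by only two because, as the note explains, family names were not matched and two were rejected as malformed.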

3.3. Prototype History Research Tool

The prototype history research tool (http://socialarchive.iath.virginia.edu/snac/search) allows researchers to find persons, organizations, and families; to read biographical information about them; to explore the social networks within which they existed; and to locate historical records that document their lives, along with related resources and external links associated with that name. Associated links are provided for ArchivesGrid and Digital Public Library of America, as well as “sameAs” links to Wikipedia, VIAF, WorldCat Identities, and others.

4. Significance for Researchers

Researchers have welcomed SNAC for its research economies: SNAC’s History Research Tool provides integrated access to distributed primary (archival) and secondary (published) resources, eliminating or at least substantially reducing the need to track down resources in multiple archival catalogs. Painstakingly locating these resources is a labor-intensive, time-consuming activity in the current research environment, with successful discovery and assembly of the data highly dependent on persistence and serendipity. Indeed, it is likely that some of the information found in the SNAC records might never be discovered using current methods. SNAC also makes explicit what has been, at best, implicit in archival description: the social-professional-intellectual networks within which the lives and work of the people documented in historical resources took place. It exposes the vast global social-document network that connects the past to the present. Ed Ayers, President of the University of Richmond and a Civil War historian, wrote that:

SNAC promises to change the way history is imagined and written! For all that the digital revolution has revolutionized, the heart of research lies within the primary record embedded in archives large and small. The pioneering work of SNAC will unlock that record, revealing connections and patterns invisible to us now.

Alan Liu, Professor of English, University of California, Santa Barbara and Director of Research Oriented Social Environment (RoSE), describes SNAC’s potential:

SNAC employs state-of-the-art computational techniques to do three things very well: 1) unlock information originally recorded for specific purposes in library and other archival finding aids to make them usable in new contexts; 2) connect widely-distributed information of this sort from around the world; and 3) marry the “library” or “archive” model of knowledge to a whole other model of social networks that both humanizes our understanding of the way knowledge emerges from communities of knowledge creators and seekers, and speaks powerfully to today’s “social network” generation.

5. Significance for Prosopographical Research

SNAC is building a humanities resource that benefits humanities researchers, but ongoing development and refinement of identity reconciliation techniques are of further benefit to humanists engaged in prosopographical research. Names alone are weak identifiers: multiple people can have the same name and one person may have multiple names. A number of factors influence our ability to reliably identify people. Indeed, the larger the domain from which names are drawn, the higher the likelihood that a name is shared by several people.

Though each step in the processing described above presents intellectual and technical challenges, the most challenging is identity reconciliation. A fundamental human activity in the development of knowledge involves the identification of a unique “real world” entity (e.g., a person or book) and recording facts that, when taken together, uniquely distinguish that entity. Establishing the identity of a person, for example, involves examining available evidence, including the existing knowledge base, and recording facts associated with him or her (such as names, dates and places of birth and death, occupation, etc.). This is an ongoing, cumulative activity that both leverages existing established identities and establishes new identities. Identity reconciliation is the process by which an encountered identity is compared against established identities, and if not matched, is itself contributed to the established base of identities. The networked computing environment presents opportunities for using algorithm-based inference methods to compare newly encountered entities with established identities to determine the probability that a new entity represents the same person or thing as an established identity. This ongoing expansion of the base of reliable identities is an interplay of human research, knowledge recording, and computational methods.
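The compare-then-contribute loop described above can be sketched as follows. This is a deliberately simple heuristic under stated assumptions, not SNAC's algorithm: the score combines a string similarity from Python's `difflib` with a hand-picked date factor, it is not a calibrated probability, and the weights, threshold, record layout, and sample names are all invented for illustration.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase a name and drop commas and periods so trivial variants compare equal."""
    return " ".join(name.lower().replace(",", " ").replace(".", " ").split())


def match_score(candidate, established):
    """Heuristic evidence score in [0, 1]; higher means more likely the same entity."""
    name_similarity = max(
        SequenceMatcher(None, normalize(a), normalize(b)).ratio()
        for a in candidate["names"]
        for b in established["names"]
    )
    c_dates, e_dates = candidate.get("dates"), established.get("dates")
    if c_dates and e_dates:
        date_factor = 1.0 if c_dates == e_dates else 0.2   # conflicting dates count heavily
    else:
        date_factor = 0.6                                   # missing dates: weak evidence
    return name_similarity * date_factor


def reconcile(candidate, base, threshold=0.85):
    """Match candidate against the established base, or contribute it as a new identity."""
    best = max(base, key=lambda e: match_score(candidate, e), default=None)
    if best is not None and match_score(candidate, best) >= threshold:
        best["names"] = sorted(set(best["names"]) | set(candidate["names"]))
        return best
    base.append(candidate)
    return candidate


# Invented sample data: same name, different life dates, then a punctuation variant.
base = [{"names": ["Smith, John"], "dates": ("1800", "1870")}]
other_person = reconcile({"names": ["Smith, John"], "dates": ("1901", "1980")}, base)
same_person = reconcile({"names": ["Smith, John."], "dates": ("1800", "1870")}, base)
```

In this sample the first candidate shares a name with the established identity but has conflicting dates, so it is contributed as a new identity; the second differs only in punctuation and agrees on dates, so it is matched and its name variant is recorded. A name alone, without corroborating dates, scores below the threshold here, reflecting the point that names are weak identifiers.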

6. Transforming SNAC into an International Cooperative

It became clear early on that the biographical data extracted and assembled from archival resource descriptions constituted a valuable independent resource that could (and should) be maintained and further developed cooperatively. Development of a cooperative began in 2011, and it recently entered its pilot phase with a group of fourteen inaugural institutional members that support the potential benefits of aggregated description and access demonstrated to date in SNAC and, further, embrace the idea that the resources amassed should be cooperatively built and maintained in order to fully realize these benefits. The initial members represent research archives, libraries, museums (art and natural history), government archives, and institutional archives. The U.S. National Archives and Records Administration (NARA) serves as the secretariat for the Cooperative, while the Institute for Advanced Technology in the Humanities (IATH), University of Virginia, hosts the technological infrastructure. SNAC is led by IATH, working collaboratively with NARA, the California Digital Library, and the iSchool at UC Berkeley. The National Endowment for the Humanities (2010-2012), the Institute of Museum and Library Services (2011-2013), and the Andrew W. Mellon Foundation (2012-2017) have provided funding for SNAC.

Notes

¹ Because family names, as traditionally formed, lack sufficient qualifying information and thus commonly result in false positives, no matching was done against family names. In the final production, two family names were rejected as malformed.