DH 2016 Abstracts

Diplomatic History by Data: Understanding Cold War Foreign Policy Ideology using Networks and NLP

What is diplomatic culture or “rhetoric” and can we measure it?

This project attempts to understand quantitatively the language and structure of U.S. diplomacy as a bureaucratic institution through analysis of a large corpus of diplomatic papers. Since the “cultural turn” in history in the late twentieth century, historians have produced cultural interpretations of American diplomacy that highlight gender and racial influences and justifications in diplomatic decision making, but there have been no attempts to measure longue durée changes or to quantify them by a standardized measure. [1] This work hopes to fill this gap by introducing a new method of analyzing time-dependent bodies of text. I then apply this method to a corpus of diplomatic papers to systematically chart changes in concepts and ideology detectable in diplomatic language.

The project has two aims. First, I am designing a method of measuring how a concept, as a fixed variable, evolves through a temporal corpus. GloVe and Dynamic Topic Modeling are two existing approaches that can be used to understand linguistic shift and topic distributions over time. [2] I argue, however, that these tools are not yet adequate for time-sensitive tasks such as tracing concepts over time, and I attempt to design a new method optimized for this task. Second, mindful of the particular characteristics of this corpus as a set of diplomatic papers, I want to apply the most appropriate methods of analysis and engage with the claims historians have made about Cold War ideology in the existing historiography. I am trying to answer three questions: What is “Cold War rhetoric” in high-level diplomacy? How did it evolve over the decades? And how was it propagated within the diplomatic institution? I approach these questions from two methodological angles: networks and NLP.

1. Dataset

The dataset for this analysis is the entire text corpus of the Foreign Relations of the United States (FRUS, 1861-1980), a collection of published declassified diplomatic papers. The FRUS collection is hand-curated by librarians at the Office of the Historian to be a representative and comprehensive sample of American diplomatic history and has been a dependable primary source base for historians and social scientists for decades. The document types include telegrams, airgrams, notes, and memoranda, among others. Most documents have accompanying meta-data such as the name of the sender, the name of the recipient, the location of the sender, and the date, when applicable. So far, I have focused on a subset of this corpus, consisting of all papers from 1948 to 1980, to analyze the high Cold War era. This subset is also what is available online in XML format with hand-tagged meta-data and reliably edited text. About 92,000 documents were available for analysis. [3]

A notable caveat of this work is that the corpus, while presumably representative, is still a non-random sample and could carry both deliberate and unintended biases of the librarians. The FRUS becomes publicly accessible upon publication and has a fixed scholarly audience. There is an inevitable feedback loop in which the curators respond to the requests of the prime users of the collection, for example by adjusting the proportion of the most “useful” types of documents. For instance, because the Vietnam War is a highly contested field of scholarship, curators may include a higher ratio of papers surrounding the Vietnam War, thus distorting the overall representation of topics by exaggerating the war’s significance. This possible limitation is something to keep in mind throughout the analysis of this corpus.

Image A: Example of FRUS document; Stuart as the Ambassador in China to the Secretary of State sent from Nanking on April 25, 1949

2. Analysis

Descriptive summaries of the meta-data from the corpus show several clear distributive patterns. Of the total set of documents, 42,000 were correspondence (documents with both a sender and a recipient), from which I parsed all available meta-data (name of sender, name of recipient, location of sender, date) and inferred missing data based on historical knowledge. The results show that the Department of State (DOS) as a bureaucratic institution is highly centralized around key actors. Not surprisingly, the prime location of correspondence origin is Washington, and its overwhelming predominance indicates that the DOS correspondence system was used for sharing information from the center to the peripheries. Similarly, the top senders of correspondence were concentrated in high administrative positions: the Secretary of State (SS), the Department of State (DOS) administrative center, and the National Security Advisor (NSA). The individuals with the highest correspondence authorship are therefore those who held SS or NSA positions, such as Dean Acheson (SS), John Foster Dulles (SS), Henry Kissinger (SS, NSA), and Walt Rostow (NSA). The top recipients of correspondence are also the SS and DOS, confirming a largely bilateral relation between central and peripheral offices.
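The descriptive counts above can be illustrated with a minimal sketch. It assumes each correspondence has already been reduced to a small record of meta-data; the records, field names, and the helper `top_values` are hypothetical and are not taken from the FRUS files or from the project's actual code. Missing values are simply skipped here, whereas the project infers some of them from historical knowledge.

```python
from collections import Counter

# Hypothetical records: each dict stands in for the meta-data parsed from one
# correspondence. Field names and values are illustrative only.
records = [
    {"sender": "Secretary of State", "recipient": "Embassy in France", "origin": "Washington"},
    {"sender": "Secretary of State", "recipient": "Embassy in China", "origin": "Washington"},
    {"sender": "Embassy in China", "recipient": "Secretary of State", "origin": "Nanking"},
    {"sender": "National Security Advisor", "recipient": "President", "origin": "Washington"},
    {"sender": "Embassy in France", "recipient": "Department of State", "origin": None},
]

def top_values(records, field, n=3):
    """Return the n most frequent non-missing values of one meta-data field."""
    counts = Counter(r[field] for r in records if r.get(field))
    return counts.most_common(n)

top_origins = top_values(records, "origin")        # locations of correspondence origin
top_senders = top_values(records, "sender")        # most frequent senders
top_recipients = top_values(records, "recipient")  # most frequent recipients

print(top_origins)
print(top_senders)
```

Counting each field separately keeps the sketch close to the four summary graphs; a fuller version would also normalize name variants before counting.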

Graph 1A: Location of Correspondence Origin

Graph 1B: Top Correspondences sent from

Graph 1C: Persons that sent the most correspondences

Graph 1D: Top recipients of correspondences

I then mapped the network structure of correspondence to make the problem of bureaucratic “culture” more concrete and visual. In this abstract, I have included images of correspondence networks from two discrete periods: when Kissinger served as Secretary of State (1973-1977) and when Rostow served as National Security Advisor (1966-1969). From the maps we can see the overall flow of information and transfer of knowledge, based on the counts of correspondence and their directions. The maps confirm that the bulk of the correspondence indeed happens bilaterally between top administrative posts and peripheral agents. For instance, the DOS acts as the center point of correspondence for all embassies abroad, and the NSA does so for lower-ranking Washington-based administrative posts, such as the Under Secretaries of State. Further, two distinct communities emerge within the U.S. diplomatic institution. In both images, one can discern that the DOS and NSA act as distinct and separate focal points, while the SS connects the two camps of correspondence and acts as a bridging agent between the two communities.
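As a rough illustration of the structure described above, the following sketch builds a directed, weighted correspondence graph from sender-recipient pairs and looks for agents connected to both focal points. The letter list, node labels, and the two helper functions are hypothetical; "linked to both hubs" is used only as a simple stand-in for bridging, not as the measure behind the maps.

```python
from collections import Counter, defaultdict

# Hypothetical (sender, recipient) pairs, one per letter; not real FRUS data.
letters = [
    ("DOS", "Embassy Paris"), ("Embassy Paris", "DOS"),
    ("DOS", "Embassy Saigon"), ("Embassy Saigon", "DOS"),
    ("NSA", "Under Secretary"), ("Under Secretary", "NSA"),
    ("SS", "DOS"), ("DOS", "SS"),
    ("SS", "NSA"), ("NSA", "SS"),
]

# Directed, weighted edges: weight = number of letters sent from a to b.
edge_weights = Counter(letters)

# Undirected neighbor sets, recording who corresponds with whom at all.
neighbors = defaultdict(set)
for sender, recipient in edge_weights:
    neighbors[sender].add(recipient)
    neighbors[recipient].add(sender)

def bridging_agents(neighbors, hub_a, hub_b):
    """Nodes directly linked to both hubs (a simple proxy for bridging)."""
    return sorted((neighbors[hub_a] & neighbors[hub_b]) - {hub_a, hub_b})

def reciprocity(edge_weights):
    """Share of directed edges whose reverse edge also exists."""
    reciprocated = sum(1 for (a, b) in edge_weights if (b, a) in edge_weights)
    return reciprocated / len(edge_weights)

bridges = bridging_agents(neighbors, "DOS", "NSA")
print(bridges, reciprocity(edge_weights))
```

A high reciprocity value corresponds to the bilateral pattern noted in the text; on real data one would more likely use a graph library with community detection and centrality measures.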

Graph 2A: Map of correspondence networks during the years Kissinger served as Secretary of State (1973-1977)

Graph 2B: Map of correspondence networks during the years Rostow served as National Security Advisor (1966-1969)

Chart A: Change in the top 10 GloVe neighbors of select words between the 1860s and the 1950s (‘economy’, ‘empire’, ‘freedom’, ‘european’)

With this structural framework, I am using a combination of NLP methods to trace given “concepts” and see how they have changed in meaning over time. I identify concepts as individual terms, such as “liberty,” or as collections of related terms (topics). Word vectors are the most intuitive method of tracing changes in word meanings. Global Vectors (GloVe) and other word vector models assume a corpus that is static in time, so I divided my corpus into decade-long chunks and worked with the assumption that language does not change in usage and meaning significantly enough to matter within a ten-year span. My results, which compare the nearest neighbors by Euclidean distance in the GloVe outputs for select concepts, show a qualitative difference in conceptual meaning between the 1860s and the 1950s. For instance, the concept “freedom” in the nineteenth century was associated with more poetic and romanticized terms such as “triumph” and “humanity,” whereas a century later it came to be linked with legal and defensive terms such as “right” and “safeguard.” Historians would contextualize this phenomenon in the American Civil War and the Civil Rights Movement, respectively. A question then arises: Were diplomatic agents using terms that reflected the popular language of their time, or were they propagating it themselves? Who in the DOS initiated the usage of these concepts in such ways? Can we use the networks mapped above to interpret this?
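The decade-wise neighbor comparison described above can be sketched as follows. This is a minimal illustration, not the project's pipeline: the two-dimensional vectors are made-up toy values chosen only to show the mechanics, whereas in practice a separate GloVe model would be trained on each decade's documents. The function names `decade_of` and `nearest_neighbors` are hypothetical.

```python
import math

def decade_of(year):
    """Map a year to the first year of its decade, e.g. 1949 -> 1940."""
    return (year // 10) * 10

def nearest_neighbors(word, vectors, k=2):
    """Return the k words closest to `word` by Euclidean distance."""
    target = vectors[word]
    distances = [
        (math.dist(target, vec), other)
        for other, vec in vectors.items()
        if other != word
    ]
    return [other for _, other in sorted(distances)[:k]]

# Toy vectors per decade. The numbers are invented for illustration and are
# NOT outputs of a model trained on the FRUS corpus.
vectors_by_decade = {
    1860: {"freedom": (1.0, 1.0), "triumph": (1.1, 0.9), "humanity": (0.9, 1.2),
           "right": (3.0, 3.0), "safeguard": (3.5, 2.5)},
    1950: {"freedom": (3.0, 2.8), "triumph": (1.1, 0.9), "humanity": (0.9, 1.2),
           "right": (3.0, 3.0), "safeguard": (3.5, 2.5)},
}

# Neighbors of one concept, computed independently inside each decade's space.
neighbors_by_decade = {
    decade: nearest_neighbors("freedom", vectors)
    for decade, vectors in vectors_by_decade.items()
}
print(neighbors_by_decade)
```

Because each decade's model would be trained separately, distances are only compared within a decade; what is compared across decades is the list of neighbors, not the raw coordinates.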

3. Discussion

This is a work in progress, and there is still much to be done in finding and developing new tools appropriate for time-sensitive text data. Given that history is a study of change and its significance, it is imperative that we do not assume a static distribution of words across time, which rules out many otherwise useful NLP tools. While Dynamic Topic Modeling treats time as a variable, it constructs a fixed number of topics from collections of words, making it less suitable for corpora with as little predictability and pattern as diplomatic papers. Unlike academic journals such as Science, to which Blei and Lafferty applied their model, diplomatic papers are much less consistent in topic. From my experience of implementing these tools on the FRUS corpus, I have also found that, because certain topics dominate discussion in diplomatic papers at certain dates, I need to separate these topical effects from purely semantic changes. For instance, in the 1940s the closest GloVe neighbor of “communism” is “chinese” or “ccp” because of the Chinese Communist Revolution of 1949, a result that reflects the events of the day rather than anything surprising about the meaning of “communism” in diplomacy.

Bibliography

[1] See how Hoganson introduces a gendered interpretation of U.S. involvement in the Spanish-American and Philippine-American Wars. Hoganson, Kristin L. (2000). Fighting for American Manhood: How Gender Politics Provoked the Spanish-American and Philippine-American Wars (Yale Historical Publications Series).
[2] Pennington, J., Socher, R. and Manning, C. D. (2014). GloVe: Global Vectors for Word Representation; Blei, D. M. and Lafferty, J. D. (2006). Dynamic Topic Models.
[3] This dataset is available as manually labeled XML files on GitHub (https://github.com/HistoryAtState/frus), which makes the text more reliable and the meta-data more accessible than in the OCR-scanned files of documents from 1861-1947.