Buntinx, V., Bornet, C., Kaplan, F. (2016). Studying Linguistic Changes on 200 Years of Newspapers. In Digital Humanities 2016: Conference Abstracts. Jagiellonian University & Pedagogical University, Kraków, pp. 751-752.

# Studying Linguistic Changes on 200 Years of Newspapers

## 1. Newspaper archives as a linguistic corpus

This research investigates methods to study linguistic evolution using a corpus of scanned newspapers. We use a corpus of 4 million press articles covering about 200 years of archives, thus documenting indirectly the evolution of written language. The corpus consists of digitized facsimiles of Le Journal de Genève (1826–1997) and La Gazette de Lausanne (1804–1997). For each journal, the daily scanned issues were algorithmically transcribed using an OCR system. The whole archive represents more than 20 TB of scanned data and contains about two billion words, putting it beyond the reach of most standard analysis techniques on regular desktop computers.

The corpus can be easily divided into subsets corresponding to the year of publication. However, the number of pages and their content fluctuate greatly depending on the year, ranging from about 280’000 words per year in the early 19th century to about 18 million in the later years of the 20th century. Figure 1 shows the relative size of each subset in terms of number of words for Le Journal de Genève (JDG) and La Gazette de Lausanne (GDL).

Figure 1: corpus size versus years for GDL (top) and JDG (bottom).

Considering the lack of data for Le Journal de Genève for the years 1837, 1917, 1918 and 1919, we left those out in all further graphs and analytics. In addition, some years had to be removed because the scanning quality was too poor (1834, 1835, 1859 and 1860 for JDG and 1808 for GDL).

## 2. Lexical kernels: Definition and basic measures

A straightforward approach to the problem consists in computing a textual distance between subsets of the corpora. One could, for instance, easily compute the so-called Jaccard distance (Jaccard 1901, Jaccard 1912) between two consecutive vocabularies. In the same way, other distances could also be tried, such as the Kullback-Leibler divergence (Kullback and Leibler, 1951; Kullback, 1987), the Chi-squared distance (Sakoda, 1981), and cosine similarity (Singhal, 2001).
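The Jaccard computation mentioned above can be sketched in a few lines. This is a minimal illustration, not code from the study; the word sets below are toy examples standing in for the yearly vocabularies extracted from the OCR'd corpus.

```python
def jaccard_distance(vocab_a: set, vocab_b: set) -> float:
    """Jaccard distance: 1 - |A ∩ B| / |A ∪ B|.

    0.0 means identical vocabularies, 1.0 means fully disjoint ones.
    """
    union = vocab_a | vocab_b
    if not union:
        return 0.0  # two empty vocabularies are trivially identical
    return 1.0 - len(vocab_a & vocab_b) / len(union)

# Toy vocabularies for two consecutive years (hypothetical words).
vocab_1900 = {"le", "journal", "guerre", "paix"}
vocab_1901 = {"le", "journal", "paix", "traité"}
print(jaccard_distance(vocab_1900, vocab_1901))  # 0.4
```

Applied to consecutive yearly subsets, this yields a drift curve over time; as noted below, such a curve is sensitive to the size of each subset.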

However, the uneven distribution of the corpus subsets (Figure 1) causes methodological difficulties for interpreting these distances. An increase in the lexicon size causes an indirect increase in the linguistic drift as measured by the Jaccard formula (Sternitzke and Bergmann, 2009). Under such conditions, it is difficult to untangle the effects of the unevenness of the distribution of subsets of the corpus from the actual appearance and disappearance of words.

These difficulties of interpretation motivate the exploration of another, possibly sounder approach to the same problem. Let us define a lexical kernel ${K}_{x,y,C}$ as the set of unique words that appear in every yearly subset of a corpus C over the period starting in year x and finishing in year y. ${K}_{1804,1998,GDL}$ is, for instance, the set of all words present in each yearly subset of La Gazette de Lausanne. It contains 5242 unique words that have been used for about 200 years. The kernel ${K}_{1826,1998,JDG}$ contains 7486 unique words, covering a period of about 170 years. As the covered period is shorter, the kernel is naturally larger.
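Under this definition, a kernel is simply the intersection of the yearly vocabularies over the chosen period. The following sketch (with invented toy data, not the actual corpus) illustrates the idea:

```python
def kernel(yearly_vocab: dict, x: int, y: int) -> set:
    """K_{x,y}: words present in every yearly vocabulary from x to y."""
    words = set(yearly_vocab[x])
    for year in range(x + 1, y + 1):
        words &= yearly_vocab[year]  # keep only words seen every year
    return words

# Hypothetical yearly vocabularies for three consecutive years.
yearly_vocab = {
    1804: {"le", "et", "guerre", "diligence"},
    1805: {"le", "et", "paix", "diligence"},
    1806: {"le", "et", "guerre", "télégraphe"},
}
print(sorted(kernel(yearly_vocab, 1804, 1806)))  # ['et', 'le']
```

Because the kernel only retains words attested in every single year, it is by construction far less sensitive to the size fluctuations of the yearly subsets than pairwise distance measures.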

The exact contents of both ${K}_{1804,1998,GDL}$ and ${K}_{1826,1998,JDG}$ are provided in the appendix. It is interesting to note that 4465 words are in common between the two kernels.

Figure 2 shows the statistical distribution of word-typologies for both kernels.

Figure 2: Distribution in terms of typologies of words contained in the kernels ${K}_{1804,1998,GDL}$ (left) and ${K}_{1826,1998,JDG}$ (right).

## 3. Word resilience

Extending the notion of a kernel, it is rather easy to study the resilience of a given word. Let ${R}_{d}$ be the union of all words contained in a kernel corresponding to a duration of $y-x\ge d$ years. For instance, ${R}_{100}$ contains all the words that maintain themselves in the corpus for at least 100 years. R subsets are organized as concentric sets, since any word maintained for at least $i+1$ years is also maintained for at least $i$ years: $...\subset {R}_{i+1}\subset {R}_{i}\subset ...\subset {R}_{2}\subset {R}_{1}$
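With yearly vocabularies, ${R}_{d}$ can be computed by finding, for each word, the longest span of consecutive years over which it stays in the vocabulary. The sketch below (toy data, hypothetical helper names, not the study's implementation) illustrates this and the nesting property:

```python
from collections import defaultdict

def max_span(yearly_vocab: dict) -> dict:
    """For each word, the longest duration y - x such that the word
    appears in every yearly vocabulary from year x to year y."""
    years = sorted(yearly_vocab)
    run_start = {}           # word -> start year of its current run
    best = defaultdict(int)  # word -> longest span found so far
    prev_vocab = set()
    for i, year in enumerate(years):
        vocab = yearly_vocab[year]
        consecutive = i > 0 and years[i - 1] == year - 1
        for word in vocab:
            # Start a new run if the word is new, the year sequence has
            # a gap, or the word was absent the previous year.
            if word not in run_start or not consecutive or word not in prev_vocab:
                run_start[word] = year
            best[word] = max(best[word], year - run_start[word])
        prev_vocab = vocab
    return dict(best)

def R(yearly_vocab: dict, d: int) -> set:
    """R_d: words maintained in the corpus for at least d years."""
    return {w for w, span in max_span(yearly_vocab).items() if span >= d}

# Toy data: "a" survives 2 years (1900-1902), "b" only 1 (1900-1901).
yearly = {1900: {"a", "b"}, 1901: {"a", "b"}, 1902: {"a"}}
print(sorted(R(yearly, 1)), sorted(R(yearly, 2)))  # ['a', 'b'] ['a']
```

The example makes the concentric structure visible: ${R}_{2}\subset {R}_{1}$, since surviving two years implies surviving one.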

The relative proportion of each subset sheds light on both the stability and dynamics of language change. Figure 3 shows the distribution of word resilience for both journals.

Figure 3: Size of ${R}_{d}$ versus the number of maintained years $d$ (logarithmic scale) showing the word resilience distribution for JDG (green) and GDL (blue).

The GDL resilience curve is normalized to the same year range as JDG in order to make the two curves comparable. This representation of ${R}_{d}$ shows a similar global word resilience trend for both JDG and GDL. However, we notice that the two curves intersect for the longest durations.

## 4. Discussion

Large databases of scanned newspapers open new avenues for studying linguistic evolution. However, these studies should be conducted with sound methodologies in order to avoid misinterpretation of artifacts. Common pitfalls include misinterpreting results linked to the size variation of the subsets or overgeneralizing results obtained on one particular newspaper corpus to general linguistic evolution.

In this paper, we have introduced the notion of a kernel as a possible approach to study linguistic changes under the lens of linguistic stability. Focusing on stable words and their relative distribution is likely to make interpretations more robust.

Results were computed on two independent corpora. It is striking to see that most of the results obtained are extremely similar for both. The kernels' composition in terms of grammatical word typologies is very similar, and so are the results in terms of word resilience. This suggests that our methods are indeed measuring general linguistic phenomena beyond the specificity of the corpora chosen for this study. However, this still needs to be confirmed with subsequent studies involving other corpora, such as non-journalistic texts and texts in other languages.