DH 2016 Abstracts

Testing Delta on Chinese Texts

Delta (Burrows, 2002) is a measure, which has already been proven as a reliable method to resolve authorship attribution problems in different languages such as English and German. However, there has been no report about the accuracy of Delta on Chinese texts so far. As such, I set an experiment to test it. The tests cover both modern and classical Chinese because of the grammatical and lexical differences between them.

First I determined whether Delta works on modern Chinese. After that I did tests on classical Chinese. At last, I tested the Dream of the Red Chamber (DRC, 红楼梦) ¹ . The number of authors of DRC is a classic question of Chinese literary studies. The tool I used in the experiments is “stylo”, an R package introduced in the context of stylometry in 2013 (Eder, 2013). Using “stylo” I have done cluster analysis. All texts of one author should stay in one group. Misplaced texts are considered as mistakes. The more mistakes Delta makes, the less Delta is appropriate for Chinese.

Working on Chinese language processing is different compared to languages like English. The greatest challenge lies that we are unable to recognize the boundary of words because there are no spaces between words. There are two possibilities to address this problem: (i) by using a segmenter to split a text into words and select words as the textual feature, or (ii) by selecting character N-grams as the feature. Both solutions were tested here and the results are presented as a comparison.

For my first experiment I gathered 45 modern Chinese texts from 6 authors. I used the Stanford segmenter to split the texts and select both words and characters as features. The results showed that Delta is reliable (Fig. 1). With the 100 most frequent words bigrams Delta correctly identifies 38 of 45 texts. The best results, 43 of 45 texts, occur with the 200 to 700 most frequent character bigrams or most frequent words unigram.

Fig. 1 Delta test results for four sets of features in 45 modern Chinese texts

After the tests on modern Chinese, I proceeded with my second experiment on classical Chinese. I took 4 chapters each randomly from 10 novels from the Ming and Qing Dynasties (16th to 19th century) and built a corpus of 40 documents. One problem was that the Stanford segmenter did not work anymore, because the segmentation standards are not suitable for classical Chinese. Hence the only option was to take characters as feature. The results showed that Delta also works (Fig. 2). While many mistakes occurred with characters trigrams, taking characters bigrams for the tests achieved a high level of accuracy. With 600 most frequent characters 39 of 40 documents were correctly identified.

Fig. 2 Delta test results for three sets of features in 40 classical Chinese documents

The first two experiments confirmed Delta as a valid measure for both modern and classical Chinese. In the third experiment Delta was applied to Dream of Red Chamber (DRC) ². As one of the most famous Chinese classic novels, DRC was written by Cao Xueqin (曹雪芹) during the 18th century. The first version had 80 chapters. However in 1791 Gao E (高鹗) and Cheng Weiyuan (程伟元) published another edition with 120 chapters. They claimed that theirs was the complete version. Since there, there has been a constant debate about the number of authors of DRC. Some scholars think that Cao penned all 120 chapters. Some beg to differ. According to a study by Hu Shih (胡适) (Hu, 1998), the first 80 chapters of DRC are the original work of Cao and the last 40 chapters are written by Gao. Hu’s research is now widely accepted in China. Some modern research approaches also suggest that the first 80 chapters and the last 40 chapters of DRC are written by two different authors. They also find evidence that Chapters 64 and 67 may not be written by Cao (Hu, 2014; Tu, 2013).

My experiment suggested the same conclusion as the other scholars that DRC is written by two different authors (Fig. 3). The texts were divided into two groups. Red texts represent the first 80 and green texts are the rest 40 chapters. Delta also suggests that Chapter 67 is written by the second author.

Fig. 3 Delta test results of DRC, (600 MFC, 2-grams)

Bibliography

Burrows, J. (2002). ‘Delta’: A measure of stylistic difference and a guide to likely authorship, Literary and Linguistic Computing, 17(3): 267-87.
Eder, M., Kestemont, M. and Rybicki, J. (2013). Stylometry with R: a suite of tools, Digital Humanities 2013: Conference Abstracts pp. 487-89.
Hu Shhi (1988). 胡适红楼梦研究论述全编 [Hu Shihs Analysis of Dream of Red Chamber], Shanghai Guji Chubanshe (Shanghai Classics Publishing House).
Hu, X., Wang, Y. and Wu, Q. (2014). Multiple Authors Detection: A Quantitative Analysis of Dream of the Red Chamber, Advances in Adaptive Data Analysis, 1450012.
Tu, H. C. and Hsiang, J. (2013). A Text-Mining Approach to the Authorship Attribution Problem of Dream of the Red Chamber, Digital Humanities 2013: Conference Abstracts, pp. 441-43.

Notes

I focussed on the Classic Delta in my work. Other variations of Delta like Eder’s Delta, Argamon’s Linear Delta and so on will not be tested.

According to Tu’s paper (2013) the DRC under http://cls.hs.yzu.edu.tw/hlm/read/TEXT/TEXT.ASP is „the closest to the earliest editions“, which was taken for my study.