Doing comparative researches on a large literary corpus is often lengthy and demanding. With the digitization of contents, scholars have started using computers. Semantic information is essential for an appropriate understanding of texts and is an extremely important factor in cross-textual analysis. This is why we have developed a new semantic engine built especially for Digital Humanities, called DeSeRT 1. This paper presents some of the results that were obtained from a corpus of the 17 th and 18 th century texts characteristic of the debates about theater in the classical age. The following paper is divided into four parts: after a first section that briefly describes the search engine used and the different investigations it allows, a second section introduces the corpus used. Then, the third section shows some of the results obtained.
DeSeRT has been designed to identify and compare rewriting, paraphrasing or reformulation. It is based on an idea, already developed (Barron-Cedeño et al., 2013; Ferrero and Simac-Lejeune, 2015), according to which, even if the reformulations cannot be reduced to paraphrases, they retain the meaning of original texts by using either the same words or words of similar meaning. As a consequence, the detection of co-occurrences of a few semantically equivalent lemmas in small blocks of texts is sufficient to capture the equivalent meaning and, therefore, to identify the reformulation of the same ideas.
This is implemented in four steps:
More precisely, the Okapi similarity measure of a block B with a set of terms T = t 1 , … t n is given by the following formula:
where f(t i , B) is the frequency of the term t i in the block B, | B| is the size of block B, k and b are two free parameters chosen in advance (usually chosen as k∈[1.2, 2] and b=0.75), avdl is the average document description length in the collection and IDF(t i ) is the Inverse Document Frequency of term t i is given by:
It appears that, when the size of each block is the same and when the frequency of terms in blocks is supposed to be at max 1 (this approximation is justified since the blocks are small), this formula can be simplified. It then becomes equivalent to the information theoretic measure of the terms in blocks, i.e.
where Pr(t i ) is the probability of the term t i in the overall corpus.
Using this score, it is possible to measure the similarity between blocks or between a block and a set T of terms t i. As a consequence, DeSeRT can be used in different ways. The research queries may be done through words or concepts, i.e. words that are expanded using the dictionary of synonyms. It is also possible to compare any text (or file) to a corpus: then, DeSeRT detects corpus blocks where the meaning is similar to blocks of the given text, which allows, for instance, the arguments in a dispute or the anecdotes and the common places that are reused to be followed. Lastly, it is also possible (but not mandatory) to add a thesaurus or ontology to focus the search on a given semantic field.
Note that, based on techniques developed to detect plagiarism, many tools already exist that are designed to identify paraphrases, reuses and borrowings, i.e. sequences of words that are approximately identical, e.g. (Ganascia et al. 2014; Horson et al. 2010). However, these techniques are unable to spot reformulations of the same ideas or allusions to previous texts. DeSeRT has been designed to overcome these limitations.
The project Haine du Théâtre 4 (“The Hatred of Theater” in French) aims to analyze theater disputes in Europe using scientific approaches and critical editions of polemical texts. The team’s reflections are mainly focused around the discovery of the circumstances and the arguments used in theater controversies all across Europe, not limited to France, but also in England, Spain, Italy, and the Germanic area, from the last decades of the 16 th century up to the beginning of the 19 th century.
The corpus of the project collects many texts written in French during the 17 th and the 18 th centuries. The purpose of this project is to explore the gray areas of theater controversies in order to outline a global overview and to discover where and how the polemics began, their chronological discrepancies and the links between them and their contemporary resurgences. The total collection of the Haine du Théâtre texts is, by now, made up of 27 texts.
Querying the 27 texts of the Haine du Théâtre corpus, we found much reuse of similar passages and texts. DeSeRT is not only useful for detecting those parts of text that deal with the same concept, but is also a very good tool to find borrowings.
For example, comparing two texts, the Défense of Voisin (1666) and Traité de la Comédie of Nicole (1667), we discover immediately that the Traité has been included in the Défense by Voisin, which is a very long text, not only once, but twice. The first time, Voisin presents it as a re-publication, then he re-uses phrases similar to those employed by Nicole in different passages and he sprinkles them in his Défense.
The keywords of this correspondence are very well detected by DeSeRT, as can be seen in the example below.
Furthermore, continuing the analysis of the text we discover that in these two texts the actor is frequently associated with the idea of purity in religion.
DeSeRT also shows in detail the parts of the corpus that are similar or that develop the same ideas. This may either be done on demand, according user requests, i.e. to given texts, or to the overall corpus, which automates the process.
For instance, we have found many topics common not only to the texts by Nicole and Voisin, but also to two others e.g. the theme of the idolatry as the “mother of all spectacles” in (Aubignac, 1666) and (Conti, 1666).
Note that, in the following figures, we have greyed the identical passages with the MEDITE system, which only spots strict homologies, without considering many words that appear identical to DeSeRT because they correspond to the same lemma or to two synonymous lemmas. As a consequence, the number of gray zones considerably underestimates the semantic proximity detected by DeSeRT.
Secondly, as shown in the following figures, the topic “renouncing the Devil” ( Renoncer au Diable in French) is regularly present in the texts written by Aubignac, Conti and Voisin, 5 while it appears only once in (Nicole, 1667: 477).
Further results of DeSeRT lead us to understand that many others expressions are common to the four authors, such as the description of the theatre as a “flesh of pestilence” ( chair de pestilence) or as a “school of the debauchery”.
To conclude, DeSeRT allows the discovery of reformulations and similar phrases, as well as related topics and passages in a corpus. It is usable on any kind of corpus that is made up by files in the txt format. As we have briefly shown in this paper, this search engine enables users to identify crucial passages of a specific corpus according to two types of detection: the discovery of reuses, such as plagiarism or hidden rewritings, and the reformulation of ideas, which can be manually given by the user (as words or concepts) or automatically extracted by the DeSeRT search engine.
A French version of DeSeRT is freely available online at http://obvil-dev.paris-sorbonne.fr/desert/
http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
http://sqlite.org/
D’Aubignac, 1666 : 59, 62, 65, 72, 73, 74, 79, 217. Conti, 1666 : 88, 105, 120, 144, 173, 182, 184. Voisin, 1671 : 59, 86, 88, 97, 113, 114, 124, 165, 205, 212, 228, 407, 427, 433, 451, 463, 481.