DH 2016 Abstracts

Poetry in Prose: automatic identification of verses in brazilian literature

In 1946, the brazilian poet Guilherme de Almeida published a study on the structured patterns of verses that he discovered in the prose of ‘Os Sertões’ (‘Rebellion in the Backlands’) by Euclides da Cunha (1902). According to Almeida’s work, there is, in the Euclidean prose, apparently more often at the end of the paragraphs, versification structures of various rhythmic patterns. In 1996, another study on the same literary work was published by Augusto de Campos validated Almeida’s discovery and revealed several others versified patterns in Euclidean prose. Dodecasyllables and Alexandrines are among the most used metric patterns, in varied combinations and positions. The diversity of patterns found, disregarding “the strict metrification” and admitting “more rhythmic freedom” (Campos, 1996), creates surprising zones of tension, “areas spread with poetry in significant portions of poetry in his prose” (Campos, 1996).

The process of separating and nesting of poetry syllables, the scansion process, is usually applied to text structures categorically defined as poetry, allowing mapping of poetry metric characteristics present in the author’s writing. Performed by a person, the same analysis carried out by Almeida and Campos require, depending on the size of the piece, hours, days or even months of work.

Our work proposes the use of computational techniques to perform the process of automated scansion and analysis of Euclides da Cunha’s prose, revealing its verse structures, thus reducing time for the task, providing a new tool for prose analysis and opening a new research agenda. These verse structures, distributed along the text, are found using computational methods based on scansion rules for Portuguese language. As the location of these structures are not previously given, any sentence is treated as a potential candidate for a verse, moreover segments of the sentences can also be considered.

In order to identify metric verses in the text, our system performs four major steps: extraction of sentences, separation of syllables, scansion, and overlay of verses in the original text. From a digital copy of the book, sentences are extracted according to punctuation mark present in literary piece. In Portuguese, the rhythm or musicality of a verse follows the alternance of strong (tonic) syllables and weak (atonic) syllables, so along with syllable boundaries, the position of the tonic syllable is also identified for every word. Therefore, every word in each sentence undergoes syllable separation following grammatical rules, applying the software developed by Neto et al (2015), defining initial syllable boundaries. Besides the positions of tonic syllables are also identified, determining rhythmic features.

To identify verses, the final process of scansion is performed, considering intravocabular (syneresis and diaeresis) and intervocabular (elision and crasis) phonological changes. These changes may alter initial syllable count, for example with the omission of one or more sounds, merging two syllables in a single one. As a final output, the verses identified are overlaid on the original document, along with verse classification, metric count, syllable separations and tonic syllables position, replacing the original sentence, allowing analysis by the user in context.

Initial experiments with the proposed system were performed for the book ‘Os Sertões’ by Euclides da Cunha, aiming to reproduce in a computer lab the work performed by Guilherme de Almeida in the 40s and Augusto de Campos in the 90s. As an example of the results, in page 67, the system identified previously twenty-four candidates verses. Of these, we have, "O sertanejo é, antes de tudo, um forte", one segment of text that starts the third chapter, identified by our system as a dodecasyllable "O / ser/ta/ne+/jo é+,/ an+/tes/ de/ tu+/do/ um/ for+/te", where ‘+’ identifies a tonic syllable, and, starting the third paragraph, "É o homem permanentemente fatigado.'', a autonomous paragraph which was also identified by our system as a dodecasyllable, “É+ / o ho+/mem / per/ma/nen/te/men+/te / fa/ti/ga+/do.”. Both verses were indicated by Augusto de Campos in his work. Overall, among dodecasyllables and decasyllables, considering only whole sentences as a possible verse, our system was able to identify 273 verses, many of them already manually identified by Almeida or Campos. Nevertheless, previous works on the annotation of verses in the book ‘Os Sertões’ were not comprehensive annotations, so it is not possible to have full statistics on accuracy. Nevertheless, our system has exactly identified 52% of Augusto de Campos’s annotated verses, and the other 48% of the verses were identified with a difference of one unit in syllable count, due to differences in elision.

Bibliography

Almeida, G. (1946). A poesia d’Os Sertões. Diário de São Paulo. August 18.
Neto, N., Rocha, W. and Sousa, G. (2015). An open-source rule-based syllabification tool for Brazilian Portuguese. Journal of the Brazilian Computer Society,21(1): 1-10.
Campos, A. (1996). TRANSERTÕES. Folha de São Paulo. November 3.