DH 2016 Abstracts

Morphology beyond inflection. Building a wordformation based dictionary for Latin

This short paper describes the initial phases of a Marie Curie Research Project, Word Formation Latin (WFL), developed at the Centro Interdisciplinare di Ricerche per la Computerizzazione dei Segni dell’Espressione (CIRCSE), at the Università Cattolica del Sacro Cuore, Milan, Italy. The project consists in the compilation of a derivational morphological dictionary of the Latin language, which connects lexical elements on the basis of wordformation rules, through the use of computational linguistic methods.

In the past two decades there has been a considerable increase in the creation of computational language resources for the investigation of classical languages, which have updated the state of the art almost to the same level as that of the resources currently available for modern languages.

However, among the existing language resources, we currently lack, for Latin, a morphological derivational dictionary that connects lexical elements on the basis of Word Formation Rules ¹ (WFRs).

A first attempt at constructing a lexicon based on wordformation for Latin was made by Marco Passarotti and Francesco Mambrini in 2012 (Passarotti & Mambrini, 2012). The WFL project has been awarded funding to expand on these efforts.

The project has three main aims:

the enrichment of an existing morphological analyser for the Latin language, LEMLAT (Passarotti, 2004), with wordformation information, and the integration of data within a interface similar to Word Manager (Domenig & ten Hacken, 1992), which has been already applied to other modern languages (English, German, Italian);
the integration of the information extracted from the resulting derivational morphological dictionary into the morphological layer of annotation the Index Thomisticus Treebank (IT-TB); ²
offering the results of the project work via a user-friendly project website which will display the derivational morphological dictionary through a web based search interface. This will allow the lexicon to be accessed:
1. by single lexical entry, which will show both the ancestors and their derived words;
2. by morphological family; ³
3. by WFR.

The project relies on the automatic realisation of the linguistic resource both at the level of WFRs creation and to their application on the lexical items included in the morphological analyser LEMLAT. The LEMLAT lexical basis contains around 40.000 lemmas from three major Latin dictionaries ( Georges, 1913-1918; Gradenwitz, 1904; Glare, 1982). We conceived WFRs according to the so-called Item-and-Arrangement model (IA), which follows a morpheme-based approach to morphology. In IA, word forms are analysed as arrangement of morphemes according to the following three axioms:

Roots and affixes have the same status of morphemes (Baudoin’s single morpheme hypothesis);
They are dualistic, as they have both a form and a meaning (Bloomfield’s sign base morpheme hypothesis);
They are stored in the lexicon (Bloomfield’s lexical morpheme hypothesis).

The aim is to assign a WFR to each morphologically complex lemma (i.e. one morphologically derived from another lemma) and to link each complex lemma to its ancestor. The data are organised and presented according to a system similar to that for morphological dictionaries devised by Word Manager, in which relations between the members of the same morphological family are represented in a tree-graph.

WFRs are grouped in two classes: 1. compounding; 2. derivational. Derivational rules are divided in two categories: a. affixal (in its turn split into prefixal and suffixal), and b. conversive, a derivation process that does not imply any affix; these are manually defined.

This happens in two steps:

1. Phase A: Semi-automatic data-driven finding of WFRs:

lemmas are divided into two classes, according to their part of speech and inflectional category;
two lists are produced: an incipitarium and an explicitarium, where lemmas are ordered according to a traditional alphabetical order, or to a right-to-left alphabetical order respectively;
prefixal and suffixal rules are created from the two lists respectively, part of speech and inflectional category of the lemma(s) are manually assigned to each candidate rule.

2. Phase B: Application (and evaluation) of the WFRs resulting from Phase A, and creation of the “morphological families”. New rules are added in this phase by confrontation with data. Phase B is divided into two subtasks:

each complex lemma is assigned a WFR. This task is performed by assigning in semi-automatic fashion to each (possibly) complex lemma its most likely WFR according to the PoS of the lemma and the string of its initial (prefixal rules) and final (suffixal rules) characters;
morphological families are built.

All those (morphologically simple, or complex ⁴) lemmas that share the same invariable part are automatically assigned to the same morphological family.

Finally, the members of each family are automatically linked to each other according to their PoS, inflectional category and affixes by means of the WFR assignment (2.a). The morphologically simple (i.e. not derived) lemma member is assigned the role of ancestor of the family.

Phase A finds the WFR, Phase B applies the WFR to data, obtaining input and output lemmas for each WFR.

Phase A is not to be considered exhaustive, but exploratory: the recall of WFR identified in Phase A is not 100%. The aim in the first phase of the project is to refine the data by tagging the highest number of lexemes using data driven WFRs, which will be increasingly complex, covering most well known wordformation issues. ⁵ Given the high number of homographs in Latin, this automatic procedure is regarded as non-ultimate for building the morphological families. However, it is helpful as it provides filtered data that must be checked manually.

This is why we need Phase B during which, by comparison with the evidence given by data, we can identify the rules that were missed in phase A. Manual hard-coding will be necessary for those lemmas produced by poorly productive WFRs, or morphotactically obscure wordformation processes. Evaluation of the language resource is performed by manual checking data organised into homogeneous groups based on WFRs (coverage of rules) and stemming (coverage of morphological families). Precision and recall are used as evaluation metrics in order to calculate the rate of positive and negative cases.

To date, 118 WFRs have been found automatically. Around 50 of these rules, those showing a certain degree of morphological transparency, hence easier to obtain through the automatic finding in the input-output relation (e.g. derivational, verb-to-verb, prefixal, etc.), have been added to a SQL database, and resulted in the tagging of some 9000 morphologically complex lexemes.

The final resource will be both a standalone dictionary accessible through its own website, and interconnected with the Index Thomisticus.

The integration with the IT-TB will be operated through the embedding of the dictionary data within the morphological layer of annotation of the treebank, using TEI (Text Encoding Initiative) P5 conformant XML encoding to favour data exchange and linking to other lexical resources. The data resulting from the dictionary, once encoded in XML, will be applied to the IT-TB data.

Bibliography

Busa, R. (1988). Totius latinitatis lemmata. Milano: Istituto Lombardo - Accademia di Scienze e Lettere.
Domenig, M. & ten Hacken, P. (1992). Word Manager: A system for morphological dictionaries. Hildesheim: Olms.
Forcellini, A. (1771). Totius Latinitatis lexicon, consilio et cura Jacobi Facciolati opera et studio Aegidii Forcellini, lucubratum. Patavii: typis Seminarii, 4 voll.
Georges, K. E. (1913-1918). Ausführliches Lateinisch-Deutsches Handwörterbuch. Hannover: Hahn.
Glare, P. G. W. (1982). Oxford Latin Dictionary. Oxford.
Gradenwitz, O. (1904). Laterculi Vocum Latinarum. Leipzig.
Index Thomisticus Treebank. http://itreebank.marginalia.it/.
LEMLAT. http://www.ilc.cnr.it/lemlat/lemlat/index.html.
Lewis, C. T. and Short, C. (1969). A Latin Dictionary. Oxford: At the Clarendon press.
McGillivray, B. and Passarotti, M. (2009). "The Development of the Index Thomisticus Treebank Valency Lexicon". In Proceedings of LaTeCH-SHELT&R Workshop 2009. Athens, March 30.
Passarotti, M. (2004). "Development and perspectives of the Latin morphological analyser LEMLAT". In A. Bozzi, L. Cignoni and J. L. Lebrave (Eds.), Digital Technology and Philological Disciplines. Linguistica Computazionale, XX-XXI, pp. 397-414.
Passarotti, M. and Mambrini, F. (2012). "First Steps towards the Semi-automatic Development of a wordformation-based Lexicon of Latin", In Proceedings of LREC 2012. Istanbul, Turkey, pp. 852-59.
TEI Consortium, (eds). TEI P5: Guidelines for Electronic Text Encoding and Interchange. 2.9.1. 2015-10-15. TEI Consortium. http://www.tei-c.org/Vault/P5/current/ (accessed 22 February 2016).

Notes

Word formation is the creation of a new word from either the combination of two other words ( dish-washer, compounding) or of adding one of more affixes to an existing word ( wash-er, derivation), or from a part of speech change ( clean, verb vs. clean, adjective).

The Index Thomisticus (IT) is considered a pathfinder in digital humanities; started by Padre Roberto Busa in 1949. It is a database retaining the opera omnia by Thomas Aquinas (118 texts), plus works by other 61 authors related to Thomas (61 texts). The size of the corpus is around 11 million tokens (150.000 types; 20.000 lemmas). The corpus is fully lemmatised and morphologically tagged. The IT-TB, based at CIRCSE, is the syntactically annotated portion of the IT, and it contains around 300.000 tokens for 15.000 syntactically parsed sentences.

By “morphological family” we mean the set of lemmas morphologically derived from one common ancestor-lemma

WFRs do not take in input morphologically simple lemmas only, but also complex ones. For example, the noun excubatio derives by suffixation from the verb excubo, which is morphologically complex, as it is derived (by prefixation) from the verb cubo.

i.e. stem change featuring internal vowel alternation ( fac.io, per-fic-io), assimilation of prefix ( fer-o > *ob-fer-o > of-fer-o), unclear segmentation ( cre-a-tor or cre-at-or?), etc.