This short paper describes the initial phases of a Marie Curie research project, Word Formation Latin (WFL), developed at the Centro Interdisciplinare di Ricerche per la Computerizzazione dei Segni dell’Espressione (CIRCSE) at the Università Cattolica del Sacro Cuore, Milan, Italy. The project consists in compiling, by means of computational linguistic methods, a derivational morphological dictionary of Latin that connects lexical elements on the basis of word formation rules.
The past two decades have seen a considerable increase in the creation of computational language resources for the investigation of classical languages, bringing the state of the art almost level with that of the resources currently available for modern languages.
However, the existing language resources for Latin still lack a derivational morphological dictionary that connects lexical elements on the basis of Word Formation Rules (WFRs) [1].
A first attempt at constructing a word formation based lexicon for Latin was made by Marco Passarotti and Francesco Mambrini (Passarotti & Mambrini, 2012). The WFL project has been awarded funding to expand on these efforts.
The project has three main aims:
The project relies on the automatic realisation of the linguistic resource, both in the creation of WFRs and in their application to the lexical items included in the morphological analyser LEMLAT. The LEMLAT lexical basis contains around 40,000 lemmas from three major Latin dictionaries (Georges, 1913-1918; Gradenwitz, 1904; Glare, 1982). We conceived WFRs according to the so-called Item-and-Arrangement (IA) model, which follows a morpheme-based approach to morphology. In IA, word forms are analysed as arrangements of morphemes according to the following three axioms:
The aim is to assign a WFR to each morphologically complex lemma (i.e. one morphologically derived from another lemma) and to link each complex lemma to its ancestor. The data are organised and presented according to a system similar to that for morphological dictionaries devised by Word Manager, in which relations between the members of the same morphological family are represented in a tree-graph.
WFRs are grouped into two classes: 1. compounding; 2. derivational. Derivational rules are divided into two categories: a. affixal (in turn split into prefixal and suffixal), and b. conversive, a derivation process that does not involve any affix; these are manually defined.
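As a rough illustration of this taxonomy, the Python sketch below models a WFR as a small record. The class and field names (`RuleClass`, `DerivationType`, `WordFormationRule`, `input_pos`, `output_pos`, `affix`) and the PoS tags `"V"` and `"A"` are assumptions made for this example only; they do not describe the project's actual data model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class RuleClass(Enum):
    COMPOUNDING = "compounding"
    DERIVATIONAL = "derivational"


class DerivationType(Enum):
    PREFIXAL = "prefixal"      # affixal
    SUFFIXAL = "suffixal"      # affixal
    CONVERSIVE = "conversive"  # no affix involved


@dataclass(frozen=True)
class WordFormationRule:
    """Hypothetical record for one WFR; all field names are illustrative."""
    rule_class: RuleClass
    input_pos: Tuple[str, ...]  # one PoS for derivation, two for compounding
    output_pos: str
    derivation: Optional[DerivationType] = None
    affix: Optional[str] = None

    def __post_init__(self) -> None:
        # Enforce the shape of the taxonomy described in the text.
        if self.rule_class is RuleClass.COMPOUNDING:
            if self.derivation is not None or self.affix is not None:
                raise ValueError("compounding rules take no derivation type or affix")
            return
        if self.derivation is None:
            raise ValueError("derivational rules need a derivation type")
        if self.is_affixal and not self.affix:
            raise ValueError("affixal rules need an affix")
        if not self.is_affixal and self.affix:
            raise ValueError("conversive rules do not involve an affix")

    @property
    def is_affixal(self) -> bool:
        return self.derivation in (DerivationType.PREFIXAL, DerivationType.SUFFIXAL)


# Verb-to-verb prefixal rule, as in cubo > excubo.
ex_rule = WordFormationRule(RuleClass.DERIVATIONAL, ("V",), "V",
                            DerivationType.PREFIXAL, "ex")
# A conversive rule changes the part of speech without any affix.
conversion = WordFormationRule(RuleClass.DERIVATIONAL, ("V",), "A",
                               DerivationType.CONVERSIVE)
```

Validating the combinations in `__post_init__` keeps impossible rules (for instance a conversive rule carrying an affix) out of the rule set at construction time.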
The construction of the resource proceeds in two steps:
1. Phase A: Semi-automatic data-driven finding of WFRs:
2. Phase B: Application (and evaluation) of the WFRs resulting from Phase A, and creation of the “morphological families” [3]. New rules are added in this phase by comparison with the data. Phase B is divided into two subtasks:
All lemmas, whether morphologically simple or complex [4], that share the same invariable part are automatically assigned to the same morphological family.
Finally, the members of each family are automatically linked to each other according to their PoS, inflectional category and affixes by means of the WFR assignment (2.a). The morphologically simple (i.e. not derived) lemma member is assigned the role of ancestor of the family.
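The linking step can be pictured with a minimal Python sketch. It assumes that WFR assignment has already produced (output lemma, input lemma) links and that every derived lemma has exactly one input lemma, which leaves compounds aside; the function name `build_families` is hypothetical. Unlike the procedure described above, which first groups lemmas by their shared invariable part, this simplification recovers each family by following the links back to the underived ancestor.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_families(links: Iterable[Tuple[str, str]],
                   lemmas: Iterable[str]) -> Dict[str, List[str]]:
    """Group lemmas under their underived ancestor.

    `links` holds (output lemma, input lemma) pairs, one per derived lemma.
    """
    parent = dict(links)

    def ancestor(lemma: str) -> str:
        seen = set()
        # Walk up the derivation chain until a lemma with no input is reached.
        while lemma in parent:
            if lemma in seen:
                raise ValueError(f"cyclic derivation involving {lemma!r}")
            seen.add(lemma)
            lemma = parent[lemma]
        return lemma

    families: Dict[str, List[str]] = defaultdict(list)
    for lemma in lemmas:
        families[ancestor(lemma)].append(lemma)
    return dict(families)


links = [("excubo", "cubo"), ("excubatio", "excubo"), ("offero", "fero")]
lemmas = ["cubo", "excubo", "excubatio", "fero", "offero"]
families = build_families(links, lemmas)
print(families)
# {'cubo': ['cubo', 'excubo', 'excubatio'], 'fero': ['fero', 'offero']}
```

Keeping the links as parent pointers also preserves the tree shape of each family: excubatio hangs from excubo, which in turn hangs from the ancestor cubo.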
Phase A finds the WFRs; Phase B applies them to the data, obtaining input and output lemmas for each WFR.
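To make the idea of obtaining input and output lemmas concrete, here is a minimal Python sketch of applying a single same-PoS prefixal rule to a toy lexicon by plain string matching. The function `apply_prefixal_rule`, the `(form, PoS)` representation and the tags `"V"` and `"N"` are assumptions for illustration; LEMLAT's actual lexical basis and the project's procedure are not shown here.

```python
from typing import List, Set, Tuple

Lemma = Tuple[str, str]  # (written form, part of speech)


def apply_prefixal_rule(prefix: str, pos: str,
                        lexicon: Set[Lemma]) -> List[Tuple[str, str]]:
    """Return sorted (input lemma, output lemma) pairs for a same-PoS
    prefixal rule, using plain string matching only."""
    pairs = []
    for form, lemma_pos in lexicon:
        if lemma_pos != pos or not form.startswith(prefix):
            continue
        base = form[len(prefix):]
        # Keep the pair only if the remainder is itself a lemma of that PoS.
        if (base, pos) in lexicon:
            pairs.append((base, form))
    return sorted(pairs)


toy_lexicon = {
    ("cubo", "V"), ("excubo", "V"), ("excubatio", "N"),
    ("facio", "V"), ("perficio", "V"),
}

print(apply_prefixal_rule("ex", "V", toy_lexicon))   # [('cubo', 'excubo')]
print(apply_prefixal_rule("per", "V", toy_lexicon))  # []
```

The second call returns nothing because the stem of perficio differs from that of facio, so naive matching misses the pair. Cases of this kind are the ones that call for more complex rules or manual coding.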
Phase A is not to be considered exhaustive, but exploratory: the recall of the WFRs identified in Phase A is not 100%. The aim in the first phase of the project is to refine the data by tagging the highest possible number of lexemes using data-driven WFRs, which will become increasingly complex and cover most well-known word formation issues [5]. Given the high number of homographs in Latin, this automatic procedure is not regarded as definitive for building the morphological families. However, it is helpful because it provides filtered data that must then be checked manually.
This is why Phase B is needed: by comparison with the evidence given by the data, we can identify the rules that were missed in Phase A. Manual hard-coding will be necessary for lemmas produced by poorly productive WFRs or by morphotactically obscure word formation processes. The language resource is evaluated by manually checking data organised into homogeneous groups based on WFRs (coverage of rules) and stemming (coverage of morphological families). Precision and recall are used as evaluation metrics to calculate the rate of positive and negative cases.
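For reference, precision and recall over derivation links can be computed as in the following Python sketch, which compares automatically proposed (input lemma, output lemma) pairs against a manually checked set. The function name and the toy data are illustrative assumptions, not project results.

```python
from typing import Iterable, Set, Tuple

Pair = Tuple[str, str]  # (input lemma, output lemma)


def precision_recall(predicted: Iterable[Pair],
                     gold: Iterable[Pair]) -> Tuple[float, float]:
    """Precision = correct predictions / all predictions;
    recall = correct predictions / all gold pairs."""
    predicted_set: Set[Pair] = set(predicted)
    gold_set: Set[Pair] = set(gold)
    true_positives = len(predicted_set & gold_set)
    precision = true_positives / len(predicted_set) if predicted_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    return precision, recall


# Toy data: one proposed link is correct, one skips the intermediate lemma
# excubo, and two gold links were not proposed at all.
gold = [("cubo", "excubo"), ("excubo", "excubatio"), ("facio", "perficio")]
predicted = [("cubo", "excubo"), ("cubo", "excubatio")]

p, r = precision_recall(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.33
```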
To date, 118 WFRs have been found automatically. Around 50 of these rules have been added to an SQL database, resulting in the tagging of some 9,000 morphologically complex lexemes. These are the rules that show a certain degree of morphological transparency and are therefore easier to obtain automatically from the input-output relation (e.g. derivational, verb-to-verb, prefixal rules).
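The project's database schema is not described here. Purely as an illustration of how rules and tagged lexemes might be stored relationally, the Python sketch below uses the standard `sqlite3` module with an in-memory database; the table and column names (`lemma`, `wfr`, `derivation`, and so on) are hypothetical.

```python
import sqlite3

# Hypothetical three-table layout: lemmas, rules, and the links between them.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lemma (
    id   INTEGER PRIMARY KEY,
    form TEXT NOT NULL,
    pos  TEXT NOT NULL
);
CREATE TABLE wfr (
    id         INTEGER PRIMARY KEY,
    category   TEXT NOT NULL,  -- e.g. 'prefixal', 'suffixal', 'conversive'
    affix      TEXT,           -- NULL when no affix is involved
    input_pos  TEXT NOT NULL,
    output_pos TEXT NOT NULL
);
CREATE TABLE derivation (
    output_lemma INTEGER NOT NULL REFERENCES lemma(id),
    input_lemma  INTEGER NOT NULL REFERENCES lemma(id),
    wfr_id       INTEGER NOT NULL REFERENCES wfr(id)
);
""")

conn.executemany("INSERT INTO lemma VALUES (?, ?, ?)",
                 [(1, "cubo", "V"), (2, "excubo", "V")])
conn.execute("INSERT INTO wfr VALUES (1, 'prefixal', 'ex', 'V', 'V')")
# Tag excubo as derived from cubo by rule 1.
conn.execute("INSERT INTO derivation VALUES (2, 1, 1)")

# List every tagged complex lemma with its input lemma and rule category.
rows = conn.execute("""
    SELECT i.form, o.form, w.category
    FROM derivation AS d
    JOIN lemma AS i ON i.id = d.input_lemma
    JOIN lemma AS o ON o.id = d.output_lemma
    JOIN wfr   AS w ON w.id = d.wfr_id
    ORDER BY o.form
""").fetchall()
print(rows)  # [('cubo', 'excubo', 'prefixal')]
```

Storing one row per derivation link keeps the rule inventory separate from the lexicon, so a rule can be revised or withdrawn without touching the lemma table.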
The final resource will be available both as a standalone dictionary, accessible through its own website, and as a resource interconnected with the Index Thomisticus Treebank (IT-TB) [2].
The integration with the IT-TB will be carried out by embedding the dictionary data within the morphological layer of annotation of the treebank, using TEI (Text Encoding Initiative) P5-conformant XML encoding to favour data exchange and linking to other lexical resources. Once encoded in XML, the data resulting from the dictionary will be applied to the IT-TB data.
[1] Word formation is the creation of a new word by combining two other words (dish-washer, compounding), by adding one or more affixes to an existing word (wash-er, derivation), or through a change of part of speech (clean, verb vs. clean, adjective).
[2] The Index Thomisticus (IT), started by Padre Roberto Busa in 1949, is considered a pathfinder in digital humanities. It is a database containing the opera omnia of Thomas Aquinas (118 texts), plus works by 61 other authors related to Thomas (61 texts). The corpus comprises around 11 million tokens (150,000 types; 20,000 lemmas) and is fully lemmatised and morphologically tagged. The IT-TB, based at CIRCSE, is the syntactically annotated portion of the IT; it contains around 300,000 tokens in 15,000 syntactically parsed sentences.
[3] By “morphological family” we mean the set of lemmas morphologically derived from one common ancestor lemma.
[4] WFRs take as input not only morphologically simple lemmas but also complex ones. For example, the noun excubatio derives by suffixation from the verb excubo, which is itself morphologically complex, as it is derived (by prefixation) from the verb cubo.
[5] These include stem change featuring internal vowel alternation (fac-io, per-fic-io), assimilation of the prefix (fer-o > *ob-fer-o > of-fer-o), and unclear segmentation (cre-a-tor or cre-at-or?).