DH 2016 Abstracts

Diachronic Semantic Lexicon of Dutch (Diachroon semantisch lexicon van de Nederlandse Taal; DiaMaNT)

The Dutch language has been described extensively in the comprehensive historical dictionaries of the Institute for Dutch Lexicology (INL). These dictionaries (Oudnederlands Woordenboek, Dictionary of Old Dutch, ca. 500-1200; Vroegmiddelnederlands Woordenboek, Dictionary of Early Middle Dutch, 1200-1300; Middelnederlandsch Woordenboek, MNW, Dictionary of Middle Dutch, ca. 1250-1550; Woordenboek der Nederlandsche Taal, Dictionary of the Dutch Language, 1500-1976) cover over 15 centuries of Dutch and are as such an excellent guide to understanding historical language. The dictionaries also provide the core material for the diachronic computational lexicon of Dutch (GiGaNT), which can be used to support search in historical texts by users without (expert) knowledge of historical spelling variation: when searching for slager (‘butcher’), the user also gets the morphological and spelling variants such as slagers, slagher(s), slaeger(s), slaegher(s) or slegher(s). However, when a user wants to study the history of the butcher’s trade, it is not immediately obvious from the way these traditional dictionaries are structured that one also has to look for vleeschhouwer, beenhouwer or beenhakker. And it is only after reading the complete articles that a user learns that vleeschhouwer can also mean ‘executioner’, and slager ‘a person who slays someone’. The direction of derivation differs, however: in the case of vleeschhouwer the meaning ‘executioner’ is derived from ‘butcher’, while the contemporary meaning of slager, ‘butcher’, is derived from the meaning ‘a person who slays someone’.

In this contribution we describe the first results of our work on the development of a diachronic semantic lexicon of Dutch. The lexicon aims to enhance text accessibility and to foster research into the development of concepts, by interrelating attested word forms and semantic units (concepts), and by tracing semantic developments in time. In the lexicon, the diachronic onomasiology, i.e. the change in the naming of concepts, and the diachronic semasiology, i.e. the change in the meaning of words, will be recorded in a way suitable for use by humans and computers. The onomasiological part of the lexicon is meant to enhance recall in text retrieval by providing different verbal expressions of a concept or of related concepts (slager → beenhouwer, beenhakker, vleeshouwer; boer → landman). The diachronic semasiological component, which charts semantic change, aims to enhance precision by enabling the user to take semantic change into account; the oldest meaning of appel, for example, is ‘a fruit’ (so appel is also used for pears, plums, etc.).
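As a minimal illustration of how the onomasiological component could support recall, the Python sketch below groups lemmas under a concept and expands a query to every lemma that names the same concept. The word lists are taken from the examples above; the concept labels, the class name `Concept` and the function name `expand_query` are hypothetical and do not reflect the actual lexicon design.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A semantic unit with the lemmas that can name it (hypothetical model)."""
    concept_id: str                      # illustrative label, not a real identifier
    lemmas: set = field(default_factory=set)


# Toy content based on the examples in the text; the labels are assumptions.
CONCEPTS = [
    Concept("BUTCHER", {"slager", "beenhouwer", "beenhakker", "vleeshouwer"}),
    Concept("FARMER", {"boer", "landman"}),
]


def expand_query(lemma):
    """Return the lemma plus all lemmas that share at least one concept with it."""
    expanded = {lemma}
    for concept in CONCEPTS:
        if lemma in concept.lemmas:
            expanded |= concept.lemmas
    return expanded


if __name__ == "__main__":
    print(sorted(expand_query("slager")))
```

A lemma that is not linked to any concept is returned unchanged, so the expansion never narrows a query. Expansion to spelling and morphological variants would be a separate layer on top of this lemma-level step.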

We describe the structure of the diachronic semantic lexicon and procedures for the acquisition and aggregation of content. The INL historical dictionaries will be the main source of the lexicon, as these dictionaries describe the Dutch lexicon from the 6th to the 20th century and cover most of the basic vocabulary of this period. Word sense descriptions are illustrated by dated quotations, which constitute a first step towards dating a concept. The temporal distribution of quotations pertaining to different senses gives a first picture of the diachronic development of the sense inventory of a headword. The fact that many words in the historical dictionaries are defined (partly) by synonym definitions and contemporary semantic (near-)equivalents enables us to extract an initial set of semantic relations.
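The use of dated quotations for a first dating of senses can be sketched as follows. This is only an illustration under stated assumptions: the record layout (headword, sense label, quotation year), the function names and all values are invented placeholders, not data from the INL dictionaries.

```python
from collections import Counter, defaultdict

# Invented placeholder records: (headword, sense label, year of quotation).
QUOTATIONS = [
    ("woord", "A", 1350),
    ("woord", "A", 1480),
    ("woord", "B", 1560),
    ("woord", "B", 1890),
    ("woord", "B", 1895),
]


def century_of(year):
    """Map a year to its century, e.g. 1350 -> 14 and 1400 -> 14."""
    return (year - 1) // 100 + 1


def first_attestation(records):
    """Earliest quotation year for each (headword, sense) pair."""
    first = {}
    for headword, sense, year in records:
        key = (headword, sense)
        first[key] = min(year, first.get(key, year))
    return first


def century_distribution(records):
    """Number of quotations per century for each (headword, sense) pair."""
    dist = defaultdict(Counter)
    for headword, sense, year in records:
        dist[(headword, sense)][century_of(year)] += 1
    return dist


if __name__ == "__main__":
    print(first_attestation(QUOTATIONS))
    for key, counts in century_distribution(QUOTATIONS).items():
        print(key, dict(sorted(counts.items())))
```

The earliest quotation gives only an upper bound for the age of a sense, since a sense may have been in use before its first recorded quotation.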

Information from other sources is not disregarded. For contemporary Dutch, several lexical resources cataloguing semantic relationships are available. These include traditional synonym dictionaries like Brouwers’ “Het juiste woord” and more recent initiatives such as Open Dutch Wordnet (Vossen). For some specific domains, thesauri with a diachronic component are in development, e.g. HISCO (http://historyofwork.iisg.nl/index.php).

Besides lexical sources, diachronic corpus material¹ and corpus-based methods are essential to the development of the lexicon content and to the verification of its relevance. This includes (i) corpus-based analysis of semantic change at the “type” level, using distributional methods, where the fact that our starting point is defined by the set of quotation dates per word sense provides an interesting perspective; and (ii) research into the application of token-based distributional methods to the interlinking of historical corpora and lexical resources.
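A type-level distributional comparison of the kind mentioned in (i) can be sketched, under simplifying assumptions, as the cosine similarity between the co-occurrence vectors of a word in two time slices. The code below is a generic illustration, not the method used in the project: the window size, the function names and the toy token sequences are all assumptions.

```python
import math
from collections import Counter


def cooccurrence_vector(sentences, target, window=2):
    """Count the tokens occurring within `window` positions of `target`."""
    vec = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok == target:
                left = tokens[max(0, i - window):i]
                right = tokens[i + 1:i + 1 + window]
                vec.update(left + right)
    return vec


def cosine(a, b):
    """Cosine similarity of two sparse count vectors; 0.0 if either is empty."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Invented toy slices; real input would be tokenised, dated corpus text.
    early = [["a", "w", "b"], ["b", "w", "a"]]
    late = [["c", "w", "d"], ["a", "w", "d"]]
    v_early = cooccurrence_vector(early, "w")
    v_late = cooccurrence_vector(late, "w")
    print(round(cosine(v_early, v_late), 3))
```

A low similarity between periods is at most an indication of possible semantic change; the quotation dates per sense could then be used to choose the time slices and to check such indications against the dictionary evidence.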

Bibliography

Fellbaum, C. ed. (1999). WordNet. An Electronic Lexical Database. London: The MIT Press.
Geeraerts, D., et al. (1994). The Structure of Lexical Variation. Meaning, Naming, and Context. Berlin/New York: Mouton de Gruyter.
Geeraerts, D. (1997). Diachronic Prototype Semantics. A Contribution to Historical Lexicology. Oxford: Clarendon Press.
Geeraerts, D. (2010). Theories of Lexical Semantics. Oxford/New York: Oxford University Press.
Gulordava, K. and Baroni, M. (2011). A distributional similarity approach to the detection of semantic change in the Google Books Ngram corpus. Proceedings of the EMNLP 2011 Geometrical Models for Natural Language Semantics (GEMS 2011) Workshop, pp. 67-71.
Heylen, K., et al. (2015). Monitoring polysemy: Word space models as a tool for large-scale lexical semantic analysis. Lingua, 157: 153-72.
Kay, C. J. and Chase, T. J. P. (1987). Constructing a Thesaurus Database. Literary and Linguistic Computing, 2(3): 161-63.
Laurence, S. and Margolis, E. (1999). Concepts and Cognitive Science. In Margolis, E. and Laurence, S., Concepts. Core Readings. Cambridge (US)/London: The MIT Press, pp. 3-81.
Sijs, N. van der (2001). Etymologie in het digitale tijdperk. Een chronologisch woordenboek als praktijkvoorbeeld. Ph.D. thesis, Universiteit Leiden.
Vanhove, M. ed. (2008). From Polysemy to Semantic Change. Towards a typology of lexical semantic associations. Amsterdam/Philadelphia: John Benjamins Publishing Company.
Vossen, P. ed. (1998). EuroWordNet: A multilingual database with lexical semantic networks. Reprinted from Computers and the Humanities, Vol. 32, Nos. 2-3, 1998. Dordrecht/Boston/London: Kluwer Academic Publishers.

Notes

¹ Corpora: DBNL (Digital Library of Dutch Literature, http://www.dbnl.nl), digitized newspaper collections at the Dutch Royal Library, and other collections digitized by the Royal Library (http://www.delpher.nl).