The Internet is a revolution that will not stop "until everything is digitized."
Louis Gerstner, former Chairman of IBM, quoted in The Economist, June 4, 1998
The goal of this workshop is twofold. The first is to provide a venue for researchers to describe and discuss practical methods and tools for constructing semantically annotated text collections, the raw material needed to build knowledge-rich applications. We expect such tools to include lexical and semantic resources, with a focus on interlinking concepts and entities and integrating them into corpora.
The second is to report on the ongoing development of new tools that provide access to the rich information contained in large text collections. Semantic tools and resources, notably, are reaching a level of quality that makes them fit for building practical applications; they include ontologies, framenets, syntactic parsers, semantic parsers, entity linkers, and the like. We are interested in case studies that apply such advanced tools, and in their evaluation, in the field of digital humanities, with particular interest in the multilingual and cross-lingual aspects of semantic text processing.
The workshop will include one, and possibly two, invited speakers of international reputation.
One consequence of the digital revolution is the gradual but inexorable availability of all kinds of text in machine-readable form. Libraries around the world are scanning their collections. Newspapers offer their articles on the web. Governments put their archives and laws online. A large part of what the human mind has produced, from literature and essays to encyclopedias and biographies, is, or soon will be, accessible in computerized form in a wide variety of languages. We can predict that within a few years (nearly) all text ever produced by humanity will be available in digital form, either born digital or digitized from books, newspapers, archives, and other sources.
While digitization is well underway, turning the information contained in these texts into exploitable knowledge for the information society has become both a major challenge and a major opportunity. IBM Watson and Google's Knowledge Graph are recent and spectacular achievements that show the significance of knowledge extraction from text. IBM Watson is a system that answered questions on the US quiz show Jeopardy! better than any human contestant. One of its core components is the PRISMATIC knowledge base, consisting of one billion semantic propositions extracted from the English Wikipedia and the New York Times, while Google's Knowledge Graph is built on the systematic extraction of millions of entities from a variety of sources. Such technologies are defining the information age, and they have the potential to bring a much higher degree of sophistication to the "distant reading" methodology in digital humanities, enabling large-scale access to text content.
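To make the notion of a semantic proposition concrete, the sketch below extracts naive subject-verb-object triples from dependency parses. It is a toy illustration in the spirit of such knowledge bases, not the actual PRISMATIC pipeline; the choice of the spaCy library and its en_core_web_sm model is our own assumption, not part of Watson.

import spacy

nlp = spacy.load("en_core_web_sm")

def svo_triples(text):
    # Collect naive (subject, verb, object) propositions from a parsed text.
    doc = nlp(text)
    triples = []
    for tok in doc:
        if tok.pos_ == "VERB":
            subjects = [c for c in tok.children if c.dep_ == "nsubj"]
            objects = [c for c in tok.children if c.dep_ in ("dobj", "obj")]
            for subj in subjects:
                for obj in objects:
                    triples.append((subj.text, tok.lemma_, obj.text))
    return triples

print(svo_triples("IBM developed Watson. Watson answers questions."))
# Output is model-dependent, e.g.:
# [('IBM', 'develop', 'Watson'), ('Watson', 'answer', 'questions')]

A real pipeline would normalize arguments, attach modifiers, and aggregate counts over billions of sentences; the point here is only the shape of the extracted data.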
The target audience is a mix of users who would like to apply semantic processing techniques to text and of researchers in this area. Users, for instance, could be interested in extracting entities and associating them with encyclopedic text, or in extracting relations from text: dates and places of birth and death, professions, and so on. Researchers would describe practical techniques and algorithms that fit the needs of these users.
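As a small illustration of this use case, the sketch below tags entities and pairs a person with a birth date and birthplace using a simple lexical trigger. Again, spaCy and its small English model are assumptions of ours; any NLP pipeline with named-entity recognition would serve.

import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp("Ada Lovelace was born in London on 10 December 1815.")

# Entity extraction: spaCy's NER marks spans such as PERSON, GPE, and DATE.
print([(ent.text, ent.label_) for ent in doc.ents])
# e.g. [('Ada Lovelace', 'PERSON'), ('London', 'GPE'), ('10 December 1815', 'DATE')]

# Naive relation extraction: in a sentence containing the trigger "born",
# pair each PERSON with any DATE (birth date) and GPE (birth place).
for sent in doc.sents:
    if any(tok.lower_ == "born" for tok in sent):
        persons = [e for e in sent.ents if e.label_ == "PERSON"]
        for person in persons:
            for ent in sent.ents:
                if ent.label_ == "DATE":
                    print((person.text, "born_on", ent.text))
                elif ent.label_ == "GPE":
                    print((person.text, "born_in", ent.text))

Pattern-based extractors like this are brittle; the workshop's interest lies precisely in the more robust parsers and entity linkers that can replace them.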