Sonderforschungsbereich 732:

Project B3-E (2006-2014)

Disambiguation of nominalizations in the extraction of linguistic data from corpus text

Many words and syntactic constructions of natural language have several readings, i.e. they are ambiguous. Such ambiguities can cause problems in the automatic extraction of data from texts, as used, for example, in the creation of dictionaries and grammars, in Question Answering, or in Information Retrieval. This project aims to make data extraction aware of ambiguities in order to increase the quality of the extracted data.

Data extraction is often faced with the following types of ambiguities:

  • lexical ambiguities: a word has several readings, depending on the context;
  • structural ambiguities: a grammatical construction may be interpreted in different ways, and thus a given sequence of words may or may not be an acceptable result of a given extraction query;
  • the two may be interwoven, with structural ambiguity contributing to interpretation problems at the word level.

In this project, we work on lexical ambiguities in German nominalizations of verbs, especially those with the suffix '-ung', such as 'Teilung' (division) and 'Anwendung' (application). Many such nouns are ambiguous between an event reading ('Teilung durchführen', to carry out a division) and a state reading ('Teilung besteht', a division exists), or between an event reading ('Messung vornehmen', to perform a measurement) and a (result) object reading ('Messung (= Meßergebnis) liegt vor', a measurement, i.e. a measurement result, is available), or between all three interpretations. As the examples show, a given interpretation of an '-ung'-nominalization, and thus its disambiguation, may be supported or enforced by lexical or grammatical indicators from the context. Examples of such indicators are verbs that embed the nominalization (as in the examples above), adjectival modifiers, and certain types of prepositional phrases in the nominalization's context.
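The role of such indicators can be illustrated with a minimal sketch. It is not the project's tooling: the function name `narrow_readings`, the two small lexicons, and the decision to keep a noun underspecified when no indicator is known are all assumptions made for illustration. Only the four noun-verb pairings are taken from the examples above.

```python
from enum import Enum


class Reading(Enum):
    EVENT = "event"
    STATE = "state"
    OBJECT = "result object"


# Hypothetical lexicon: readings each noun can have in principle.
# The entries mirror the examples in the text and are not exhaustive.
NOUN_READINGS = {
    "Teilung": {Reading.EVENT, Reading.STATE},
    "Messung": {Reading.EVENT, Reading.OBJECT},
}

# Hypothetical indicator lexicon: lemma of the embedding verb -> supported readings.
VERB_INDICATORS = {
    "durchführen": {Reading.EVENT},
    "vornehmen": {Reading.EVENT},
    "bestehen": {Reading.STATE},
    "vorliegen": {Reading.OBJECT},
}


def narrow_readings(noun, embedding_verb):
    """Return the set of readings still possible for `noun` in this context.

    A set with more than one element is an underspecified result:
    the context gave no usable indicator, so nothing is ruled out.
    """
    # Unknown nouns start out fully ambiguous (assumption).
    candidates = set(NOUN_READINGS.get(noun, set(Reading)))
    supported = VERB_INDICATORS.get(embedding_verb)
    if supported is None:
        return candidates  # no indicator: stay underspecified
    narrowed = candidates & supported
    # If the indicator conflicts with every candidate, keep the ambiguity
    # rather than returning an empty set (assumption).
    return narrowed or candidates


if __name__ == "__main__":
    for noun, verb in [("Teilung", "durchführen"), ("Teilung", "bestehen"),
                       ("Messung", "vorliegen"), ("Messung", "sehen")]:
        readings = sorted(r.value for r in narrow_readings(noun, verb))
        print(f"{noun} + {verb}: {readings}")
```

Returning a set rather than a single label keeps unresolved cases visible instead of forcing a guess. Adjectival modifiers or prepositional phrases could be handled in the same way by intersecting further indicator sets.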

We aim to make data extraction aware of ambiguities: some need to be resolved in order to obtain high-quality extraction results; for others, it is sufficient to recognize that they have no impact on the extraction. Classifying and resolving the ambiguities requires at least the context of a sentence; in some cases (not analyzed in this project), more context is necessary.

In detail, our objectives include the development of the following components:

  • explicitly underspecified representations for structural and lexical ambiguities, and annotation schemes to represent these ambiguities in corpus text;
  • computational linguistic tool components for the syntactic analysis of candidate corpus data based on dependency parsing, which make it possible to create underspecified representations of ambiguities;
  • proposals for tool architectures which combine the tool components;
  • data collections with analyzed and classified instances of '-ung'-nominalizations (cf. a proposal based on a relational database, presented as a poster at DGfS-2009 (Osnabrück)), and tools to interpret these data, so as to be able to derive generalizations.
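As a rough illustration of the last component, the sketch below stores classified instances in a relational table and counts which readings co-occur with which indicators. The schema, the table name `instance`, its column names, and the helper functions are hypothetical and do not reproduce the DGfS-2009 proposal. The four rows are the examples from the text, not corpus data.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical table of classified instances."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE instance (
            id             INTEGER PRIMARY KEY,
            lemma          TEXT NOT NULL,  -- the '-ung' nominalization
            indicator_type TEXT NOT NULL,  -- e.g. embedding verb, adjective, PP
            indicator      TEXT NOT NULL,  -- lemma of the contextual indicator
            reading        TEXT NOT NULL   -- event, state, or object
        )
        """
    )
    rows = [
        ("Teilung", "embedding_verb", "durchführen", "event"),
        ("Teilung", "embedding_verb", "bestehen", "state"),
        ("Messung", "embedding_verb", "vornehmen", "event"),
        ("Messung", "embedding_verb", "vorliegen", "object"),
    ]
    conn.executemany(
        "INSERT INTO instance (lemma, indicator_type, indicator, reading) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )
    conn.commit()
    return conn


def readings_per_indicator(conn):
    """Count how often each indicator occurs with each reading."""
    cursor = conn.execute(
        "SELECT indicator, reading, COUNT(*) FROM instance "
        "GROUP BY indicator, reading ORDER BY indicator, reading"
    )
    return cursor.fetchall()


if __name__ == "__main__":
    for indicator, reading, count in readings_per_indicator(build_demo_db()):
        print(f"{indicator}: {reading} ({count})")
```

With a larger collection, an indicator that occurs almost exclusively with one reading would be a candidate for a generalization of the kind the project aims to derive.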

This project thus intends to contribute to the syntactic and semantic representation and to the handling of ambiguities in large corpora, and to the disambiguation of German nominalizations.

PI: Ulrich Heid (01.07.2006 - 30.09.2011), Jonas Kuhn
Researchers: Andre Blessing, Kerstin Eckart, Ina Rösiger