Sonderforschungsbereich 732:

Project D5-E (2006-2010)

Biased Learning for Syntactic Disambiguation 

D5 investigates syntactic disambiguation as a particular type of specification. From the ambiguous output of a treebank-trained parser (delivered in the form of the n best parses), we identify the most likely parse based on contextual knowledge. All lexical, syntactic, semantic, pragmatic and world knowledge that a listener can use to interpret an utterance is viewed as context. One of the aims of the project is to model this linguistic and extralinguistic knowledge in the form of a database compiled from large unannotated text corpora.

As an example of how D5 will bring contextual information extracted from corpora to bear on the problem of disambiguation, consider the phrase "an opening under the house that led to a fume-filled coal mine". In an unannotated corpus, one can find similar contexts that mention an opening leading to a coal mine, but none that talk about a house leading to a coal mine. This suggests that the relative clause attaches to "opening", not to "house".
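The corpus lookup in this example can be illustrated with a minimal Python sketch. It is not the project's method: the toy corpus, the function name count_support, the list of verb forms and the three-token window are all assumptions made for illustration. The sketch counts, for each candidate head noun, the sentences in which the noun is shortly followed by a form of "lead" and later by "mine", and prefers the candidate with more support.

```python
import re

def count_support(corpus, noun, verb_forms, obj, window=3):
    """Count sentences where `noun` is followed within `window` tokens
    by one of `verb_forms`, with `obj` appearing later in the sentence."""
    hits = 0
    for sentence in corpus:
        tokens = re.findall(r"[a-z]+", sentence.lower())
        for i, tok in enumerate(tokens):
            if tok != noun:
                continue
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                if tokens[j] in verb_forms and obj in tokens[j + 1:]:
                    hits += 1
                    break
            else:
                continue  # no match for this occurrence, try the next one
            break  # count each sentence at most once
    return hits

# Hypothetical toy corpus standing in for a large unannotated corpus.
corpus = [
    "The opening led to an abandoned mine.",
    "A narrow opening leads into the old coal mine.",
    "The house stood near the mine.",
    "The path from the house leads to the river.",
]
verb_forms = {"lead", "leads", "led", "leading"}
candidates = ["opening", "house"]

support = {n: count_support(corpus, n, verb_forms, "mine") for n in candidates}
best = max(candidates, key=lambda n: support[n])
print(support, best)
```

A real system would need far more data and a less brittle notion of "similar context" than surface token windows; the sketch only shows how corpus counts can favor one attachment site over the other.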

The project uses machine learning methods that are informed and guided by linguistic knowledge. We call these methods biased learning. Biased learning will be used to merge syntactic knowledge from the treebank parser with contextual knowledge from the database, using the framework of Exemplar Theory. In this framework, two similarity measures will be defined which, given an ambiguous parse, determine the most similar fragments in the exemplar database compiled from the corpus. The disambiguation decision will be based on these similar fragments. The first similarity measure is based on language models, the second on dependency structures. For the acquisition of the exemplar database, both monolingual and multilingual parallel corpora will be exploited.
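To make the idea of exemplar-based disambiguation concrete, here is a minimal Python sketch of a dependency-based similarity measure. The text does not specify how the measures are defined, so everything here is an assumption: parses and exemplars are represented as sets of (head, relation, dependent) triples, similarity is Jaccard overlap, and a parse is scored by the sum of its k highest similarities to the exemplar database. The relation labels and all function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard overlap between two sets of dependency triples."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def score_parse(parse, exemplars, k=2):
    """Sum of the k highest similarities between a parse and the exemplars."""
    sims = sorted((jaccard(parse, ex) for ex in exemplars), reverse=True)
    return sum(sims[:k])

def disambiguate(parses, exemplars, k=2):
    """Return the index of the candidate parse best supported by exemplars."""
    return max(range(len(parses)),
               key=lambda i: score_parse(parses[i], exemplars, k))

# Two candidate parses of "an opening under the house that led to a ... mine".
high_attachment = {("opening", "rcmod", "led"), ("led", "prep_to", "mine"),
                   ("opening", "prep_under", "house")}
low_attachment = {("house", "rcmod", "led"), ("led", "prep_to", "mine"),
                  ("opening", "prep_under", "house")}

# Hypothetical exemplar fragments, as if extracted from a parsed corpus.
exemplars = [
    {("opening", "rcmod", "led"), ("led", "prep_to", "mine")},
    {("opening", "rcmod", "led"), ("led", "prep_to", "tunnel")},
    {("house", "prep_near", "mine")},
]

chosen = disambiguate([high_attachment, low_attachment], exemplars)
print(chosen)
```

Restricting the score to the k nearest exemplars keeps a large number of weakly similar fragments from outweighing a few close matches; the language-model-based measure mentioned in the text could be plugged in as an alternative to jaccard in the same scoring scheme.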

This project intends to contribute to the goals of the SFB by determining the effect of context on syntactic disambiguation; by investigating the different effects of contextual information acquired by means of shallow vs. deep analysis; by investigating the effect of contextual information on disambiguation quantitatively; by investigating the effect of language (English vs. German) on syntactic disambiguation; by investigating the learnability of contextual information from monolingual and multilingual corpora; and by showing that complex statistical models based on Exemplar Theory can provide meaningful linguistic explanations.