A robust system for the resolution of coreferential relations in text

The goal of the proposed project is to develop a robust system for the resolution of coreferential relations in text. Automatic, robust, domain-independent coreference resolution systems typically operate on top of, or in combination with, other NLP modules (such as a POS tagger, a chunker, or a full syntactic analyzer), which provide potential antecedents for a given nominal phrase. The task of the resolution system is to select the most likely antecedent for the current nominal phrase. This decision is typically based on linguistic features of the potential antecedents, such as number and gender information, grammatical role, the distance between the current phrase and the potential antecedent, and ontological information such as animacy. Note that using grammatical role as a source of information requires a shallow or full syntactic parser, and using ontological information requires a lexical resource such as WordNet.
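To make the selection step concrete, the following is a minimal sketch of choosing an antecedent from a list of candidates using the feature types named above (number, gender, grammatical role, distance, animacy). It is an illustration only: the `Mention` fields, the `score` and `resolve` functions, and the hand-set weights are all hypothetical and are not part of the proposed system, which is intended to learn such preferences from annotated corpora.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Mention:
    """A nominal phrase with illustrative features (all field names are assumptions)."""
    text: str
    position: int             # index of the phrase in document order
    number: str               # "sg" or "pl"
    gender: str               # "m", "f", "n", or "unknown"
    role: str                 # grammatical role, e.g. "subj", "obj", "other"
    animate: Optional[bool]   # ontological information; None if unknown


def score(anaphor: Mention, candidate: Mention) -> float:
    """Score a candidate antecedent with hand-set, purely illustrative weights."""
    s = 0.0
    if candidate.number == anaphor.number:
        s += 2.0
    # Only compare gender when it is known for both phrases.
    if "unknown" not in (candidate.gender, anaphor.gender):
        s += 2.0 if candidate.gender == anaphor.gender else -2.0
    # Only compare animacy when it is known for both phrases.
    if candidate.animate is not None and anaphor.animate is not None:
        s += 1.0 if candidate.animate == anaphor.animate else -1.0
    if candidate.role == "subj":
        s += 0.5
    # Penalize candidates that are further away from the anaphor.
    s -= 0.1 * (anaphor.position - candidate.position)
    return s


def resolve(anaphor: Mention, candidates: List[Mention]) -> Optional[Mention]:
    """Return the highest-scoring candidate that precedes the anaphor, or None."""
    preceding = [c for c in candidates if c.position < anaphor.position]
    if not preceding:
        return None
    return max(preceding, key=lambda c: score(anaphor, c))


# Toy example: "The minister met the journalists. She ..."
minister = Mention("the minister", 0, "sg", "f", "subj", True)
journalists = Mention("the journalists", 1, "pl", "unknown", "obj", True)
she = Mention("she", 2, "sg", "f", "subj", True)

best = resolve(she, [minister, journalists])
print(best.text if best else None)
```

In this toy example the agreement in number and gender outweighs the distance penalty, so the more distant phrase is preferred over the nearer one. In a corpus-based system the fixed weights would be replaced by a model trained on coreferentially annotated data.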

Two different directions can be taken in research on computational coreference resolution: a knowledge-based approach and a corpus-based approach. Our proposed approach is corpus-based. Corpus-based techniques have become increasingly popular for the resolution of coreferential relations, a development enabled by the creation of coreferentially annotated corpora such as MUC-6 and MUC-7. For Dutch, little research has been done so far on automatic coreference resolution, and the use of a corpus-based strategy for Dutch coreference resolution is still an unexplored research area. In this project, we plan to develop a resolution system based on machine learning techniques. By developing an automatic coreference resolution system for Dutch, we make this technology available for intelligent information processing systems that have to deal with Dutch text. We aim to build a machine learning system that is reusable in a wide range of applications, such as information extraction, question answering and summarization. By developing and evaluating our system in the context of realistic applications, we will ensure that the resulting system can be used to obtain real performance improvements.

Corpus Development

Corpora annotated with coreferential information are a prerequisite for the development and evaluation of any resolution system. In the current project, we hope to gain access to such corpora by reusing existing resources and by hand-annotating a limited amount of new, application-oriented material. As a side effect of this effort, general guidelines for coreference annotation will become available, as well as a tool for annotating coreference efficiently. Our aim is to develop a coreference resolution system whose performance is robust and accurate enough for use in applications. To evaluate this, we will also annotate a limited amount of application-specific data. In particular, we will develop a corpus that is representative of coreference in dialogue, and a corpus that is representative of information extraction (IE) and question answering (QA) tasks.
