A robust system for the resolution of coreferential relations in text
The goal of the proposed project is to develop a robust system for the
resolution of coreferential relations in text. Automatic, robust,
domain-independent coreference resolution systems typically operate
on top of, or in combination with, other NLP modules (such as a
POS-tagger, a chunker, or a full syntactic analyzer), which provide
potential antecedents for a given nominal phrase. The task of the
resolution system is to select the most likely antecedent for the
current nominal phrase. This decision is typically made on the basis
of linguistic features of the potential antecedents, such as number
and gender information, grammatical role, the distance between the
current phrase and the potential antecedent, and ontological
information, such as animacy. Note that using grammatical role as a
source of information requires a shallow or full syntactic parser, and
using ontological information requires a lexical resource such as
WordNet.
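
To make the feature description above concrete, the following is a minimal sketch of how a pair of nominal phrases could be turned into features and how an antecedent could be chosen from them. It is an illustration only, not the project's system: the `Mention` class, the feature names, and the hand-written selection rule in `select_antecedent` are all hypothetical, and the rule merely stands in for a decision that a trained classifier would make. The feature values are assumed to come from upstream modules such as a tagger, a parser, and a lexical resource.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Mention:
    """A nominal phrase with features assumed to come from upstream NLP modules."""
    text: str
    sentence: int             # index of the sentence containing the phrase
    number: str               # "sg" or "pl"
    gender: str               # "m", "f", "n", or "unknown"
    role: str                 # grammatical role, e.g. "subj" or "obj"
    animate: Optional[bool]   # None when the lexical resource has no entry


def pair_features(antecedent: Mention, anaphor: Mention) -> dict:
    """Describe a (candidate antecedent, current phrase) pair as a feature dict."""
    return {
        "number_agree": antecedent.number == anaphor.number,
        # Unknown gender or animacy is treated as compatible (an assumption).
        "gender_agree": "unknown" in (antecedent.gender, anaphor.gender)
        or antecedent.gender == anaphor.gender,
        "animacy_agree": antecedent.animate is None
        or anaphor.animate is None
        or antecedent.animate == anaphor.animate,
        "antecedent_is_subject": antecedent.role == "subj",
        "sentence_distance": anaphor.sentence - antecedent.sentence,
    }


def select_antecedent(candidates: List[Mention], anaphor: Mention) -> Optional[Mention]:
    """Hand-written stand-in for a learned model: keep candidates that agree
    in number, gender, and animacy, then prefer the nearest one, breaking
    ties in favour of subjects."""
    scored = []
    for candidate in candidates:
        f = pair_features(candidate, anaphor)
        if f["number_agree"] and f["gender_agree"] and f["animacy_agree"]:
            key = (f["sentence_distance"], not f["antecedent_is_subject"])
            scored.append((key, candidate))
    if not scored:
        return None
    return min(scored, key=lambda item: item[0])[1]


# Toy example: "De vrouw zag de hond. Zij ..."
vrouw = Mention("de vrouw", 0, "sg", "f", "subj", True)
hond = Mention("de hond", 0, "sg", "m", "obj", True)
zij = Mention("zij", 1, "sg", "f", "subj", True)
chosen = select_antecedent([vrouw, hond], zij)
```

In a corpus-based setting, the dictionaries returned by `pair_features` would serve as instances for a machine learning algorithm, with labels taken from a coreferentially annotated corpus, rather than being fed to a fixed rule as in this sketch.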
Two different directions can be taken in research on computational
coreference resolution: a knowledge-based approach and a corpus-based
approach. Our proposed approach is corpus-based. Corpus-based
techniques have become increasingly popular for the resolution of
coreferential relations, a development enabled by the creation of
coreferentially annotated corpora such as MUC-6 and MUC-7. For Dutch,
little research has been done so far on automatic coreference
resolution, and the use of a corpus-based strategy for Dutch
coreference resolution is still unexplored. In this project, we
plan to develop a resolution system based on machine learning
techniques. By developing an automatic coreference resolution system
for Dutch, we make this technology available for intelligent
information processing systems which have to deal with Dutch text. We
aim to build a machine learning system which is reusable in a wide
range of applications, such as information extraction, question
answering and summarization. By developing and evaluating our system
in the context of realistic applications, we will ensure that the
resulting system can be used to obtain real performance improvements.
Corpus Development
Corpora annotated with coreferential information are a prerequisite
for the development and evaluation of any resolution system. In the
current project, we hope to gain access to such corpora by reusing
existing resources and by hand-annotating a limited amount of new,
application-oriented material. As a side effect of this effort,
general guidelines for coreference annotation will become
available, as well as a tool for annotating coreference efficiently.
Our aim is to develop a coreference resolution system whose
performance is robust and accurate enough that it can be used in
applications. To evaluate this, we will also annotate a limited amount
of application-specific data. In particular, we will develop a corpus
that is representative of coreference in dialogue, and a
corpus that is representative of IE and QA tasks.