Optimization Issues in Machine Learning of Coreference Resolution
Véronique Hoste
Complete manuscript
pdf (1.2M)
ps (2.0M)
Title page
pdf (43K)
ps (331K)
Contents
-
Abstract
pdf (25K)
ps (424K)
-
Introduction
This thesis is about the automatic resolution of
coreference using machine learning techniques. It is a research area
which is becoming increasingly popular in natural language
processing (NLP) research and it is a key task in applications such
as machine translation, automatic summarization and information
extraction for which text understanding is of crucial
importance. When people communicate, they aim for cohesion. Text is
therefore "not just a string of sentences. It is not simply a large
grammatical unit, something of the same kind as a sentence, but
differing from it in size--a sort of supersentence, a semantic
unit" (Halliday and Hasan 1976, p. 291). Coreference, in which the
interpretation of an element in conversation depends on a previously
mentioned element, is one possible technique to achieve this
cohesion, a technique to construct that supersentence. Through the
use of shorter or alternative linguistic structures which refer to
previously mentioned elements in spoken or written text, coherent
communication can be achieved. Good text understanding depends
largely on the correct resolution of these coreferential relations.
In this introductory chapter, we provide a definition of
coreference and anaphora and discuss existing knowledge-based and
corpus-based approaches to the task of automatic coreference
resolution. The remainder of the chapter introduces the present
study, lists the central research objectives and gives an overview
of this thesis.
pdf (87K)
ps (460K)
-
Coreferentially annotated corpora
In the experiments reported in
this thesis, we use two inductive learning methods,
viz. memory-based learning and rule induction, to resolve
coreferential relations between nominal constituents. Since these
corpus-based methods depend on the quality of the corpora they are
trained on, we will discuss in this chapter the importance of
coreference annotation. Section 2.1 introduces the topic of
coreference annotation. In Section 2.2 and Section 2.3, we introduce
the two corpora we will use for our experiments: the well-known and
widely used MUC-6 and MUC-7 corpora for English and the newly
developed KNACK-2002 corpus for Dutch. Section 2.2 describes the
MUC-6 and MUC-7 annotation markup, the annotated relations and the
resulting training and test corpora. Section 2.3 has a similar setup
but focuses on the distinctive features of the KNACK-2002 annotation
guidelines. Section 2.4 discusses the problem of inter-annotator
agreement.
pdf (116K)
ps (471K)
-
Information sources
In supervised learning of
coreference resolution, one is given a training set containing
labeled instances. These instances consist of attribute/value pairs
which contain possibly disambiguating information for the
classifier, whose task it is to accurately predict the class of
novel instances. A good set of features is crucial for the success
of the resolution system. An ideal feature vector consists of
features which are all highly informative and which can lead the
classifier to optimal performance. This implies that irrelevant
features should be avoided, since the learner can have difficulty in
distinguishing them from the relevant features when making
predictions. Furthermore, it is important to keep the attribute
noise as low as possible, since errors in the feature vector can
heavily affect the predictions.
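
As an illustration of what such labeled instances might look like, the
following is a minimal sketch that pairs each noun phrase with every
preceding noun phrase and derives a small attribute/value vector plus a
class label. The Mention fields, the four features and the POS/NEG
labels are assumptions chosen for the example; they are not the feature
set used in the thesis.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Mention:
    text: str              # surface string of the noun phrase
    sentence: int          # index of the sentence containing it
    kind: str              # hypothetical NP type: "pronoun", "proper", "common"
    chain: Optional[int]   # gold coreference chain id, None if unannotated


def make_instance(antecedent, anaphor):
    """Build one attribute/value vector and its class label (illustrative features)."""
    features = {
        "sent_dist": anaphor.sentence - antecedent.sentence,
        "string_match": antecedent.text.lower() == anaphor.text.lower(),
        "anaphor_kind": anaphor.kind,
        "antecedent_kind": antecedent.kind,
    }
    same_chain = anaphor.chain is not None and anaphor.chain == antecedent.chain
    return features, "POS" if same_chain else "NEG"


def make_instances(mentions):
    """Pair every mention with each preceding mention, in document order."""
    instances = []
    for j, anaphor in enumerate(mentions):
        for antecedent in mentions[:j]:
            instances.append(make_instance(antecedent, anaphor))
    return instances
```

Pairing every mention with all preceding mentions yields far more
negative than positive instances, which is the skewed class
distribution discussed in Chapter 7.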
This chapter deals with the
problem of the selection of information sources for the resolution
of coreferential relations. The first section (3.1) discusses the
preparation of the data sets. We describe the different
preprocessing steps that were taken for the construction of the
training and test corpora. We briefly mention the problem of the
selection of positive and negative instances and the related problem
of the skewed class distributions (Chapter 7 will extensively deal
with the problem of highly skewed training data). In 3.1.3 we
explore the use of three different data sets, viz. one for the
pronouns, one for the named entities and a third data set for the
other noun phrases, instead of one single data set. Section 3.2
gives an overview of the information sources which have been used in
other work on coreference resolution. In this overview, we focus on
the shallow information sources which can easily be computed. We
continue with a description of the different features which we used
for our experiments.
pdf (174K)
ps (555K)
-
Machine learning of coreference resolution
We will
continue in this chapter with a description of the machine learners
which operate on the basis of the feature vectors explained in the
previous chapter.
This chapter consists of two main parts. The
first three sections introduce the term `bias' and the two machine
learning packages which we will use in our experiments: the
memory-based learning package TIMBL, and the rule induction package
RIPPER. In the second part, Section 4.4, we describe the general
setup of our experiments, discuss the different classifier
performance measures, and apply the two methods to the MUC-6, MUC-7
and KNACK-2002 validation data sets.
pdf (167K)
ps (535K)
-
Selecting the optimal information sources and algorithm settings
In the previous chapters we paved the way for our
coreference resolution system. We constructed features which we
believe to be helpful in disambiguating between coreferential and
non-coreferential relations and we selected two machine learning
approaches to experiment with. Furthermore, we ran an initial
experiment with our coreference resolution system. In this chapter
and the following chapter on genetic algorithms, we will discuss
some methodological issues involved in running a machine learning
(of language) experiment. We will show empirically that current
methodology in comparative machine learning of language literature
often leads to methodologically debatable results. In this chapter,
we consider at length the importance of feature selection and the
importance of the optimization of the algorithm parameters and we
apply both optimization passes to our coreference resolution data
sets.
pdf (209K)
ps (742K)
-
Genetic algorithms for optimization
In the previous chapter,
we showed that a proper comparative experiment requires extensive
optimization and that the performance increase obtained by this
optimization is considerable. In the feature selection experiments,
we could observe the large effect feature selection can have on
classifier performance. And in the parameter optimization
experiments, we observed large deviations which confirm the
necessity of parameter optimization. In these previous experiments,
we explored feature selection while keeping the parameters constant
and we explored parameter optimization while keeping the feature
vector unchanged. We did not consider the interaction between
feature selection and parameter optimization.
We now proceed to the next optimization step: a set of experiments
that perform joint feature selection and parameter optimization. Joint
feature selection and parameter optimization is essentially an
optimization problem which involves searching the space of all
possible feature subsets and parameter settings to identify the
combination that is optimal or near-optimal. Due to the
combinatorially explosive nature of this type of experiment, a
computationally feasible way of optimization has to be found. This
chapter investigates the use of a wrapper-based approach to feature
selection using a genetic algorithm in conjunction with our two
learning methods, TIMBL and RIPPER. In Section 6.1, we give an
introduction to genetic algorithms. Section 6.2 discusses the
implementation details for running the experiments and gives
experimental results on the three data sets. We conclude this
chapter with a summary and discussion.
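
To make the idea of a joint search concrete, here is a minimal sketch
of a genetic algorithm whose individuals encode both a feature subset
(a bit mask) and one parameter value. Everything in it is an assumption
made for illustration: the number of features, the K_VALUES list, the
genetic operators and above all toy_fitness, which merely stands in for
the cross-validated score that a wrapper would obtain by training TIMBL
or RIPPER. It does not reproduce the setup of the thesis.

```python
import random

N_FEATURES = 6             # hypothetical feature count
K_VALUES = [1, 3, 5, 7]    # hypothetical values for a single learner parameter


def random_individual(rng):
    """An individual is (feature bit mask, parameter value)."""
    mask = [rng.random() < 0.5 for _ in range(N_FEATURES)]
    return mask, rng.choice(K_VALUES)


def toy_fitness(individual):
    """Stand-in for a cross-validated score; pretends features 0-2 help and 3-5 hurt."""
    mask, k = individual
    return sum(mask[:3]) - 0.5 * sum(mask[3:]) - 0.1 * abs(k - 3)


def crossover(a, b, rng):
    """One-point crossover on the mask; the parameter comes from either parent."""
    point = rng.randrange(1, N_FEATURES)
    return a[0][:point] + b[0][point:], rng.choice([a[1], b[1]])


def mutate(individual, rng, rate=0.1):
    """Flip each mask bit and resample the parameter with probability `rate`."""
    mask = [(not bit) if rng.random() < rate else bit for bit in individual[0]]
    k = rng.choice(K_VALUES) if rng.random() < rate else individual[1]
    return mask, k


def tournament(population, fitness, rng, size=3):
    """Return the fittest of `size` randomly drawn individuals."""
    return max(rng.sample(population, size), key=fitness)


def evolve(fitness, generations=30, pop_size=20, seed=0):
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(pop_size)]
    for _ in range(generations):
        children = [max(population, key=fitness)]   # elitism: keep the current best
        while len(children) < pop_size:
            parent_a = tournament(population, fitness, rng)
            parent_b = tournament(population, fitness, rng)
            children.append(mutate(crossover(parent_a, parent_b, rng), rng))
        population = children
    return max(population, key=fitness)
```

In a real wrapper, each fitness evaluation trains and tests a
classifier, so the number of evaluations (population size times
generations) dominates the running time.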
pdf (142K)
ps (567K)
-
The problem of imbalanced data sets
A general goal
of classifier learning is to learn a model on the basis of training
data which makes as few errors as possible when classifying
previously unseen test data. Many factors can affect the success of
a classifier: the specific `bias' of the classifier, the selection
and the size of the data set, the choice of algorithm parameters,
the selection and representation of information sources and the
possible interaction between all these factors. In the previous
chapters, we experimentally showed for the eager learner RIPPER and
the lazy learner TIMBL that the performance differences due to
algorithm parameter optimization, feature selection, and the
interaction between the two easily overwhelm the performance
differences between the two algorithms in their default
representation. We showed how we improved their performance by
optimizing their algorithmic settings and by selecting the most
informative information sources.
In this chapter, our focus shifts away from the feature handling
level and the algorithmic level to the sample selection level. We
investigate whether performance is hindered by the imbalanced class
distribution in our data sets and we explore different strategies to
cope with this skew. In Section 7.1, we introduce the problem of
learning
from imbalanced data. In the two following sections, we discuss
different strategies for dealing with skewed class distributions. In
Section 7.2, we discuss several proposals made in the machine
learning literature for dealing with skewed data. In Section 7.3, we
narrow our scope to the problem of class imbalances when learning
coreference resolution. In the remainder of the chapter, we focus on
our experiments for handling the class imbalances in the MUC-6,
MUC-7 and KNACK-2002 data sets.
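
One commonly used strategy for skewed data is random down-sampling of
the majority class. The sketch below is only an illustration of that
general idea and is not claimed to be the procedure used in the thesis;
the function name, the POS/NEG labels and the ratio parameter are
assumptions.

```python
import random


def downsample_negatives(instances, ratio=1.0, seed=0):
    """Keep all positives and at most ratio * (number of positives) random negatives.

    `instances` is a list of (features, label) pairs with labels "POS" or "NEG".
    """
    rng = random.Random(seed)
    positives = [inst for inst in instances if inst[1] == "POS"]
    negatives = [inst for inst in instances if inst[1] == "NEG"]
    n_keep = min(len(negatives), int(ratio * len(positives)))
    sample = positives + rng.sample(negatives, n_keep)
    rng.shuffle(sample)   # avoid presenting all positives first
    return sample
```

With ratio=1.0 the result is balanced; larger ratios keep more of the
original negative instances.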
pdf (181K)
ps (718K)
-
Testing
In all
previous chapters, we reported cross-validation results on the
training data. Defining the anaphora resolution process as a
classification problem, however, involves the use of a two-step
procedure. In a first step, the classifier (in our case TIMBL or
RIPPER) decides on the basis of the information learned from the
training set whether the combination of a given anaphor and its
candidate antecedent in the test set is classified as a
coreferential link. Because each NP in the test set is linked with
several preceding NPs, a single anaphor can be linked to more than
one antecedent, each of which can in turn refer to multiple
antecedents, and so on. Therefore, a second step
is taken, which involves the selection of one coreferential link per
anaphor.
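
The second step can be sketched as follows, under stated assumptions:
the classifier output is taken to be a list of (anaphor, antecedent,
score) triples for the pairs it classified as coreferential, mentions
are identified by their position in the document, and the selection
rule (highest score, ties broken in favour of the closer antecedent) is
a hypothetical choice made for this example rather than the rule used
in the thesis.

```python
def select_links(positive_pairs):
    """Keep one antecedent per anaphor from (anaphor, antecedent, score) triples."""
    best = {}
    for anaphor, antecedent, score in positive_pairs:
        current = best.get(anaphor)
        # Prefer the higher score; on ties prefer the closer (larger index) antecedent.
        if current is None or (score, antecedent) > (current[1], current[0]):
            best[anaphor] = (antecedent, score)
    return {anaphor: antecedent for anaphor, (antecedent, _) in best.items()}


def build_chains(links):
    """Group mentions into chains by following the selected links.

    Assumes every antecedent index is smaller than its anaphor index.
    """
    chain_of = {}   # mention index -> position of its chain in `chains`
    chains = []
    for anaphor in sorted(links):
        antecedent = links[anaphor]
        if antecedent not in chain_of:
            chain_of[antecedent] = len(chains)
            chains.append([antecedent])
        chain_of[anaphor] = chain_of[antecedent]
        chains[chain_of[anaphor]].append(anaphor)
    return chains
```

Because each anaphor keeps exactly one link and antecedents always
precede their anaphors, the selected links form a forest and no chain
merging is needed.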
In the previous chapters, we focused on the first step
by trying to reach the optimal result through feature selection,
algorithm parameter optimization and different sampling
techniques. In this chapter, we move away from the instance level
and concentrate on the coreferential chains. This requires a new
experimental setup (Section 8.1) with a new evaluation procedure
(Section 8.2). In Section 8.3, we report the results of TIMBL and
RIPPER on the different data sets. Section 8.4 describes the main
observations from a qualitative error analysis on a selection of
English and Dutch documents.
pdf (169K)
ps (545K)
-
Conclusion
In this thesis, we
presented a machine learning approach to the resolution of
coreferential relations between nominal constituents in Dutch. It is
the first automatic resolution approach proposed for this
language. In order to enable a corpus-based strategy, we first
annotated a corpus of Dutch news magazine text, KNACK-2002, with
coreferential information for pronominal, proper noun and common
noun coreferences. A separate learning module was built for each of
these NP types. The main motivation for this approach was that the
information sources which are important for the resolution of the
coreferential relations differ per NP type. This approach was not
only applied to Dutch, for which no comparative results are yet
available, but also to the well-known English MUC-6 and MUC-7 data
sets.
Coreference and the task of coreference resolution were the
main points of interest in Chapters 2 and 3 and in Chapter 8 on
testing. In the chapters in between, we focused on the
methodological issues which arise when performing a machine learning
of coreference resolution experiment, or more broadly, a machine
learning of language experiment. In the following two sections, we
discuss the main observations from the research questions formulated
in Section 1.3.
pdf (79K)
ps (451K)
-
References
pdf (76K)
ps (461K)
Appendixes
-
Manual for annotation of coreferences in Dutch newspaper text
pdf (163K)
ps (511K)
-
Ripper rules for the MUC-6 "Proper nouns" data set
pdf (28K)
ps (423K)
-
Three MUC-7 documents for which a qualitative error analysis has been
carried out
pdf (64K)
ps (469K)
-
Three KNACK-2002 documents for which a qualitative error analysis has
been carried out
pdf (30K)
ps (422K)