Language-Independent Named Entity Recognition (II)

Named entities are phrases that contain the names of persons, organizations, locations, times and quantities. Example:

[ORG U.N. ] official [PER Ekeus ] heads for [LOC Baghdad ] .

The shared task of CoNLL-2003 concerns language-independent named entity recognition. We will concentrate on four types of named entities: persons, locations, organizations and names of miscellaneous entities that do not belong to the previous three groups. Participants will be offered training and test data for two languages, which they will use to develop a named entity recognition system that includes a machine learning component. For each language, additional information (lists of names and non-annotated data) will be supplied as well. The challenge for the participants is to find ways of incorporating this information into their systems.

Background information

Named Entity Recognition (NER) is a subtask of Information Extraction. Different NER systems were evaluated as part of the Sixth Message Understanding Conference in 1995 (MUC6). The target language was English. The participating systems performed well, but many of them relied on language-specific resources for performing the task, and it is unknown how they would have performed on a language other than English [PD97].

Since 1995, NER systems have been developed for some European languages and a few Asian languages. There have been at least two studies that applied one NER system to different languages. Palmer and Day [PD97] used statistical methods to find named entities in newswire articles in Chinese, English, French, Japanese, Portuguese and Spanish. They found that the difficulty of the NER task differed across the six languages, but that a large part of it could be performed with simple methods. Cucerzan and Yarowsky [CY99] used both morphological and contextual clues to identify named entities in English, Greek, Hindi, Romanian and Turkish. With minimal supervision, they obtained overall F measures between 40 and 70, depending on the language. In the shared task at CoNLL-2002, twelve different learning systems were applied to data in Spanish and Dutch.

Software and Data

The CoNLL-2003 shared task data files contain four columns separated by a single space. Each word has been put on a separate line and there is an empty line after each sentence. The first item on each line is a word, the second a part-of-speech (POS) tag, the third a syntactic chunk tag and the fourth the named entity tag. The chunk tags and the named entity tags have the format I-TYPE, which means that the word is inside a phrase of type TYPE. Only when two phrases of the same type immediately follow each other does the first word of the second phrase receive the tag B-TYPE, to show that it starts a new phrase. A word with tag O is not part of a phrase. Here is an example:

   U.N.         NNP  I-NP  I-ORG 
   official     NN   I-NP  O 
   Ekeus        NNP  I-NP  I-PER 
   heads        VBZ  I-VP  O 
   for          IN   I-PP  O 
   Baghdad      NNP  I-NP  I-LOC 
   .            .    O     O 
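
Because entities normally start with I-TYPE and B-TYPE only appears between adjacent phrases of the same type, collecting entity spans requires a little bookkeeping. The following is a minimal sketch (Python; the helper names are illustrative and not part of the official distribution) of reading the column format and extracting entity spans from the named entity column:

   # Minimal sketch for reading the column format described above.
   # The named entity tag is always the last column, so the same reader
   # also works for the German files, which add a lemma column.
   def read_sentences(path):
       """Yield sentences as lists of (word, ne_tag) pairs."""
       sentence = []
       with open(path, encoding="utf-8") as f:
           for line in f:
               line = line.strip()
               if not line:                      # empty line ends a sentence
                   if sentence:
                       yield sentence
                   sentence = []
               else:
                   columns = line.split()
                   sentence.append((columns[0], columns[-1]))
       if sentence:
           yield sentence

   def entity_spans(tags):
       """Return (start, end, type) spans, end exclusive.

       A new entity starts at B-TYPE, or at I-TYPE when the previous token
       is not inside an entity of the same type.
       """
       spans, start, current = [], None, None
       for i, tag in enumerate(tags):
           if tag == "O":
               if current is not None:
                   spans.append((start, i, current))
                   current = None
           else:
               prefix, etype = tag.split("-", 1)
               if current is None or etype != current or prefix == "B":
                   if current is not None:
                       spans.append((start, i, current))
                   start, current = i, etype
       if current is not None:
           spans.append((start, len(tags), current))
       return spans

   # The example sentence above yields one ORG, one PER and one LOC span:
   tags = ["I-ORG", "O", "I-PER", "O", "O", "I-LOC", "O"]
   print(entity_spans(tags))   # [(0, 1, 'ORG'), (2, 3, 'PER'), (5, 6, 'LOC')]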

The data consists of three files per language: one training file and two test files, testa and testb. The first test file will be used in the development phase for finding good parameters for the learning system. The second test file will be used for the final evaluation. Data files are available for English and German. The German files contain an extra column (the second), which holds the lemma of each word.

The English data is a collection of news wire articles from the Reuters Corpus. The annotation was done by people at the University of Antwerp. For copyright reasons, we only make the annotations available. In order to build the complete data sets, you will need access to the Reuters Corpus, which can be obtained free of charge for research purposes from NIST.

The German data is a collection of articles from the Frankfurter Rundschau. The named entities have been annotated by people at the University of Antwerp. Only the annotations are available here. In order to build these data sets, you need access to the ECI Multilingual Text Corpus, which can be ordered from the Linguistic Data Consortium (2003 non-member price: US$ 35.00).

Results

Sixteen systems participated in the CoNLL-2003 shared task. They used a wide variety of machine learning techniques and different feature sets. Here are the result tables for the English and German test sets:

   +------------+-----------+---------+-----------+
   | English    | precision |  recall |     F     |
   +------------+-----------+---------+-----------+
   | [FIJZ03]   |  88.99%   |  88.54% | 88.76±0.7 |
   | [CN03]     |  88.12%   |  88.51% | 88.31±0.7 |
   | [KSNM03]   |  85.93%   |  86.21% | 86.07±0.8 |
   | [ZJ03]     |  86.13%   |  84.88% | 85.50±0.9 |
   | [CMP03b]   |  84.05%   |  85.96% | 85.00±0.8 |
   | [CC03]     |  84.29%   |  85.50% | 84.89±0.9 |
   | [MMP03]    |  84.45%   |  84.90% | 84.67±1.0 |
   | [CMP03a]   |  85.81%   |  82.84% | 84.30±0.9 |
   | [ML03]     |  84.52%   |  83.55% | 84.04±0.9 |
   | [BON03]    |  84.68%   |  83.18% | 83.92±1.0 |
   | [MLP03]    |  80.87%   |  84.21% | 82.50±1.0 |
   | [WNC03]*   |  82.02%   |  81.39% | 81.70±0.9 |
   | [WP03]     |  81.60%   |  78.05% | 79.78±1.0 |
   | [HV03]     |  76.33%   |  80.17% | 78.20±1.0 |
   | [DD03]     |  75.84%   |  78.13% | 76.97±1.2 |
   | [Ham03]    |  69.09%   |  53.26% | 60.15±1.3 |
   +------------+-----------+---------+-----------+
   | baseline   |  71.91%   |  50.90% | 59.61±1.2 |
   +------------+-----------+---------+-----------+

   +------------+-----------+---------+-----------+
   | German     | precision |  recall |     F     |
   +------------+-----------+---------+-----------+
   | [FIJZ03]   |  83.87%   |  63.71% | 72.41±1.3 |
   | [KSNM03]   |  80.38%   |  65.04% | 71.90±1.2 |
   | [ZJ03]     |  82.00%   |  63.03% | 71.27±1.5 |
   | [MMP03]    |  75.97%   |  64.82% | 69.96±1.4 |
   | [CMP03b]   |  75.47%   |  63.82% | 69.15±1.3 |
   | [BON03]    |  74.82%   |  63.82% | 68.88±1.3 |
   | [CC03]     |  75.61%   |  62.46% | 68.41±1.4 |
   | [ML03]     |  75.97%   |  61.72% | 68.11±1.4 |
   | [MLP03]    |  69.37%   |  66.21% | 67.75±1.4 |
   | [CMP03a]   |  77.83%   |  58.02% | 66.48±1.5 |
   | [WNC03]    |  75.20%   |  59.35% | 66.34±1.3 |
   | [CN03]     |  76.83%   |  57.34% | 65.67±1.4 |
   | [HV03]     |  71.15%   |  56.55% | 63.02±1.4 |
   | [DD03]     |  63.93%   |  51.86% | 57.27±1.6 |
   | [WP03]     |  71.05%   |  44.11% | 54.43±1.4 |
   | [Ham03]    |  63.49%   |  38.25% | 47.74±1.5 |
   +------------+-----------+---------+-----------+
   | baseline   |  31.86%   |  28.89% | 30.30±1.3 |
   +------------+-----------+---------+-----------+
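
The F column reports the F score with beta=1, the harmonic mean of precision and recall: F = 2 * precision * recall / (precision + recall). The ± values are the error margins reported with the scores; see the introduction paper [TD03] for how they were obtained. As a quick sanity check of the formula against the top English row, here is a small sketch (Python):

   # F with beta = 1 is the harmonic mean of precision and recall.
   def f1(precision, recall):
       return 2 * precision * recall / (precision + recall)

   # Top English entry [FIJZ03]: precision 88.99, recall 88.54.
   print(round(f1(88.99, 88.54), 2))   # 88.76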

A discussion of these results can be found in the introduction paper of the shared task [TD03].

References

Other related publications

A paper related to the topic of this shared task is the EMNLP-99 paper by Cucerzan and Yarowsky [CY99]. Interesting papers about using unannotated data, though not for named entity recognition, are those of Mitchell [Mit99] and Banko and Brill [BB01].


Last update: December 05, 2005. erik.tjongkimsang@ua.ac.be, fien.demeulder@ua.ac.be