Language-Independent Named Entity Recognition (II)

Named entities are phrases that contain the names of persons, organizations, locations, times and quantities. Example:

[ORG U.N. ] official [PER Ekeus ] heads for [LOC Baghdad ] .

The shared task of CoNLL-2003 concerns language-independent named entity recognition. We will concentrate on four types of named entities: persons, locations, organizations and names of miscellaneous entities that do not belong to the previous three groups. Participants will be offered training and test data for two languages, which they will use to develop a named entity recognition system that includes a machine learning component. For each language, additional information (lists of names and non-annotated data) will be supplied as well. The challenge for the participants is to find ways of incorporating this information into their systems.

Background information

Named Entity Recognition (NER) is a subtask of Information Extraction. Different NER systems were evaluated as part of the Sixth Message Understanding Conference in 1995 (MUC6). The target language was English. The participating systems performed well, but many of them relied on language-specific resources for performing the task, and it is unknown how they would have performed on a language other than English [PD97].

Since 1995, NER systems have been developed for some European languages and a few Asian languages. There have been at least two studies that applied one NER system to different languages. Palmer and Day [PD97] used statistical methods to find named entities in newswire articles in Chinese, English, French, Japanese, Portuguese and Spanish. They found that the difficulty of the NER task differed across the six languages, but that a large part of it could be performed with simple methods. Cucerzan and Yarowsky [CY99] used both morphological and contextual clues to identify named entities in English, Greek, Hindi, Romanian and Turkish. With minimal supervision, they obtained overall F measures between 40 and 70, depending on the language. In the shared task at CoNLL-2002, twelve different learning systems were applied to data in Spanish and Dutch.

Software and Data

The CoNLL-2003 shared task data files contain four columns separated by a single space. Each word has been put on a separate line and there is an empty line after each sentence. The first item on each line is a word, the second a part-of-speech (POS) tag, the third a syntactic chunk tag and the fourth the named entity tag. The chunk tags and the named entity tags have the format I-TYPE, which means that the word is inside a phrase of type TYPE. Only when two phrases of the same type immediately follow each other does the first word of the second phrase receive the tag B-TYPE, to show that it starts a new phrase. A word with tag O is not part of a phrase. Here is an example:

   U.N.         NNP  I-NP  I-ORG 
   official     NN   I-NP  O 
   Ekeus        NNP  I-NP  I-PER 
   heads        VBZ  I-VP  O 
   for          IN   I-PP  O 
   Baghdad      NNP  I-NP  I-LOC 
   .            .    O     O 
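
Because entities normally start with I-TYPE and B-TYPE only appears between adjacent phrases of the same type, collecting entity spans requires a little bookkeeping. The following is a minimal sketch (Python; the helper names are illustrative and not part of the official distribution) of reading the column format and extracting entity spans from the named entity column:

   # Minimal sketch for reading the column format described above.
   # The named entity tag is always the last column, so the same reader
   # also works for the German files, which add a lemma column.
   def read_sentences(path):
       """Yield sentences as lists of (word, ne_tag) pairs."""
       sentence = []
       with open(path, encoding="utf-8") as f:
           for line in f:
               line = line.strip()
               if not line:                      # empty line ends a sentence
                   if sentence:
                       yield sentence
                   sentence = []
               else:
                   columns = line.split()
                   sentence.append((columns[0], columns[-1]))
       if sentence:
           yield sentence

   def entity_spans(tags):
       """Return (start, end, type) spans, end exclusive.

       A new entity starts at B-TYPE, or at I-TYPE when the previous token
       is not inside an entity of the same type.
       """
       spans, start, current = [], None, None
       for i, tag in enumerate(tags):
           if tag == "O":
               if current is not None:
                   spans.append((start, i, current))
                   current = None
           else:
               prefix, etype = tag.split("-", 1)
               if current is None or etype != current or prefix == "B":
                   if current is not None:
                       spans.append((start, i, current))
                   start, current = i, etype
       if current is not None:
           spans.append((start, len(tags), current))
       return spans

   # The example sentence above yields one ORG, one PER and one LOC span:
   tags = ["I-ORG", "O", "I-PER", "O", "O", "I-LOC", "O"]
   print(entity_spans(tags))   # [(0, 1, 'ORG'), (2, 3, 'PER'), (5, 6, 'LOC')]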

The data consists of three files per language: one training file and two test files, testa and testb. The first test file will be used in the development phase for finding good parameters for the learning system. The second test file will be used for the final evaluation. Data files are available for English and German. The German files contain an extra column (the second), which holds the lemma of each word.

The English data is a collection of news wire articles from the Reuters Corpus. The annotation was done by people at the University of Antwerp. For copyright reasons, we only make the annotations available. In order to build the complete data sets, you will need access to the Reuters Corpus, which can be obtained free of charge for research purposes from NIST.

The German data is a collection of articles from the Frankfurter Rundschau. The named entities have been annotated by people at the University of Antwerp. Only the annotations are available here. In order to build these data sets, you need access to the ECI Multilingual Text Corpus, which can be ordered from the Linguistic Data Consortium (2003 non-member price: US$ 35.00).

Results

Sixteen systems participated in the CoNLL-2003 shared task. They used a wide variety of machine learning techniques and different feature sets. Here are the result tables for the English and German test sets:

   +------------+-----------+---------+-----------+
   | English    | precision |  recall |     F     |
   +------------+-----------+---------+-----------+
   | [FIJZ03]   |  88.99%   |  88.54% | 88.76±0.7 |
   | [CN03]     |  88.12%   |  88.51% | 88.31±0.7 |
   | [KSNM03]   |  85.93%   |  86.21% | 86.07±0.8 |
   | [ZJ03]     |  86.13%   |  84.88% | 85.50±0.9 |
   | [CMP03b]   |  84.05%   |  85.96% | 85.00±0.8 |
   | [CC03]     |  84.29%   |  85.50% | 84.89±0.9 |
   | [MMP03]    |  84.45%   |  84.90% | 84.67±1.0 |
   | [CMP03a]   |  85.81%   |  82.84% | 84.30±0.9 |
   | [ML03]     |  84.52%   |  83.55% | 84.04±0.9 |
   | [BON03]    |  84.68%   |  83.18% | 83.92±1.0 |
   | [MLP03]    |  80.87%   |  84.21% | 82.50±1.0 |
   | [WNC03]*   |  82.02%   |  81.39% | 81.70±0.9 |
   | [WP03]     |  81.60%   |  78.05% | 79.78±1.0 |
   | [HV03]     |  76.33%   |  80.17% | 78.20±1.0 |
   | [DD03]     |  75.84%   |  78.13% | 76.97±1.2 |
   | [Ham03]    |  69.09%   |  53.26% | 60.15±1.3 |
   +------------+-----------+---------+-----------+
   | baseline   |  71.91%   |  50.90% | 59.61±1.2 |
   +------------+-----------+---------+-----------+

   +------------+-----------+---------+-----------+
   | German     | precision |  recall |     F     |
   +------------+-----------+---------+-----------+
   | [FIJZ03]   |  83.87%   |  63.71% | 72.41±1.3 |
   | [KSNM03]   |  80.38%   |  65.04% | 71.90±1.2 |
   | [ZJ03]     |  82.00%   |  63.03% | 71.27±1.5 |
   | [MMP03]    |  75.97%   |  64.82% | 69.96±1.4 |
   | [CMP03b]   |  75.47%   |  63.82% | 69.15±1.3 |
   | [BON03]    |  74.82%   |  63.82% | 68.88±1.3 |
   | [CC03]     |  75.61%   |  62.46% | 68.41±1.4 |
   | [ML03]     |  75.97%   |  61.72% | 68.11±1.4 |
   | [MLP03]    |  69.37%   |  66.21% | 67.75±1.4 |
   | [CMP03a]   |  77.83%   |  58.02% | 66.48±1.5 |
   | [WNC03]    |  75.20%   |  59.35% | 66.34±1.3 |
   | [CN03]     |  76.83%   |  57.34% | 65.67±1.4 |
   | [HV03]     |  71.15%   |  56.55% | 63.02±1.4 |
   | [DD03]     |  63.93%   |  51.86% | 57.27±1.6 |
   | [WP03]     |  71.05%   |  44.11% | 54.43±1.4 |
   | [Ham03]    |  63.49%   |  38.25% | 47.74±1.5 |
   +------------+-----------+---------+-----------+
   | baseline   |  31.86%   |  28.89% | 30.30±1.3 |
   +------------+-----------+---------+-----------+
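
The F column reports the F score with beta=1, the harmonic mean of precision and recall: F = 2 * precision * recall / (precision + recall). The ± values are the error margins reported with the scores; see the introduction paper [TD03] for how they were obtained. As a quick sanity check of the formula against the top English row, here is a small sketch (Python):

   # F with beta = 1 is the harmonic mean of precision and recall.
   def f1(precision, recall):
       return 2 * precision * recall / (precision + recall)

   # Top English entry [FIJZ03]: precision 88.99, recall 88.54.
   print(round(f1(88.99, 88.54), 2))   # 88.76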

A discussion of these results can be found in the introduction paper of the shared task [TD03].

References

Other related publications

A paper related to the topic of this shared task is the EMNLP-99 paper by Cucerzan and Yarowsky [CY99]. Interesting papers about using unannotated data, though not for named entity recognition, are those of Mitchell [Mit99] and Banko and Brill [BB01].


Last update: December 05, 2005. erik.tjongkimsang@ua.ac.be, fien.demeulder@ua.ac.be