These files contain the train and test data for the three parts of
the CoNLL-2002 shared task:

   esp.testa: Spanish test data for the development stage
   esp.testb: Spanish test data
   esp.train: Spanish train data
   ned.testa: Dutch test data for the development stage
   ned.testb: Dutch test data
   ned.train: Dutch train data

All data files contain a single word per line with its associated
named entity tag in the IOB2 format (Tjong Kim Sang and Veenstra,
EACL 1999). Sentence breaks are encoded by empty lines. Additionally,
the Dutch data contains non-checked part-of-speech tags generated by
the MBT tagger (Daelemans et al., WVLC 1996). In the Dutch data,
article boundaries have been marked by a special tag (-DOCSTART-).

Associated url: http://lcg-www.uia.ac.be/conll2002/ner/

NOTES

* Files in these directories may only be used for research
  applications in the context of the CoNLL-2002 shared task.
  No permission is given for usage in other applications, especially
  not for commercial applications.

* Some redundant empty lines were removed from the Spanish data
  files on May 1, 2002. The extra empty lines had no effect on the
  evaluation results.

* An extra checkup round has been applied to the Dutch data files,
  and these were replaced by new versions on August 22, 2002. The
  original Dutch files, which were used by the participants of
  CoNLL-2002, can be found in the subdirectory OldFiles.

* Note that for copyright reasons the sentences in the Dutch files
  have been randomized within each article. Your system can rely on
  sentences between two article boundaries belonging to the same
  article, but it should not rely on first occurrences of entities.

* Xavier Carreras provides the Spanish data sets with part-of-speech
  tags at http://www.lsi.upc.es/~nlp/tools/nerc/nerc.html (20030803)

* Inconsistencies in the named entity annotation can be reported to
  Erik Tjong Kim Sang.
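
The file layout described above (one word per line with its named
entity tag, empty lines between sentences, -DOCSTART- lines at article
boundaries in the Dutch data) could be read with a short script along
the following lines. This is only a minimal sketch, not part of the
shared task software. It assumes that columns are separated by
whitespace, that the word is the first column and the named entity tag
the last one, and that a -DOCSTART- line can simply be skipped. The
names read_sentences, iob2_spans and SAMPLE are hypothetical, and the
sample lines are invented for illustration rather than taken from the
data files.

```python
from typing import Iterable, Iterator, List, Tuple

Token = Tuple[str, str]  # (word, named entity tag)


def read_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield sentences as lists of (word, tag) pairs.

    Assumption: whitespace-separated columns, word first, named entity
    tag last. An optional part-of-speech column in between is ignored.
    """
    sentence: List[Token] = []
    for line in lines:
        fields = line.split()
        # An empty line or a -DOCSTART- marker closes the current sentence.
        if not fields or fields[0] == "-DOCSTART-":
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append((fields[0], fields[-1]))
    if sentence:
        yield sentence


def iob2_spans(sentence: List[Token]) -> List[Tuple[str, str]]:
    """Collect (entity type, entity text) pairs from IOB2 tags.

    In IOB2 every entity starts with a B- tag and continues with I- tags.
    An I- tag that does not continue the current entity is leniently
    treated as the start of a new one.
    """
    spans: List[Tuple[str, str]] = []
    words: List[str] = []
    kind = ""
    for word, tag in sentence:
        continues = tag.startswith("I-") and bool(words) and tag[2:] == kind
        if continues:
            words.append(word)
            continue
        # Any other tag ends the entity collected so far, if there is one.
        if words:
            spans.append((kind, " ".join(words)))
            words, kind = [], ""
        if tag.startswith(("B-", "I-")):
            words, kind = [word], tag[2:]
    if words:
        spans.append((kind, " ".join(words)))
    return spans


# Invented sample: a two-column sentence followed by a three-column
# sentence (word, part-of-speech tag, entity tag) after a -DOCSTART- line.
SAMPLE = """\
Maria B-PER
Lopez I-PER
visita O
Madrid B-LOC
. O

-DOCSTART- -DOCSTART- O
Jan N B-PER
Smit N I-PER
woont V O
in Prep O
Gent N B-LOC
. Punc O
"""

for sent in read_sentences(SAMPLE.splitlines()):
    print(iob2_spans(sent))
```

To process a real file, the lines of an opened data file can be passed
to read_sentences instead of SAMPLE.splitlines(). The character
encoding of the files is not stated here, so it has to be chosen when
opening them.
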

ACKNOWLEDGEMENTS

The Spanish data is a collection of news wire articles made available
by the Spanish EFE News Agency. The articles are from May 2000. The
annotation was carried out by the TALP Research Center
(http://www.talp.upc.es/) of the Technical University of Catalonia
(UPC) and the Center of Language and Computation (CLiC,
http://clic.fil.ub.es/) of the University of Barcelona (UB), and
funded by the European Commission through the NAMIC project
(IST-1999-12392).

The Dutch data consist of four editions of the Belgian newspaper
"De Morgen" of 2000 (June 2, July 1, August 1 and September 1). The
data was annotated as a part of the Atranos project
(http://atranos.esat.kuleuven.ac.be/) at the University of Antwerp.