20030423 CONLL-2003 SHARED TASK

GENERAL

This is the 20030423 release of the data for the CoNLL-2003 shared
task. In order to use this data you need the Reuters Corpus cd (for
the English data) and the ECI Multilingual Text cd (for the German
data), which can be obtained via the following two addresses:

   http://trec.nist.gov/data/reuters/reuters.html
   http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC94T5

This distribution only contains annotation information and filter
software. The words of the articles in the two corpora have not been
included here for copyright reasons. That is why you need the two
cds for building the complete data sets.

The CoNLL-2003 shared task deals with Language-Independent Named
Entity Recognition. The two languages we deal with are English and
German. More information about this shared task can be found on the
related web page http://www.cnts.ua.ac.be/conll2003/ner/

BUILDING THE TRAIN AND TEST DATA FILES

In order to obtain the data files you need to perform three steps:

1. Extract the CoNLL-2003 files from the tar file available at
   http://www.cnts.ua.ac.be/conll2003/ner.tgz (tar zxf ner.tgz)
2a (English data) Insert the first cd of the Reuters Corpus in
   your computer and mount it (mount /mnt/cdrom)
2b (German data) Insert the ECI Multilingual Text cd in
   your computer and mount it (mount /mnt/cdrom)
3. Run the relevant extraction software from the ner directory
   English: cd ner; bin/make.eng
   German:  cd ner; bin/make.deu

This will generate the training data (either eng.train or deu.train),
the development test data (eng.testa or deu.testa) and the final
test data (eng.testb or deu.testb) in the ner directory. You can
use the development data as test data during the development process
of your system. When your system works well, it can be applied to
the final test data.

These instructions assume that you work on a Linux workstation and
that the cd files are available from the directory /mnt/cdrom.
This procedure might not work on other platforms.

News: January 26, 2006: Sven Hartrumpf from the FernUniversität in
Hagen, Germany has checked and revised the entity annotations of the
German data. The new version is believed to be more accurate than
the previous one, which was done by non-native speakers. The files
associated with the new annotations can be found in the directory
ner/etc.2006

BUILDING THE UNANNOTATED DATA FILES

The unannotated data files can be built in the same way as the train
and test files. However, because of their size, the annotation of
these files has been stored in separate tar files which you should
fetch first. Make sure that you have fetched and unpacked the main
tar file ner.tgz, because it contains the software for building the
files with unannotated data. Here are the steps you should perform:

1. Extract the CoNLL-2003 files from the tar file available at
   http://www.cnts.ua.ac.be/conll2003/ner/ner.tgz (tar zxf ner.tgz)
2a (English data) Extract the annotation files for the unannotated
   data from http://www.cnts.ua.ac.be/conll2003/ner/eng.raw.tar
   (tar xf eng.raw.tar)
2b (German data) Extract the annotation files for the unannotated
   data from http://www.cnts.ua.ac.be/conll2003/ner/deu.raw.tar
   (tar xf deu.raw.tar)
3a (English data) Insert the first cd of the Reuters Corpus in
   your computer and mount it (mount /mnt/cdrom)
3b (German data) Insert the ECI Multilingual Text cd in
   your computer and mount it (mount /mnt/cdrom)
4. Run the relevant extraction software from the ner directory
   English: cd ner; bin/make.eng.raw
   German:  cd ner; bin/make.deu.raw

This will generate the file eng.raw.gz (or deu.raw.gz) in the ner
directory. These files have been compressed with gzip.

These instructions assume that you work on a Linux workstation and
that the cd files are available from the directory /mnt/cdrom.
This procedure might not work on other platforms.

DATA FORMAT

The data files contain one word per line.
Empty lines have been used for marking sentence boundaries, and a
line containing the keyword -DOCSTART- has been added to the
beginning of each article in order to mark article boundaries.
Each non-empty line contains the following tokens:

1. the current word
2. the lemma of the word (German only)
3. the part-of-speech (POS) tag generated by a tagger
4. the chunk tag generated by a text chunker
5. the named entity tag given by human annotators

The tagger and chunker for English are roughly similar to the
ones used in the memory-based shallow parser demo available at
http://ilk.uvt.nl/
German POS and chunk information has been generated by the
Treetagger from the University of Stuttgart:
http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/

In order to simulate a real natural language processing
environment, the POS tags and chunk tags have not been checked.
This means that they will contain errors. If you have access to
annotation software with a performance that is superior to this,
you may replace these tags by yours.

The chunk tags and the named entity tags use the IOB1 format.
This means that in general words inside an entity receive the tag
I-TYPE to denote that they are Inside an entity of type TYPE.
Whenever two entities of the same type immediately follow each
other, the first word of the second entity will receive the tag
B-TYPE rather than I-TYPE in order to show that a new entity
starts at that word. Words outside any entity receive the tag O.

The raw data has the same format as the training and test material
but the final column has been omitted. There are word lists for
English (extracted from the training data), German (extracted from
the training data), and Dutch in the directory lists. Probably you
can use the Dutch person names (PER) for English data as well. Feel
free to use any other external data sources that you might have
access to.

GOALS

In the CoNLL-2002 shared task we have worked on named entity
recognition as well (Spanish and Dutch).
The CoNLL-2003 shared task deals with two different languages
(English and German). In addition, we supply extra information:
lists of named entities and non-annotated data. One of the main
tasks of the participants in the CoNLL-2003 shared task will be
to find out how these additional resources can be used to improve
the performance of their system.

BASELINE

The baseline for this shared task is obtained by assigning named
entity classes to word sequences that occur in the training data.
It can be computed as follows (example for English development
data):

   bin/baseline eng.train eng.testa | bin/conlleval

and the results are:

   eng.testa: precision: 78.33%; recall: 65.23%; FB1: 71.18
   eng.testb: precision: 71.91%; recall: 50.90%; FB1: 59.61
   deu.testa: precision: 37.19%; recall: 26.07%; FB1: 30.65
   deu.testb: precision: 31.86%; recall: 28.89%; FB1: 30.30

If you build a system for this task, it should at least improve
on the performance of this baseline system.

Antwerp, April 23, 2003

Erik Tjong Kim Sang
Fien De Meulder
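
EXAMPLE: READING THE DATA FILES

The layout described under DATA FORMAT (one word per line, empty
lines between sentences, -DOCSTART- lines between articles, IOB1
tags in the final column) can be read with a few lines of code.
The following is a minimal Python sketch and is not part of the
distributed software. The function names read_conll and
iob1_entities are hypothetical, and the sample sentence, including
the fields on its -DOCSTART- line, is invented for illustration. It
assumes whitespace-separated columns with the word first and the
named entity tag last, which covers both the English and the German
column layout.

```python
def read_conll(lines):
    """Yield sentences as lists of token tuples, skipping -DOCSTART- lines."""
    sentence = []
    for line in lines:
        fields = line.split()
        if not fields:
            # An empty line marks a sentence boundary.
            if sentence:
                yield sentence
                sentence = []
        elif fields[0] == "-DOCSTART-":
            # Article boundary marker: carries no word, so skip it.
            continue
        else:
            sentence.append(tuple(fields))
    if sentence:
        yield sentence


def iob1_entities(tags):
    """Return (start, end, type) spans (end exclusive) from IOB1 tags."""
    spans = []
    start, etype = None, None
    for i, tag in enumerate(tags):
        prefix, _, ttype = tag.partition("-")
        # The current entity ends at O, at a B- tag, or at a type change.
        if tag == "O" or prefix == "B" or ttype != etype:
            if etype is not None:
                spans.append((start, i, etype))
            start, etype = (None, None) if tag == "O" else (i, ttype)
    if etype is not None:
        spans.append((start, len(tags), etype))
    return spans


# Invented sample in the English layout: word, POS tag, chunk tag, NE tag.
SAMPLE = """\
-DOCSTART- -X- -X- O

Alice NNP I-NP I-PER
Smith NNP I-NP I-PER
visited VBD I-VP O
Paris NNP I-NP I-LOC
Texas NNP I-NP B-LOC
. . O O
"""

for sentence in read_conll(SAMPLE.splitlines()):
    words = [token[0] for token in sentence]
    tags = [token[-1] for token in sentence]
    for start, end, etype in iob1_entities(tags):
        print(etype, " ".join(words[start:end]))
```

In the sample, Paris and Texas are tagged as two adjacent entities
of the same type, so the second one starts with B-LOC and the
sketch reports them as separate LOC entities. To read a real file,
pass an open file object to read_conll instead of the sample lines.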