-------------------------------------------------------------------------------
NOTE: ADDED 16 AUGUST 2016

The Reuters Corpus is not distributed on a cd anymore but as a single
compressed file rcv1.tar.xz . In order to extract the English shared 
task files from this file, you need the script ner/bin/make.eng.2016

1. download ner.tgz from http://www.clips.uantwerpen.be/conll2003/ner/
2. extract the files from ner.tgz: tar zxf ner.tgz
3. put the Reuters file rcv1.tar.xz in the new directory ner
4. run make.eng.2016 from directory ner: cd ner; bin/make.eng.2016

This should generate the three files eng.train, eng.testa and eng.testb 

Contact: erikt(at)xs4all.nl
-------------------------------------------------------------------------------

20030423 CONLL-2003 SHARED TASK


GENERAL

This is the 20030423 release of the data for the CoNLL-2003 shared 
task. In order to be able to use this data you need the Reuters 
Corpus cd (for the English data) and the ECI Multilingual Text cd 
(for the German data) which can be obtained via the following two 
addresses:

   http://trec.nist.gov/data/reuters/reuters.html
   http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC94T5

This distribution only contains annotation information and filter 
software. The words of the articles in the two corpora have not
been included here for copyright reasons. That is why you need the
two cds for building the complete data sets.

The CoNLL-2003 shared task deals with Language-Independent Named 
Entity Recognition. The two languages we deal with are English
and German. More information about this shared task can be found 
on the related web page http://www.cnts.ua.ac.be/conll2003/ner/


BUILDING THE TRAIN AND TEST DATA FILES

In order to obtain the data files you need to perform three steps:

   1. Extract the CoNLL-2003 files from the tar file available at
      http://www.cnts.ua.ac.be/conll2003/ner.tgz
      (tar zxf ner.tgz)
   2a (English data) Insert the first cd of the Reuters Corpus in 
      your computer and mount it (mount /mnt/cdrom)
   2b (German data) Insert the ECI Multilingual Text cd in 
      your computer and mount it (mount /mnt/cdrom)
   3. Run the relevant extraction software from the ner directory
      English: cd ner; bin/make.eng
      German:  cd ner; bin/make.deu

This will generate the training data (either eng.train or deu.train),
the development test data (eng.testa or deu.testa) and the final test 
data (eng.testb or deu.testb) in the ner directory. You can use the
development data as test data during the development process of your
system. When your system works well, it can be applied to the final
test data.

These instructions assume that you work on a Linux work station and
that the cd files are available from the directory /mnt/cdrom . This
procedure might not work on other platforms. 

News: January 26, 2006: Sven Hartrumpf from the FernUniversität in 
Hagen, Germany has checked and revised the entity annotations of the
German data. The new version is believed to be more accurate than 
the previous one, which was done by nonnative speakers. The files 
associated with the new annotations can be found in the directory 
ner/etc.2006


BUILDING THE UNANNOTATED DATA FILES

The unannotated data files can be built in the same way as the train 
and test files. However, because of their size the annotation of these
files has been stored in separate tar files which you should fetch
first. Make sure that you have fetched and unpacked the main tar file
ner.tgz because that contains the software for building the files
with unannotated data. Here are the steps you should perform:

   1. Extract the CoNLL-2003 files from the tar file available at
      http://www.cnts.ua.ac.be/conll2003/ner/ner.tgz
      (tar zxf ner.tgz)
   2a (English data) Extract the annotation files for the 
      unannotated data from
      http://www.cnts.ua.ac.be/conll2003/ner/eng.raw.tar
      (tar xf eng.raw.tar)
   2b (German data) Extract the annotation files for the 
      unannotated data from
      http://www.cnts.ua.ac.be/conll2003/ner/deu.raw.tar
      (tar xf deu.raw.tar)
   3a (English data) Insert the first cd of the Reuters Corpus in 
      your computer and mount it (mount /mnt/cdrom)
   3b (German data) Insert the ECI Multilingual Text cd in 
      your computer and mount it (mount /mnt/cdrom)
   4. Run the relevant extraction software from the ner directory
      English: cd ner; bin/make.eng.raw
      German:  cd ner; bin/make.deu.raw

This will generate the file eng.raw.gz (or deu.raw.gz) in the ner
directory. These files have been compressed with gzip.
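
As a minimal sketch (not part of this distribution), the compressed
file can be read line by line in Python without unpacking it first.
The function name iter_raw_lines and the Latin-1 encoding are
assumptions; change the encoding if your files differ.

```python
import gzip
from typing import Iterator


def iter_raw_lines(path: str) -> Iterator[str]:
    """Yield the lines of a gzip-compressed data file, newline removed."""
    # "rt" opens the file in text mode; the encoding is an assumption.
    with gzip.open(path, "rt", encoding="latin-1") as handle:
        for line in handle:
            yield line.rstrip("\n")
```

For example, iter_raw_lines("eng.raw.gz") yields one word line or
one empty line at a time, so the whole file never has to be held in
memory.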

These instructions assume that you work on a Linux work station and
that the cd files are available from the directory /mnt/cdrom . This
procedure might not work on other platforms.


DATA FORMAT

The data files contain one word per line. Empty lines have been used
for marking sentence boundaries and a line containing the keyword
-DOCSTART- has been added to the beginning of each article in order
to mark article boundaries. Each non-empty line contains the following 
tokens:

   1. the current word
   2. the lemma of the word (German only)
   3. the part-of-speech (POS) tag generated by a tagger
   4. the chunk tag generated by a text chunker
   5. the named entity tag given by human annotators
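
The layout described above can be read with a few lines of Python.
This is a sketch under stated assumptions, not software from this
distribution: the function name read_sentences is hypothetical, the
columns are assumed to be separated by white space, and the sample
lines are only an illustration of the English layout (no lemma
column). The -DOCSTART- lines are skipped here; keep them if you
need article boundaries.

```python
from typing import Iterable, Iterator, List


def read_sentences(lines: Iterable[str]) -> Iterator[List[List[str]]]:
    """Yield sentences as lists of rows, one list of columns per word."""
    sentence: List[List[str]] = []
    for line in lines:
        line = line.strip()
        if not line:
            # An empty line marks a sentence boundary.
            if sentence:
                yield sentence
                sentence = []
            continue
        columns = line.split()
        if columns[0] == "-DOCSTART-":
            # Article boundary marker, not a real word (skipped here).
            continue
        sentence.append(columns)
    if sentence:
        yield sentence


# Illustrative sample in the English layout: word, POS, chunk, entity.
sample = """-DOCSTART- -X- O O

EU NNP I-NP I-ORG
rejects VBZ I-VP O
German JJ I-NP I-MISC
call NN I-NP O

Peter NNP I-NP I-PER
Blackburn NNP I-NP I-PER
""".splitlines()

sentences = list(read_sentences(sample))
print(len(sentences))        # 2
print(sentences[0][0][-1])   # I-ORG
```

In every row the word is the first column and the named entity tag
is the last one, which holds for both the English and the German
files.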

The tagger and chunker for English are roughly similar to the 
ones used in the memory-based shallow parser demo available at 
http://ilk.uvt.nl/ . German POS and chunk information has been 
generated by the TreeTagger from the University of Stuttgart:
http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
In order to simulate a real natural language processing 
environment, the POS tags and chunk tags have not been checked. 
This means that they will contain errors. If you have access to 
annotation software with a performance that is superior to this, 
you may replace these tags by yours.

The chunk tags and the named entity tags use the IOB1 format. This 
means that in general words inside an entity receive the tag I-TYPE
to denote that they are Inside an entity of type TYPE. Whenever
two entities of the same type immediately follow each other, the 
first word of the second entity will receive tag B-TYPE rather than
I-TYPE in order to show that a new entity starts at that word.
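
The IOB1 rule can be made concrete with a short Python sketch that
turns a tag sequence into entity spans. The function name iob1_spans
is hypothetical and the code is not part of this distribution. It
assumes that a change of TYPE between two adjacent I- tags also
starts a new entity.

```python
from typing import List, Optional, Tuple


def iob1_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Return (type, start, end) spans from IOB1 tags, end exclusive."""
    spans: List[Tuple[str, int, int]] = []
    current: Optional[str] = None  # type of the entity being read
    start = 0
    # The extra "O" at the end closes an entity that ends the sentence.
    for i, tag in enumerate(tags + ["O"]):
        prefix, _, kind = tag.partition("-")
        # A new entity starts at B-TYPE, or at I-TYPE when no entity
        # of that type is currently open.
        begins_new = tag != "O" and (prefix == "B" or kind != current)
        if current is not None and (tag == "O" or begins_new):
            spans.append((current, start, i))
            current = None
        if begins_new:
            current, start = kind, i
    return spans


tags = ["I-PER", "I-PER", "B-PER", "O", "I-LOC"]
print(iob1_spans(tags))
# [('PER', 0, 2), ('PER', 2, 3), ('LOC', 4, 5)]
```

The B-PER tag at position 2 is what separates the two adjacent
person names. With I-PER in that position the first three words
would form a single entity.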

The raw data has the same format as the training and test material
but the final column has been omitted. There are word lists for 
English (extracted from the training data), German (extracted from 
the training data), and Dutch in the directory lists. You can 
probably use the Dutch person names (PER) for English data as well. Feel 
free to use any other external data sources that you might have 
access to.


GOALS

In the CoNLL-2002 shared task we worked on named entity 
recognition as well (Spanish and Dutch). The CoNLL-2003 shared 
task deals with two different languages (English and German). 
In addition, we supply extra resources: lists of named 
entities and non-annotated data. One of the main tasks of 
the participants in the CoNLL-2003 shared task will be to find 
out how these additional resources can be used to improve the 
performance of their system.


BASELINE

The baseline system for this shared task assigns named entity 
classes to word sequences that occur in the training data. Its 
performance can be computed as follows (example for English 
development data):

   bin/baseline eng.train eng.testa | bin/conlleval

and the results are:

   eng.testa: precision:  78.33%; recall:  65.23%; FB1:  71.18
   eng.testb: precision:  71.91%; recall:  50.90%; FB1:  59.61
   deu.testa: precision:  37.19%; recall:  26.07%; FB1:  30.65
   deu.testb: precision:  31.86%; recall:  28.89%; FB1:  30.30

If you build a system for this task, it should at least improve on
the performance of this baseline system.
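
The FB1 figures above are consistent with the usual definition of
the F-measure with beta=1, the harmonic mean of precision and
recall. The Python sketch below recomputes them from the precision
and recall values in the table. The function name f_beta1 is
hypothetical, and the code is a check of the arithmetic, not a
replacement for bin/conlleval.

```python
def f_beta1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as listed in the baseline results above.
baseline = [
    ("eng.testa", 78.33, 65.23),
    ("eng.testb", 71.91, 50.90),
    ("deu.testa", 37.19, 26.07),
    ("deu.testb", 31.86, 28.89),
]

for name, precision, recall in baseline:
    print(f"{name}: FB1: {f_beta1(precision, recall):6.2f}")
# eng.testa: FB1:  71.18
# eng.testb: FB1:  59.61
# deu.testa: FB1:  30.65
# deu.testb: FB1:  30.30
```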


Antwerp, April 23, 2003

Erik Tjong Kim Sang <erik.tjongkimsang@ua.ac.be>
Fien De Meulder <fien.demeulder@ua.ac.be>