Previous abstract | Contents | Next abstract

Bootstrapping POS taggers using unlabelled data

This paper investigates booststrapping part-of-speech taggers using co-training, in which two taggers are iteratively re-trained on each other's output. Since the output of the taggers is noisy, there is a question of which newly labelled examples to add to the training set. We investigate selecting examples by directly maximising tagger agreement on unlabelled data, a method which has been theoretically and empirically motivated in the co-training literature. results show that agreement-based co-training can significantly improve tagging performance for small seed datasets. Further results show that this form of co-training considerably outperforms self-training. However, we find that simply re-training on all the newly labelled data can, in some cases, yield comparable results to agreement-based co-training, with only a fraction of the computational cost.


Stephen Clark, James R. Curran and Miles Osborne, Bootstrapping POS taggers using unlabelled data. In: Proceedings of CoNLL-2003, Edmonton, Canada, 2003, pp. 49-55. [ps] [ps.gz] [pdf] [bibtex]
Last update: June 11, 2003. erikt@uia.ua.ac.be