The pattern.nl module contains a fast, regular expression-based tagger/chunker for Dutch (it identifies nouns, adjectives, verbs, etc. in a sentence) and tools for Dutch verb conjugation and noun singularization & pluralization.
The functions in this module take the same parameters and return the same values as their counterparts in pattern.en. Refer to the documentation there for more details.
Noun singularization & pluralization
For Dutch nouns there are singularize() and pluralize(). The implementation is slightly less robust than the English version (accuracy 91% for singularization and 80% for pluralization).
>>> from pattern.nl import singularize, pluralize
>>> print(singularize('katten'))
kat
>>> print(pluralize('kat'))
katten
Verb conjugation
For Dutch verbs there are conjugate(), lemma(), lexeme() and tenses(). The lexicon for verb conjugation contains about 4,000 common Dutch verbs; for verbs not in the lexicon, it falls back to a rule-based approach with an accuracy of about 80%.
>>> from pattern.nl import conjugate, INFINITIVE
>>> print(conjugate('was', tense=INFINITIVE))
zijn
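The lookup-then-fallback behaviour described above can be sketched in a few lines. This is only an illustration under stated assumptions, not Pattern's implementation: `TOY_LEXICON`, `rule_based_lemma()` and `toy_lemma()` are hypothetical names, the lexicon holds a handful of forms rather than about 4,000 verbs, and the single suffix rule is deliberately naive.

```python
# Hypothetical miniature lexicon: inflected form -> infinitive.
TOY_LEXICON = {
    "was": "zijn",
    "ben": "zijn",
    "is": "zijn",
    "had": "hebben",
}

def rule_based_lemma(verb):
    """Naive fallback (an assumption for illustration): treat a final
    -te or -de as a weak past-tense ending and replace it with -en."""
    if len(verb) > 3 and verb.endswith(("te", "de")):
        return verb[:-2] + "en"
    return verb

def toy_lemma(verb):
    """Look the verb up in the lexicon first; guess by rule otherwise."""
    verb = verb.lower()
    if verb in TOY_LEXICON:
        return TOY_LEXICON[verb]
    return rule_based_lemma(verb)

print(toy_lemma("was"))     # found in the lexicon: zijn
print(toy_lemma("werkte"))  # not in the lexicon, the rule applies: werken
```

A rule this simple gets forms such as speelde wrong (it yields speelen rather than spelen), which illustrates why a rule-based fallback is less accurate than a lexicon lookup.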
Attributive & predicative adjectives
Dutch adjectives followed by a noun inflect with an -e suffix (e.g., braaf → brave kat). You can get the base form with the predicative() function, or vice versa with attributive(). Accuracy is 99%.
>>> from pattern.nl import attributive, predicative
>>> print(predicative('brave'))
braaf
>>> print(attributive('braaf'))
brave
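To give a feel for why adding -e is more than string concatenation, here is a rough sketch of three Dutch spelling adjustments. `toy_attributive()` is a hypothetical helper written for this example; it is not how Pattern's attributive() works and it covers only a few regular cases.

```python
import re

def toy_attributive(adjective):
    """Very rough sketch of Dutch -e inflection (illustrative only)."""
    w = adjective
    # Doubled long vowel + one final consonant: write the vowel once,
    # voice a final f or s, then add -e (braaf -> brave, groot -> grote).
    m = re.search(r"(aa|ee|oo|uu)([^aeiou])$", w)
    if m:
        consonant = {"f": "v", "s": "z"}.get(m.group(2), m.group(2))
        return w[:m.start()] + m.group(1)[0] + consonant + "e"
    # Short vowel + one final consonant: double the consonant
    # before adding -e (dik -> dikke).
    if re.search(r"[^aeiou][aeiou][^aeiouwj]$", w):
        return w + w[-1] + "e"
    # Otherwise just append -e (mooi -> mooie).
    return w + "e"

for word in ("braaf", "groot", "dik", "mooi"):
    print(word, "->", toy_attributive(word))
```

The sketch fails on many ordinary words (for example, it would wrongly double the final consonant of aardig), so it should be read as an explanation of the spelling rules rather than a substitute for the library function.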
Sentiment analysis
For opinion mining there is sentiment(), which returns a (polarity, subjectivity) tuple based on a lexicon of adjectives. It has an accuracy of 81% (precision 0.77, recall 0.84) for book reviews:
>>> from pattern.nl import sentiment
>>> print(sentiment('Een onwijs spannend goed boek!'))
(0.55, 0.90)
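A returned tuple can be turned into a coarse label with a small helper. The sketch below assumes, as in pattern.en, that polarity ranges from -1.0 to +1.0 and subjectivity from 0.0 to 1.0. The `describe()` function and its `threshold` value are hypothetical choices for this example, not part of the module.

```python
def describe(score, threshold=0.1):
    """Map a (polarity, subjectivity) tuple to a coarse label.

    The threshold is an assumption chosen for this sketch; tune it
    on your own data.
    """
    polarity, subjectivity = score
    if polarity > threshold:
        label = "positive"
    elif polarity < -threshold:
        label = "negative"
    else:
        label = "neutral"
    return label, subjectivity

# The tuple shown in the example above.
label, subjectivity = describe((0.55, 0.90))
print(label)         # positive
print(subjectivity)  # 0.9
```

Keeping the subjectivity score alongside the label makes it possible to discard low-subjectivity (mostly factual) sentences before counting opinions.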
Parser
For parsing there are parse(), parsetree() and split(). Words processed with parse() are assigned tags such as NN (nouns) or VB (verbs). See the pattern.en documentation for how to manipulate the Sentence objects returned from split().
>>> from pattern.nl import parse, split
>>> s = parse('De kat zit op de mat.')
>>> s = split(s)
>>> print(s.sentences)
Sentence('De/DT/B-NP/O kat/NN/I-NP/O zit/VBZ/B-VP/O op/IN/B-PP/B-PNP'
         'de/DT/B-NP/I-PNP mat/NN/I-NP/I-PNP ././O/O')
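The slash-separated output can also be read without the library. The sketch below copies the tagged string from the example above (with a space between its two halves) and assumes each token has exactly four fields: word, part-of-speech tag, chunk tag and PNP tag. `Token`, `read_tokens()` and `noun_phrases()` are hypothetical names used only here.

```python
from collections import namedtuple

Token = namedtuple("Token", "word pos chunk pnp")

tagged = ("De/DT/B-NP/O kat/NN/I-NP/O zit/VBZ/B-VP/O op/IN/B-PP/B-PNP "
          "de/DT/B-NP/I-PNP mat/NN/I-NP/I-PNP ././O/O")

def read_tokens(tagged_sentence):
    # Split from the right so each token yields exactly four fields.
    return [Token(*item.rsplit("/", 3)) for item in tagged_sentence.split()]

def noun_phrases(tokens):
    """Group words by their B-NP / I-NP chunk tags."""
    phrases, current = [], []
    for t in tokens:
        if t.chunk == "B-NP":          # a new noun phrase begins
            if current:
                phrases.append(" ".join(current))
            current = [t.word]
        elif t.chunk == "I-NP" and current:
            current.append(t.word)     # continue the open noun phrase
        elif current:                  # any other tag closes it
            phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases

tokens = read_tokens(tagged)
print([t.word for t in tokens if t.pos.startswith("NN")])  # ['kat', 'mat']
print(noun_phrases(tokens))  # ['De kat', 'de mat']
```

For real work the Sentence objects returned by split() are the more convenient interface; this only shows what the tags in the string encode.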
The parser is built on Jeroen Geertzen's Dutch language model. The reported accuracy is around 92%, but the score for the implementation in Pattern can vary slightly from Geertzen's results, since the original WOTAN tagset is mapped to Penn Treebank tags. If you need to work with the original tags, you can call parse() with the optional parameter tagset="WOTAN".
Reference: Geertzen, J. (2010), Brill-NL, http://cosmion.net/jeroen/software/brill_pos/