Memory-based Arabic Morphological Analysis

Erwin Marsi 1, Abdelhadi Soudi 2 and Antal van den Bosch 1

1{e.c.marsi,antal.vdnbosch}@uvt.nl

ILK -- Computational Linguistics and AI
University of Tilburg -- The Netherlands

2asoudi@enim.ac.ma -- asoudi@dfki.de

Ecole Nationale de L'Industrie Minérale
Rabat -- Morocco

Deutsches Forschungszentrum für Künstliche Intelligenz
Saarbrücken -- Germany

We describe a machine-learning approach to the morphological analysis of Arabic words. The core of the analyser is a memory-based classifier that predicts operators at every consonant of an (unvowelled) Arabic word. The operators encode segmentation boundaries, morphosyntactic information, consonant changes, and vowel insertions for all possible readings. A morphologically analysed Arabic text corpus is used to generate a lexicon of unvowelled wordforms, each of which is associated with all of its (often many) possible analyses; the operators encode this variation by means of disjunctive expressions. The analyser works in two phases: first, the memory-based classifier generates operators letter by letter; second, an analysis generator module executes the predicted operations to construct all analyses.
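
The two phases can be illustrated with a minimal sketch. It is not the analyser described here: the toy lexicon, the operator notation (`ka|ku` meaning "emit ka or ku for this consonant"), the window width, the doubled weight on the focus letter, and all function names are assumptions made for illustration. The sketch covers vowel insertion only and leaves out segmentation, morphosyntactic information, and consonant changes.

```python
from itertools import product

PAD = "_"  # hypothetical padding symbol for word edges

# Hypothetical toy lexicon: each unvowelled wordform maps to one operator per
# consonant; "|" separates the alternatives of a disjunctive operator.
TOY_LEXICON = {
    "ktb": ["ka|ku", "ta|ti|tu", "ba|b"],
    "drs": ["da|du", "ra|ri|r", "sa|s"],
}


def windows(word, width=1):
    """One instance per letter: the focus letter plus `width` context letters on each side."""
    padded = PAD * width + word + PAD * width
    return [tuple(padded[i:i + 2 * width + 1]) for i in range(len(word))]


class MemoryBasedClassifier:
    """Keeps every training instance in memory and classifies by nearest neighbour (k = 1)."""

    def __init__(self):
        self.memory = []  # list of (instance, operator) pairs

    def similarity(self, a, b):
        # Overlap metric; the focus letter counts double (an assumed, crude
        # stand-in for feature weighting).
        focus = len(a) // 2
        return sum((2 if i == focus else 1)
                   for i, (x, y) in enumerate(zip(a, b)) if x == y)

    def predict(self, instance):
        # max() returns the first best match, so ties go to the earliest stored instance.
        best = max(self.memory, key=lambda m: self.similarity(m[0], instance))
        return best[1]


def train(lexicon, width=1):
    """Store one (window, operator) pair per consonant of every lexicon entry."""
    clf = MemoryBasedClassifier()
    for word, operators in lexicon.items():
        clf.memory.extend(zip(windows(word, width), operators))
    return clf


def generate_analyses(operators):
    """Phase 2: expand the disjunctions and concatenate the per-letter outputs."""
    alternatives = [op.split("|") for op in operators]
    return ["".join(combo) for combo in product(*alternatives)]


def analyse(word, clf, width=1):
    operators = [clf.predict(w) for w in windows(word, width)]  # phase 1
    return generate_analyses(operators)                         # phase 2


clf = train(TOY_LEXICON)
print(analyse("ktb", clf))
```

Because the classifier works on letter windows instead of whole words, the same code also returns operators for a letter string that is not in the toy lexicon, for example `analyse("krs", clf)`, by reusing the nearest stored windows.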

We discuss a number of challenges in applying memory-based learning to non-concatenative morphology, and examine the overgeneration of analyses that follows from context limitations in the original class representation we designed. Furthermore, we report on generalization experiments on unseen wordforms and provide a qualitative error analysis. As this is, to our knowledge, the first application of memory-based learning to a Semitic language, we also discuss the road ahead in this work in progress, including training the analyser on a large full-form lexicon of Arabic, which is under construction.
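
One way such overgeneration can arise is sketched below. This is an assumption about the mechanism, not the authors' own account. If the alternatives of several analyses are merged letter by letter into disjunctions, and each letter's disjunction is later expanded independently of the others, then combinations appear that were never in the lexicon. The data, the per-letter segment notation, and the function names are hypothetical.

```python
from itertools import product

# Hypothetical toy data: three analyses of one unvowelled wordform, each
# written as one output segment per consonant.
attested = [
    ("ka", "ta", "ba"),
    ("ku", "ti", "ba"),
    ("ku", "tu", "b"),
]


def merge_per_letter(analyses):
    """Collapse the analyses into one disjunction (a list of alternatives) per letter."""
    return [sorted(set(position)) for position in zip(*analyses)]


def expand(per_letter):
    """Expand each letter's disjunction independently of its neighbours."""
    return {"".join(combo) for combo in product(*per_letter)}


attested_forms = {"".join(a) for a in attested}
generated = expand(merge_per_letter(attested))
spurious = generated - attested_forms  # generated but absent from the toy set

print(len(attested_forms), len(generated), len(spurious))
```

In this toy case all three attested forms are recovered, but every other form in `generated` is one that the toy lexicon never licensed. The count of such forms grows with the product of the per-letter alternatives.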