Computational stylometry (PhD topic)
Supervisor: Walter Daelemans
Embedded in the Computational Techniques for Stylometry for Dutch project (funded by FWO - National Science Foundation Flanders)
Style characteristics in text are automatically extracted and analysed using automatic linguistic analysis and supervised and unsupervised machine learning techniques. The focus is on the application of these techniques in authorship attribution and verification, and in personality prediction.
The main methodology covers several aspects:
- 1. Automatic linguistic analysis of documents by means of available text analysis tools at the level of morphological structure, part of speech, global syntactic structures and semantic roles (subject, object, temporal, location), in order to construct potentially relevant stylistic characteristics.
- 2. Unsupervised and supervised learning techniques for selecting characteristics with high information value and constructing a model of authorial style.
- 3. Evaluation of these models by (a) comparison with stylistic analyses in linguistics and literary science and (b) empirical testing of the predictive power of the models.
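The three steps above can be illustrated with a minimal sketch in Python. It is not the method used in the project: it assumes a hypothetical list of English function words as the only style features, skips the feature selection of step 2, and uses a 1-nearest-neighbour classifier with leave-one-out testing purely for brevity. All function names are invented for this illustration.

```python
import math
from collections import Counter

# Hypothetical feature set: a real system would draw on morphological,
# part-of-speech, syntactic and semantic-role features instead.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it", "but", "not"]


def extract_features(text):
    """Step 1 (simplified): relative frequency of each function word."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[word] / total for word in FUNCTION_WORDS]


def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def predict(train, features):
    """Step 2 (simplified): return the author of the closest training vector."""
    return min(train, key=lambda item: distance(item[0], features))[1]


def leave_one_out_accuracy(samples):
    """Step 3(b): test predictive power on held-out documents.

    samples is a list of (feature_vector, author) pairs.
    """
    correct = 0
    for i, (features, author) in enumerate(samples):
        train = samples[:i] + samples[i + 1:]
        if predict(train, features) == author:
            correct += 1
    return correct / len(samples)
```

Calling `extract_features` on each document and passing the resulting `(vector, author)` pairs to `leave_one_out_accuracy` gives a rough estimate of how well the chosen features separate the authors.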
The most recent developments in my PhD research are described in this COLING 2008 paper. We are currently developing a stylometry tool in Python with an online interface, called TACTiCS (Tool for Analyzing and Categorizing Texts using Characteristics of Style), for which a teaser demo is available. We presented a poster about it at the 19th meeting of Computational Linguistics in the Netherlands (CLIN).
Previous projects
Automatic Speech Recognition (2004-2006)
From 2004 to 2006, I worked as a researcher on the Flexible Large Vocabulary Recognition project (FLaVoR), which was a cooperation between CNTS and the ESAT/PSI speech group (University of Leuven), funded by the Flemish government (IWT).
At the Center for Dutch Language and Speech (CNTS), I did morphophonological research on speech variation and variants in Dutch. That knowledge was integrated into a speech recogniser in order to optimise the recognition of spontaneous speech. The Spoken Dutch Corpus (CGN), which contains a large amount of spontaneous speech, forms the empirical basis for this research. For part of the sound fragments, a broad phonetic transcription is available. That transcription contains typical features of spontaneous speech, such as vowel reduction or insertion and consonant deletion. In order to track down these features, I compared the broad phonetic transcriptions from the CGN with a phonological transcription of the same sound fragments: based on FONILEX and CELEX, every word form is assigned a standard phonological transcription. Deviations between the broad phonetic transcriptions and the standardised ones indicate speech variation.
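The comparison of transcriptions described above can be sketched as a sequence alignment. The following is only an illustration, not the procedure used in FLaVoR: it assumes transcriptions are given as lists of phoneme symbols and uses the generic alignment from Python's standard `difflib` module. The function name `find_variation` and the example symbols are invented and are not taken from CGN, FONILEX or CELEX.

```python
from difflib import SequenceMatcher


def find_variation(canonical, observed):
    """Align two phoneme sequences and report where they differ.

    Returns a list of (operation, canonical_segment, observed_segment)
    tuples, where operation is "delete", "insert" or "replace".
    """
    matcher = SequenceMatcher(a=canonical, b=observed, autojunk=False)
    return [
        (tag, canonical[i1:i2], observed[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only the deviations
    ]


# Invented example: a standard form versus a reduced spontaneous form
# in which the first two segments are dropped.
canonical = ["n", "a", "t", "y", "r", "l", "@", "k"]
observed = ["t", "y", "r", "l", "@", "k"]

for operation, standard, spoken in find_variation(canonical, observed):
    print(operation, standard, spoken)
```

A general-purpose matcher like this treats all symbols as equally different; an alignment that weighs phonetic similarity would be a natural refinement.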
MA Thesis in Authorship Attribution (2002-2004)
My MA thesis Syntax-based Features and Machine Learning Techniques for Authorship Attribution was supervised by Prof. dr. Walter Daelemans. The Belgian newspaper De Standaard devoted an article to my thesis.
[ABSTRACT] In the framework of the domain of Text Mining, and Document Classification in particular, an introduction into the field of Authorship Attribution is presented. Especially syntax-based features and Machine Learning techniques suggest a new path in the identification of the author of a text which has not been subject to elaborate research yet. This paper reports experiments in Authorship Attribution in which the efficiency of syntax-based features is compared with that of lexical and token-based features. Memory Based Shallow Parsing (MBSP), scripts in the AWK programming language and the Rainbow program for statistical text classification are applied for the extraction of style markers. Classification is performed by means of Machine Learning algorithms. The syntax-based features achieve an accuracy of 55% after testing on new data, while a combination of syntax-based, lexical and token-level features performs seven percent better. These results suggest that syntax-based features are good clues but not yet able to identify the author of a text independently. Nevertheless, we believe that extensive research on syntax-based features will lead to success.