PhD dissertation "Scalability Issues in Authorship Attribution"

Abstract (in English) or Samenvatting (in Dutch)
If you want a copy of the dissertation, just send me an e-mail.

Proefschrift voorgelegd tot het behalen van de graad van doctor in de Taalkunde aan de Universiteit Antwerpen
Dissertation presented to the jury in order to obtain the academic degree of PhD in Linguistics at the University of Antwerp

by Kim Luyckx



   

PhD dissertation "Scalability Issues in Authorship Attribution"

Tuesday 7 December 2010, 3 PM
Promotiezaal Grauwzustersklooster, Lange Sint-Annastraat 7, 2000 Antwerp (Belgium) [campus map; directions]

Jury: Harald Baayen (University of Alberta, Canada), Efstathios Stamatatos (University of the Aegean, Samos, Greece), Véronique Hoste (Hogeschool Gent), Steven Gillis (UA), Dominiek Sandra (president of the jury), Walter Daelemans (promotor, UA)

At the start of the defense, I will give a 30-minute presentation about my dissertation. The defense will be followed by a reception, to which you are also kindly invited.

The defense is public. Please send an e-mail to kim.luyckx@ua.ac.be to confirm your presence (by 1 December).


Invited talks by Harald Baayen and Efstathios Stamatatos [CLiPS-CLIF colloquium]

Monday 6 December 2010, 2 PM
E 201, Prinsstraat 13, 2000 Antwerp (Belgium) [campus map; directions]

2 PM: Harald Baayen on Exploring the potential of naive discriminative learning for the analysis of (psycho)linguistic data

In 1972, Rescorla and Wagner formulated recurrence equations for human and animal learning that have proved to be surprisingly fruitful in psychology.  Danks (2003) introduced a technical innovation that makes it possible to very efficiently estimate the state of the learning system when it is in equilibrium.  In my presentation, I will present two examples demonstrating that Rescorla-Wagner-Danks discriminative learning has much to offer for linguistic and psycholinguistic modelling as well as data analysis.

First, I will introduce a computational model predicting lexical decision latencies for visual comprehension based on naive discriminative learning.  The model is very sparse in free parameters, yet explains a wide range of empirical findings, including whole-word and phrasal frequency effects, without having to posit separate representations for complex words or phrases.  In other words, the model combines excellent predictions with extreme representational parsimony.

Second, I will discuss examples where naive discriminative learning appears to outperform logistic mixed models fitted to the same data.  Furthermore, naive discriminative learning provides the researcher with sufficient detail to pinpoint a potential weakness of the mixed-effects regression modeling approach.  For the data set examined thus far in this line of research, it seems that naive discriminative learning has the potential to be developed into a statistical tool complementing other classifiers such as logistic and polytomous mixed-effects models, random forests, and nearest-neighbor-based methods.
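
As background to the abstract above, the trial-by-trial Rescorla-Wagner update can be sketched in a few lines of Python. This is only an illustration of the classic recurrence, not the speaker's model and not the equilibrium estimation attributed to Danks (2003). The function name, the cue names, and the single `rate` parameter (standing in for the product of the usual salience and learning-rate parameters) are assumptions made for this example.

```python
def rescorla_wagner(events, rate=0.1, lam=1.0):
    """Return a dict cue -> association strength after processing all events.

    events: iterable of (cues_present, outcome_present) pairs, in trial order.
    rate:   one learning rate standing in for alpha * beta (an assumption).
    lam:    maximum association strength the outcome can support.
    """
    weights = {}
    for cues, outcome in events:
        # Summed prediction of all cues present on this trial.
        prediction = sum(weights.get(c, 0.0) for c in cues)
        # The outcome's presence sets the target to lam, its absence to 0.
        target = lam if outcome else 0.0
        delta = rate * (target - prediction)
        # Only cues present on the trial are updated.
        for c in cues:
            weights[c] = weights.get(c, 0.0) + delta
    return weights


# Hypothetical toy data: "light" is always followed by the outcome,
# "tone" never is.
trials = [({"light"}, True)] * 200 + [({"tone"}, False)] * 50
strengths = rescorla_wagner(trials)
```

With these toy trials the strength of "light" approaches `lam`, while "tone" stays at zero because it never predicts the outcome.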

Please register here

3:30 PM: Efstathios Stamatatos on Classification Methods in Modern Authorship Analysis

Over the last decade, text categorization has developed substantially, providing effective methods that can deal with thousands of documents and multiple categories. Beyond topic, style can also be used as a discriminating factor. Authorship analysis deals with the personal style of the authors of electronic documents. Typical authorship analysis tasks include authorship attribution (a text of unknown authorship is assigned to one candidate author, given a set of candidate authors for whom text samples of undisputed authorship are available), authorship verification (deciding whether a given text was written by a certain author), and author profiling (extracting information about the age, education, sex, etc. of the author of a given text). Depending on the properties of a particular application, a specific setting of the text categorization task has to be defined: single-label vs. multi-label classification, hierarchical vs. flat classification, closed-set vs. open-set classification.

In this presentation, we examine the main classification paradigms for style-based text categorization, focusing on how they treat writing style: cumulatively for each class or individually for each document. In more detail, we distinguish two main approaches: (i) the profile-based paradigm, which aims to extract only one representation vector per author, and (ii) the instance-based paradigm, which aims to extract one representation vector per document. Several state-of-the-art methods are examined, and a detailed comparison is provided based on factors such as the type of representation they can handle, the required computational cost, and the ability to handle short texts and imbalanced training data, suggesting their suitability for certain authorship analysis problems.
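
The difference between the two paradigms named in the abstract can be illustrated with a minimal Python sketch. It is not one of the methods compared in the talk. The character-trigram features, the cosine similarity, the nearest-neighbour decision rule, and all function names and toy texts are assumptions chosen to keep the example short.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count overlapping character n-grams (an assumed, simple style feature)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def profile_based(train, unknown):
    """One cumulative vector per author, built by summing that author's counts."""
    profiles = {}
    for author, text in train:
        profiles.setdefault(author, Counter()).update(char_ngrams(text))
    query = char_ngrams(unknown)
    return max(profiles, key=lambda author: cosine(profiles[author], query))


def instance_based(train, unknown):
    """One vector per training document; here the most similar document decides."""
    query = char_ngrams(unknown)
    vectors = [(author, char_ngrams(text)) for author, text in train]
    best_author, _ = max(vectors, key=lambda pair: cosine(pair[1], query))
    return best_author


# Hypothetical toy corpus: (author label, text sample).
train = [
    ("A", "the cat sat on the mat"),
    ("A", "the cat ate the rat"),
    ("B", "quantum flux zigzags wildly"),
    ("B", "jazzy quartz vexes folk"),
]
unknown = "the cat sat on the hat"
profile_guess = profile_based(train, unknown)
instance_guess = instance_based(train, unknown)
```

In the profile-based function the per-document boundaries disappear once the counts are summed, whereas the instance-based function keeps every document as a separate training example, which is what allows a standard classifier to be trained on them.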

Please register here


Sponsored by CLiF, the scientific research community for Computational Linguistics and Language Technology.