Classification Methods in Modern Authorship Analysis
Over the last decade, text categorization has developed substantially, providing effective methods that can handle thousands of documents and multiple categories. Beyond topic, style can also be used as a discriminating factor. Authorship analysis deals with the personal style of the authors of electronic documents. Typical authorship analysis tasks include authorship attribution (assigning a text of unknown authorship to one author from a set of candidates for whom text samples of undisputed authorship are available), authorship verification (deciding whether a given text was written by a certain author), and author profiling (extracting information about the age, education, sex, etc. of the author of a given text). Depending on the properties of a particular application, a specific text categorization setting has to be defined: single-label vs. multi-label classification, hierarchical vs. flat classification, closed-set vs. open-set classification.
In this presentation, we examine the main classification paradigms for style-based text categorization, focusing on how they treat writing style: cumulatively for each class or individually for each document. In more detail, we distinguish two main approaches: (i) the profile-based paradigm, which extracts a single representation vector per author, and (ii) the instance-based paradigm, which extracts one representation vector per document. Several state-of-the-art methods are examined, and a detailed comparison is provided based on factors such as the type of representation they can handle, the computational cost they require, and their ability to handle short texts and imbalanced training data, indicating their suitability for particular authorship analysis problems.
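
As a rough illustration of the difference between the two paradigms, the following minimal sketch builds both kinds of representation from the same training texts. It is not one of the methods examined in the presentation: the character trigram features, the cosine similarity, the nearest-neighbour decision rule, and all function names are assumptions chosen only to keep the example short.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Count overlapping character n-grams (assumed feature type)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def profile_based_vectors(training):
    """Profile-based: pool all texts of an author into ONE vector per author."""
    return {
        author: sum((char_ngrams(t) for t in texts), Counter())
        for author, texts in training.items()
    }


def instance_based_vectors(training):
    """Instance-based: ONE vector per document, labelled with its author."""
    return [
        (author, char_ngrams(text))
        for author, texts in training.items()
        for text in texts
    ]


def attribute_profile(training, unknown):
    """Assign the unknown text to the author with the most similar profile."""
    query = char_ngrams(unknown)
    profiles = profile_based_vectors(training)
    return max(profiles, key=lambda author: cosine(profiles[author], query))


def attribute_instance(training, unknown):
    """Assign the unknown text to the author of the most similar document."""
    query = char_ngrams(unknown)
    instances = instance_based_vectors(training)
    author, _ = max(instances, key=lambda inst: cosine(inst[1], query))
    return author
```

Here `training` is assumed to be a dictionary mapping each candidate author to a list of that author's texts. With two authors and two texts each, `profile_based_vectors` yields two vectors while `instance_based_vectors` yields four, which is exactly the cumulative versus individual treatment of style described above. In practice, the per-document vectors of the instance-based paradigm would typically be passed to a trained classifier rather than compared by nearest neighbour.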