The peer-review system of most academic conferences relies on the anonymity of both the authors and the reviewers of submissions. The anonymity requirement for authors in particular is heavily disputed, and its pros and cons have so far been discussed exclusively on a qualitative level.
In this paper, we contribute a quantitative argument to this discussion by showing that it is possible for a machine to reveal the identity of authors of scientific publications with high accuracy. We attack the anonymity of authors using statistical analysis of multiple heterogeneous aspects of a paper, such as its citations, its writing style, and its content. We apply several multi-label, multi-class machine learning methods to model the patterns exhibited in each feature category for individual authors and combine them into a single ensemble classifier to deanonymize authors with high accuracy. To the best of our knowledge, this is the first approach that exploits multiple categories of discriminative features and uses multiple, partially complementary classifiers in a single, focused attack on the anonymity of the authors of an academic publication.
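To make the idea of combining per-category classifiers concrete, the following is a minimal sketch and not the deAnon implementation. It assumes each feature category (citations, style, content) yields a non-negative score per candidate author, and that the ensemble is a weighted sum of normalized scores; the combination rule, the function names `normalize` and `combine_rankings`, the weights, and the toy scores are all hypothetical.

```python
from typing import Dict, List


def normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one classifier's scores to sum to 1 (assumes non-negative scores)."""
    total = sum(scores.values())
    if total == 0:
        return {author: 0.0 for author in scores}
    return {author: s / total for author, s in scores.items()}


def combine_rankings(per_category: Dict[str, Dict[str, float]],
                     weights: Dict[str, float]) -> List[str]:
    """Weighted sum of normalized per-category scores, returned as a ranked author list.

    Assumption: a simple weighted average stands in for the ensemble step.
    """
    combined: Dict[str, float] = {}
    for category, scores in per_category.items():
        w = weights.get(category, 1.0)
        for author, s in normalize(scores).items():
            combined[author] = combined.get(author, 0.0) + w * s
    # Highest combined score first; ties broken by name for determinism.
    return sorted(combined, key=lambda a: (-combined[a], a))


# Hypothetical per-category scores for one anonymous paper.
scores = {
    "citations": {"alice": 0.6, "bob": 0.3, "carol": 0.1},
    "style":     {"alice": 0.2, "bob": 0.5, "carol": 0.3},
    "content":   {"alice": 0.5, "bob": 0.1, "carol": 0.4},
}
weights = {"citations": 1.0, "style": 1.0, "content": 1.0}
ranking = combine_rankings(scores, weights)
print(ranking)
```

In this toy case no single category decides the outcome alone: the style scores favor a different candidate than the citation and content scores, and the combined ranking reflects all three.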
We evaluate our author identification framework, deAnon, on a real-world data set of 3,894 papers. From these papers, we target 1,405 productive authors who each have at least 3 publications in our data set. For each anonymous paper, our approach returns a ranking of probable authors, that is, the order in which to guess the paper's authors. In our experiments, following this ranking, the first guess corresponds to one of the authors of a paper in 39.7% of the cases, and at least one of the authors is among the top 10 guesses in 65.6% of all cases. Thus, deAnon significantly outperforms current state-of-the-art techniques for automatic deanonymization.
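The two reported figures are top-k hit rates: a paper counts as a hit if at least one of its true authors appears among the first k guesses. The sketch below illustrates that metric on made-up data; the function name `top_k_hit_rate`, the author names, and the rankings are hypothetical and unrelated to the actual data set.

```python
from typing import List, Set


def top_k_hit_rate(rankings: List[List[str]],
                   true_authors: List[Set[str]],
                   k: int) -> float:
    """Fraction of papers with at least one true author among the first k guesses."""
    if not rankings:
        return 0.0
    hits = sum(
        1
        for ranked, truth in zip(rankings, true_authors)
        if truth & set(ranked[:k])  # non-empty intersection counts as a hit
    )
    return hits / len(rankings)


# Hypothetical rankings for four anonymous papers and their true author sets.
rankings = [
    ["alice", "bob", "carol"],
    ["bob", "carol", "alice"],
    ["carol", "alice", "bob"],
    ["bob", "alice", "carol"],
]
truth = [{"alice"}, {"alice", "carol"}, {"bob"}, {"carol"}]

top1 = top_k_hit_rate(rankings, truth, 1)
top2 = top_k_hit_rate(rankings, truth, 2)
print(top1, top2)
```

Because a multi-author paper counts as a hit when any one of its authors is found, the rate can only stay equal or grow as k increases.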