PART: Pre-trained Authorship Representation Transformer

📅 2022-09-30
🏛️ arXiv.org
📈 Citations: 8
Influential: 0
🤖 AI Summary
To address weak generalization in semantic modeling and poor transferability in cross-domain author identification, this paper proposes a contrastive-learning pretraining framework for author-style embeddings. Unlike conventional semantic representations, the approach directly models transferable author-identity features by pretraining on large, heterogeneous text corpora, including literary works, blogs, and corporate emails. The model attains zero-shot author identification at 72.39% accuracy when bounded to 250 authors, 54% and 56% higher than RoBERTa embeddings on the respective test splits. Moreover, the learned embedding space supports linear decoding of sociodemographic attributes (e.g., gender, age, occupation), with both visualization and quantitative analysis confirming its structured representational capacity. The work moves away from the paradigm in which author identification relies on domain-specific labeled data and fine-tuning, pointing toward zero-shot stylistic modeling.
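The contrastive idea summarized above can be sketched concretely: texts by the same author act as positives and are pulled together in embedding space, while all other texts in the batch are pushed apart. Below is a minimal NumPy illustration of an InfoNCE-style supervised contrastive loss over author-labeled embeddings; the function name, temperature value, and toy batch are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def supervised_contrastive_loss(emb: np.ndarray, authors: np.ndarray,
                                tau: float = 0.1) -> float:
    """Toy supervised contrastive loss: for each text, other texts by the
    same author are positives; every other text in the batch is a negative."""
    z = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # L2-normalise rows
    sim = z @ z.T / tau                                   # temperature-scaled cosine sims
    n = len(authors)
    loss, count = 0.0, 0
    for i in range(n):
        pos = [j for j in range(n) if j != i and authors[j] == authors[i]]
        if not pos:
            continue                          # no positive pair for this anchor
        logits = np.delete(sim[i], i)         # drop self-similarity
        log_denom = np.log(np.exp(logits).sum())
        idx = [j if j < i else j - 1 for j in pos]  # positions after deletion
        loss += -(logits[idx] - log_denom).mean()   # -log p(positive | anchor)
        count += 1
    return loss / max(count, 1)
```

As a sanity check, a batch whose same-author embeddings are nearly identical should score a much lower loss than the same batch with mismatched author labels.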
📝 Abstract
Authors writing documents imprint identifying information within their texts: vocabulary, register, punctuation, misspellings, or even emoji usage. Previous works use hand-crafted features or classification tasks to train their authorship models, leading to poor performance on out-of-domain authors. Using stylometric representations is more suitable, but this by itself is an open research challenge. In this paper, we propose PART, a contrastively trained model fit to learn **authorship embeddings** instead of semantics. We train our model on ~1.5M texts belonging to 1162 literature authors, 17287 blog posters and 135 corporate email accounts; a heterogeneous set with identifiable writing styles. We evaluate the model on current challenges, achieving competitive performance. We also evaluate our model on test splits of the datasets, achieving zero-shot 72.39% accuracy when bounded to 250 authors, 54% and 56% higher than RoBERTa embeddings. We qualitatively assess the representations with different data visualizations on the available datasets, observing features such as gender, age, or occupation of the author.
Problem

Research questions and friction points this paper is trying to address.

Improving authorship identification using stylometric representations
Addressing poor performance on out-of-domain authors
Learning authorship embeddings instead of semantic features
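The out-of-domain setting above implies that attribution must work without retraining: embed a query text and match it against reference texts by known authors. A minimal sketch of one plausible decision rule, nearest author centroid by cosine similarity, is below; it assumes an encoder has already produced the embeddings and is not necessarily the paper's exact evaluation protocol.

```python
import numpy as np

def identify_author(query: np.ndarray, ref_embs: np.ndarray,
                    ref_authors: np.ndarray) -> int:
    """Zero-shot attribution: return the author whose mean reference
    embedding is closest (by cosine similarity) to the query embedding."""
    def norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    authors = np.unique(ref_authors)
    # One centroid per candidate author, built from normalised reference texts.
    centroids = np.stack([norm(ref_embs[ref_authors == a]).mean(axis=0)
                          for a in authors])
    scores = norm(centroids) @ norm(query)   # cosine score per author
    return int(authors[np.argmax(scores)])
```

Because no classifier head is trained, new authors can be added at inference time simply by supplying a few of their reference texts.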
Innovation

Methods, ideas, or system contributions that make the work stand out.

Contrastively trained model for authorship embeddings
Pre-trained on a diverse corpus of ~1.5M texts
Achieves 72.39% zero-shot accuracy
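The claim that attributes such as gender, age, or occupation are visible in the embedding space is typically tested with a linear probe: freeze the embeddings and fit a logistic regression per attribute. A minimal NumPy sketch follows; the training loop, learning rate, and epoch count are illustrative assumptions, not the paper's procedure.

```python
import numpy as np

def train_linear_probe(X: np.ndarray, y: np.ndarray,
                       lr: float = 0.5, epochs: int = 200):
    """Fit a logistic-regression probe on frozen embeddings X (n x d)
    against a binary attribute y, via plain gradient descent."""
    rng = np.random.default_rng(0)
    w = rng.normal(scale=0.01, size=X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid predictions
        g = p - y                               # gradient of log-loss wrt logits
        w -= lr * X.T @ g / len(y)
        b -= lr * g.mean()
    return w, b

def probe_accuracy(X: np.ndarray, y: np.ndarray, w: np.ndarray, b: float) -> float:
    """Fraction of examples whose sign of the linear score matches y."""
    return float(((X @ w + b > 0) == y).mean())
```

High probe accuracy indicates the attribute is linearly decodable from the frozen representation, which is the quantitative counterpart of the visualizations the paper reports.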