🤖 AI Summary
To address the weak generalization of multilingual author representations and the poor cross-lingual transfer of monolingual models, this paper proposes an end-to-end multilingual authorship representation framework. The method introduces probabilistic content masking and language-aware batching to explicitly decouple writing style from semantic content, mitigating cross-lingual interference and stabilizing contrastive learning. Built on a multilingual pre-trained architecture, the model is jointly trained on a large-scale dataset of over 4.5 million authors spanning 36 languages and 13 domains. Experimental results show substantial improvements: the proposed approach outperforms monolingual baselines on 21 of 22 non-English languages, with an average Recall@8 gain of 4.85% (up to 15.91% in a single language), demonstrating the benefit of multilingual joint modeling for author style representation.
📝 Abstract
Authorship representation (AR) learning, which models an author's unique writing style, has demonstrated strong performance in authorship attribution tasks. However, prior research has primarily focused on monolingual settings, mostly in English, leaving the potential benefits of multilingual AR models underexplored. We introduce a novel method for multilingual AR learning that incorporates two key innovations: probabilistic content masking, which encourages the model to focus on stylistically indicative words rather than content-specific words, and language-aware batching, which improves contrastive learning by reducing cross-lingual interference. Our model is trained on over 4.5 million authors across 36 languages and 13 domains. It consistently outperforms monolingual baselines in 21 out of 22 non-English languages, achieving an average Recall@8 improvement of 4.85%, with a maximum gain of 15.91% in a single language. Furthermore, it exhibits stronger cross-lingual and cross-domain generalization compared to a monolingual model trained solely on English. Our analysis confirms the effectiveness of both proposed techniques, highlighting their critical roles in the model's improved performance.
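To make the two techniques concrete, here is a minimal sketch of how probabilistic content masking and language-aware batching might look in practice. This is an illustrative reconstruction, not the paper's implementation: the `STYLE_WORDS` list, function names, and the per-language grouping strategy are all assumptions; in the actual method, stylistic versus content words would presumably be determined from corpus statistics rather than a fixed stopword list.

```python
import random
from collections import defaultdict

# Toy stand-in for "stylistically indicative" tokens (function words,
# punctuation) that should never be masked. The paper's criterion for
# separating style from content words is likely statistical; this fixed
# list is only for illustration.
STYLE_WORDS = {"the", "a", "an", "and", "but", "of", "in", "on", "is", "was", ",", ".", "!"}

def probabilistic_content_mask(tokens, mask_prob=0.5, mask_token="[MASK]", rng=random):
    """Mask likely content words with probability `mask_prob`, leaving
    stylistic tokens intact, so the encoder cannot rely on topical words."""
    return [
        tok if tok.lower() in STYLE_WORDS or rng.random() >= mask_prob else mask_token
        for tok in tokens
    ]

def language_aware_batches(examples, batch_size):
    """Group (language, text) examples so each contrastive batch contains a
    single language; in-batch negatives then differ in style, not language,
    reducing cross-lingual interference."""
    by_lang = defaultdict(list)
    for lang, text in examples:
        by_lang[lang].append((lang, text))
    batches = []
    for items in by_lang.values():
        for i in range(0, len(items), batch_size):
            batches.append(items[i:i + batch_size])
    return batches
```

With `mask_prob=1.0`, every non-stylistic token is masked, which makes the behavior easy to inspect; during training a value below 1 would let some content through while still biasing the representation toward style.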