🤖 AI Summary
Cross-genre authorship attribution (AA) faces the core challenge of modeling topic-agnostic, author-specific linguistic patterns rather than relying on genre- or domain-dependent cues. To address this, we propose a two-stage retrieve-and-rerank framework built on large language models: (1) coarse-grained candidate-author retrieval, followed by (2) fine-grained reranking via contrastive learning with carefully constructed hard negatives that steer the model toward stylistic authorship signals and away from topical ones. Our key contribution is a topic-decoupled data construction strategy that explicitly suppresses reliance on topic-related features. Evaluated on the HIATUS benchmarks (HRS1/HRS2), our method improves Success@8 by 22.3 and 34.4 absolute points, respectively, significantly outperforming prior state-of-the-art approaches. These results demonstrate the effectiveness and robustness of our paradigm for cross-genre AA.
📝 Abstract
Authorship attribution (AA) is the task of identifying the most likely author of a query document from a predefined set of candidate authors. We introduce a two-stage retrieve-and-rerank framework that fine-tunes LLMs for cross-genre AA. Although retrieve-and-rerank is a de facto standard strategy in information retrieval (IR), cross-genre AA imposes a different constraint: systems must avoid relying on topical cues and instead learn author-specific linguistic patterns that are independent of a text's subject matter (genre, domain, or topic). Consequently, we demonstrate that reranker training strategies commonly used in IR are fundamentally misaligned with cross-genre AA and lead to suboptimal behavior. To address this, we introduce a targeted data curation strategy that enables the reranker to learn author-discriminative signals effectively. Using our LLM-based retrieve-and-rerank pipeline, we achieve substantial gains of 22.3 and 34.4 absolute Success@8 points over the previous state of the art on HIATUS's challenging HRS1 and HRS2 cross-genre AA benchmarks.
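The abstract's central idea, decoupling authorship signal from topic during contrastive training, can be illustrated with a minimal sketch. The snippet below is an assumption about one plausible form of the data construction (the paper's exact recipe is not given here): each anchor document is paired with a positive from the *same* author in a *different* topic, and a hard negative from a *different* author in the *same* topic, so topical similarity cannot shortcut the contrastive objective. All class and function names are illustrative.

```python
# Illustrative sketch (not the authors' code): topic-decoupled triplet
# construction for contrastive reranker training. Positive = same author,
# different topic; hard negative = different author, same topic.
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    author: str
    topic: str
    text: str


def build_triplets(corpus):
    """Return (anchor, positive, hard_negative) triplets in which the
    negative shares the anchor's topic but not its author, and the
    positive shares the author but not the topic."""
    triplets = []
    for anchor in corpus:
        positives = [d for d in corpus
                     if d.author == anchor.author and d.topic != anchor.topic]
        negatives = [d for d in corpus
                     if d.author != anchor.author and d.topic == anchor.topic]
        for pos in positives:
            for neg in negatives:
                triplets.append((anchor, pos, neg))
    return triplets


# Tiny toy corpus: only the anchor ("alice", "sports") has both a
# cross-topic positive and a same-topic negative available.
corpus = [
    Doc("alice", "sports", "a1"),
    Doc("alice", "politics", "a2"),
    Doc("bob", "sports", "b1"),
]
trips = build_triplets(corpus)
```

A triplet margin or InfoNCE loss over such triplets would then reward embeddings that separate authors while ignoring the shared topic, which is the behavior the abstract attributes to its reranker training.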