🤖 AI Summary
Cross-genre authorship attribution (AA) faces the core challenge of modeling topic-agnostic, author-specific linguistic patterns rather than relying on genre- or domain-dependent cues. To address this, we propose a two-stage retrieve-and-rerank framework built on large language models: (1) coarse-grained candidate-author retrieval, followed by (2) fine-grained reranking via contrastive learning with carefully constructed hard negatives that steer the model toward stylistic authorship signals and away from topical ones. Our key contribution is a topic-decoupled data construction strategy that explicitly suppresses reliance on topic-related features. Evaluated on the HIATUS benchmarks (HRS1/HRS2), our method improves Success@8 by 22.3 and 34.4 absolute points, respectively, significantly outperforming prior state-of-the-art approaches. These results demonstrate the effectiveness and robustness of our paradigm for cross-genre AA.
📝 Abstract
Authorship attribution (AA) is the task of identifying the most likely author of a query document from a predefined set of candidate authors. We introduce a two-stage retrieve-and-rerank framework that fine-tunes LLMs for cross-genre AA. Although retrieve-and-rerank is a de facto standard strategy in information retrieval (IR), cross-genre AA imposes a different constraint: systems must avoid relying on topical cues and instead learn author-specific linguistic patterns that are independent of a text's subject matter (genre, domain, or topic). Consequently, we demonstrate that reranker training strategies commonly used in IR are fundamentally misaligned with cross-genre AA and lead to suboptimal behavior. To address this, we introduce a targeted data curation strategy that enables the reranker to learn author-discriminative signals effectively. Using our LLM-based retrieve-and-rerank pipeline, we achieve substantial gains of 22.3 and 34.4 absolute Success@8 points over the previous state of the art on HIATUS's challenging HRS1 and HRS2 cross-genre AA benchmarks.
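The abstract's central idea, decoupling authorship signal from topic during contrastive training, can be illustrated with a minimal sketch. The snippet below is an assumption about one plausible form of the data construction (the paper's exact recipe is not given here): each anchor document is paired with a positive from the *same* author in a *different* topic, and a hard negative from a *different* author in the *same* topic, so topical similarity cannot shortcut the contrastive objective. All class and function names are illustrative.

```python
# Illustrative sketch (not the authors' code): topic-decoupled triplet
# construction for contrastive reranker training. Positive = same author,
# different topic; hard negative = different author, same topic.
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    author: str
    topic: str
    text: str


def build_triplets(corpus):
    """Return (anchor, positive, hard_negative) triplets in which the
    negative shares the anchor's topic but not its author, and the
    positive shares the author but not the topic."""
    triplets = []
    for anchor in corpus:
        positives = [d for d in corpus
                     if d.author == anchor.author and d.topic != anchor.topic]
        negatives = [d for d in corpus
                     if d.author != anchor.author and d.topic == anchor.topic]
        for pos in positives:
            for neg in negatives:
                triplets.append((anchor, pos, neg))
    return triplets


# Tiny toy corpus: only the anchor ("alice", "sports") has both a
# cross-topic positive and a same-topic negative available.
corpus = [
    Doc("alice", "sports", "a1"),
    Doc("alice", "politics", "a2"),
    Doc("bob", "sports", "b1"),
]
trips = build_triplets(corpus)
```

A triplet margin or InfoNCE loss over such triplets would then reward embeddings that separate authors while ignoring the shared topic, which is the behavior the abstract attributes to its reranker training.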