GerAV: Towards New Heights in German Authorship Verification using Fine-Tuned LLMs on a New Benchmark

📅 2026-01-20
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the lack of large-scale benchmarks and systematic evaluation protocols for author verification in German. We introduce GerAV, the first multi-source, cross-domain author verification benchmark for German, comprising over 600,000 labeled text pairs from Twitter and Reddit. Leveraging fine-tuned large language models, we conduct both supervised and zero-shot experiments, revealing a trade-off between model specialization and generalization. A hybrid training strategy is employed to enhance cross-scenario performance. Our best model achieves a 0.09 improvement in F1 score over existing baselines, and in zero-shot settings, it outperforms GPT-5 by 0.08, demonstrating GerAV's effectiveness and its capacity to pose meaningful challenges for future research in German author verification.

πŸ“ Abstract
Authorship verification (AV) is the task of determining whether two texts were written by the same author and has been studied extensively, predominantly for English data. In contrast, large-scale benchmarks and systematic evaluations for other languages remain scarce. We address this gap by introducing GerAV, a comprehensive benchmark for German AV comprising over 600k labeled text pairs. GerAV is built from Twitter and Reddit data, with the Reddit part further divided into in-domain and cross-domain message-based subsets, as well as a profile-based subset. This design enables controlled analysis of the effects of data source, topical domain, and text length. Using the provided training splits, we conduct a systematic evaluation of strong baselines and state-of-the-art models and find that our best approach, a fine-tuned large language model, outperforms recent baselines by up to 0.09 absolute F1 score and surpasses GPT-5 in a zero-shot setting by 0.08. We further observe a trade-off between specialization and generalization: models trained on specific data types perform best under matching conditions but generalize less well across data regimes, a limitation that can be mitigated by combining training sources. Overall, GerAV provides a challenging and versatile benchmark for advancing research on German and cross-domain AV.
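To make the task framing concrete, here is a minimal sketch of AV as binary pair classification, scored with F1 as in the benchmark. This is not the paper's fine-tuned LLM; it uses a hypothetical character n-gram cosine-similarity baseline with a fixed decision threshold, purely for illustration.

```python
from collections import Counter
from math import sqrt

def char_ngrams(text: str, n: int = 3) -> Counter:
    # Character n-gram profile, a classic lightweight authorship feature.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse n-gram count vectors.
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def verify(t1: str, t2: str, threshold: float = 0.5) -> int:
    # AV decision: 1 = same author, 0 = different authors.
    return int(cosine(char_ngrams(t1), char_ngrams(t2)) >= threshold)

def f1(gold: list[int], pred: list[int]) -> float:
    # F1 on the positive (same-author) class, as reported in the paper.
    tp = sum(g == p == 1 for g, p in zip(gold, pred))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

A stronger system would replace `verify` with a learned model; the evaluation loop over labeled text pairs and the F1 computation stay the same.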
Problem

Research questions and friction points this paper is trying to address.

Authorship Verification
German
Benchmark
Cross-domain
Large Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

authorship verification
German NLP
fine-tuned LLMs
cross-domain benchmark
large-scale dataset