🤖 AI Summary
Existing benchmarks for reference extraction and parsing are largely confined to well-formatted English bibliographies, and they struggle with the multilingualism, footnote-embedded citations, abbreviations, and diverse historical citation styles prevalent in the social sciences and humanities. This work introduces the first unified evaluation benchmark tailored to this domain, comprising three real-world datasets. It systematically assesses the performance of large language models—including DeepSeek-V3.1, Mistral-Small, Gemma-3, and Qwen3-VL—on reference extraction, parsing, and end-to-end document processing, with GROBID serving as a strong supervised baseline. The study further proposes a hybrid deployment strategy combining lightweight LoRA fine-tuning with task routing, which achieves near-saturated extraction performance on medium-sized models and significantly improves parsing accuracy and system robustness in complex citation scenarios.
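The task-routing idea from the summary can be sketched as a simple dispatcher: cheap document features decide whether a PDF stays on the supervised pipeline or escalates to an LLM. The feature names and thresholds below are illustrative assumptions, not values from the paper.

```python
from dataclasses import dataclass

@dataclass
class DocProfile:
    """Lightweight features computed before full processing (hypothetical)."""
    primary_language: str       # e.g. "en", "de"
    footnote_ratio: float       # fraction of citations appearing in footnotes
    has_end_bibliography: bool  # clean end-of-document reference section?

def route(doc: DocProfile) -> str:
    """Pick a backend for this document.

    Well-structured, English, bibliography-at-end PDFs stay on the fast
    supervised pipeline (GROBID); multilingual or footnote-heavy documents
    escalate to a task-adapted LLM. Threshold 0.1 is an assumption.
    """
    if (doc.primary_language == "en"
            and doc.has_end_bibliography
            and doc.footnote_ratio < 0.1):
        return "grobid"
    return "llm"
```

In practice the profile itself could come from a fast first pass (language ID plus layout heuristics), so routing adds little overhead to the common case.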
📝 Abstract
Bibliographic reference extraction and parsing are foundational for citation indexing, linking, and downstream scholarly knowledge-graph construction. However, most established evaluations focus on clean, English, end-of-document bibliographies, and therefore underrepresent the Social Sciences and Humanities (SSH), where citations are frequently multilingual, embedded in footnotes, abbreviated, and shaped by heterogeneous historical conventions. We present a unified benchmark that targets these SSH-realistic conditions across three complementary datasets: CEX (English journal articles spanning multiple disciplines), EXCITE (German/English documents with end-section, footnote-only, and mixed regimes), and LinkedBooks (humanities references with strong stylistic variation and multilinguality). We evaluate three tasks of increasing difficulty -- reference extraction, reference parsing, and end-to-end document parsing -- under a schema-constrained setup that enables direct comparison between a strong supervised pipeline baseline (GROBID) and contemporary LLMs (DeepSeek-V3.1, Mistral-Small-3.2-24B, Gemma-3-27B-it, and Qwen3-VL (4B-32B variants)). Across datasets, extraction largely saturates beyond a moderate capability threshold, while parsing and end-to-end parsing remain the primary bottlenecks due to structured-output brittleness under noisy layouts. We further show that lightweight LoRA adaptation yields consistent gains -- especially on SSH-heavy benchmarks -- and that segmentation/pipelining can substantially improve robustness. Finally, we argue for hybrid deployment via routing: leveraging GROBID for well-structured, in-distribution PDFs while escalating multilingual and footnote-heavy documents to task-adapted LLMs.
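The "schema-constrained setup" mentioned above can be sketched as validating each model output against a fixed field set before accepting it, which is also where the structured-output brittleness shows up: malformed or off-schema JSON is rejected rather than silently scored. The field names here are illustrative assumptions; the benchmark's actual schema may differ.

```python
import json
from typing import Optional

# Hypothetical target schema for one parsed reference.
REFERENCE_FIELDS = {"authors", "title", "container", "year", "pages"}

def validate_reference(raw: str) -> Optional[dict]:
    """Parse a model's raw output; return None for any schema violation."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None  # brittleness case: truncated or malformed JSON
    if not isinstance(obj, dict) or not set(obj) <= REFERENCE_FIELDS:
        return None  # unexpected keys also count as violations
    return obj

# A well-formed response passes; a truncated one is rejected.
ok = validate_reference(
    '{"authors": ["Mommsen, T."], "title": "Römische Geschichte", "year": "1854"}'
)
bad = validate_reference('{"authors": ["Mommsen"')
```

Under this regime, parsing accuracy reflects both field correctness and the model's ability to keep its output well-formed on noisy inputs.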