🤖 AI Summary
Cross-document fine-grained relation annotation data is scarce, which severely hinders research on and evaluation of automated linking methods. To address this, we propose a domain-agnostic, semi-synthetic data construction framework that integrates retrieval models with large language models (LLMs) to enable automatic evaluation, selection, and optimization of cross-document linking methods. Our key contributions are: (i) a systematic framework for selecting high-performing linking methods in previously unseen domains; and (ii) a hybrid validation pipeline -- "semi-synthetic generation → automated evaluation → human verification" -- that balances scalability and reliability. Evaluated in the peer-review and news domains, our approach achieves 78% link approval from human raters, more than doubling the precision of strong retrieval baselines. We publicly release our code, dataset, and annotation protocol to foster reproducible and scalable research in cross-document understanding.
📝 Abstract
Understanding fine-grained relations between documents is crucial for many application domains. However, the study of automated assistance is limited by the lack of efficient methods for creating training and evaluation datasets of cross-document links. To address this, we introduce a new domain-agnostic framework for selecting the best-performing approach and annotating cross-document links in a new domain from scratch. We first generate and validate semi-synthetic datasets of interconnected documents. This data is used to perform automatic evaluation, producing a shortlist of best-performing linking approaches. These approaches are then used in an extensive human evaluation study, yielding performance estimates on natural text pairs. We apply our framework in two distinct domains -- peer review and news -- and show that combining retrieval models with LLMs achieves 78% link approval from human raters, more than doubling the precision of strong retrievers alone. Our framework enables the systematic study of cross-document understanding across application scenarios, and the resulting novel datasets lay the foundation for numerous cross-document tasks such as media framing analysis and peer review. We make the code, data, and annotation protocols openly available.