🤖 AI Summary
Cross-document fine-grained relation annotation data is scarce, which severely hinders research on and evaluation of automated linking methods. To address this, we propose a domain-agnostic, semi-synthetic data construction framework that integrates retrieval models with large language models (LLMs) to enable automatic evaluation, selection, and optimization of cross-document linking methods. Our key contributions are: (i) a systematic framework for selecting high-performing linking methods in previously unseen domains; and (ii) a hybrid validation pipeline -- "semi-synthetic generation → automated evaluation → human verification" -- that balances scalability and reliability. Evaluated in the peer-review and news domains, our approach achieves 78% link approval from human raters, more than doubling the precision of strong retrieval baselines. We publicly release our code, dataset, and annotation protocol to foster reproducible and scalable research in cross-document understanding.
📝 Abstract
Understanding fine-grained relations between documents is crucial for many application domains. However, the study of automated assistance is limited by the lack of efficient methods for creating training and evaluation datasets of cross-document links. To address this, we introduce a new domain-agnostic framework for selecting the best-performing approach and annotating cross-document links in a new domain from scratch. We first generate and validate semi-synthetic datasets of interconnected documents. This data is used to perform automatic evaluation, producing a shortlist of best-performing linking approaches. These approaches are then used in an extensive human evaluation study, yielding performance estimates on natural text pairs. We apply our framework in two distinct domains -- peer review and news -- and show that combining retrieval models with LLMs achieves 78% link approval from human raters, more than doubling the precision of strong retrievers alone. Our framework enables the systematic study of cross-document understanding across application scenarios, and the resulting novel datasets lay the foundation for numerous cross-document tasks such as media framing analysis and peer review. We make the code, data, and annotation protocols openly available.