Relation Extraction or Pattern Matching? Unravelling the Generalisation Limits of Language Models for Biographical RE

📅 2025-05-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the generalisation capability of relation extraction (RE) models across datasets, revealing their tendency to rely on dataset-specific spurious correlations rather than robust semantic patterns, which leads to severe performance degradation on unseen data. The authors propose a cross-dataset evaluation framework to systematically compare fine-tuning, few-shot in-context learning (ICL), and zero-shot baselines, and introduce a data-quality diagnostic method. Key findings: (1) high in-distribution performance inversely correlates with cross-dataset generalisation, indicating overfitting to dataset artefacts; (2) data quality, not lexical similarity, is the primary determinant of transfer success; (3) fine-tuning excels on high-quality data, whereas ICL is more robust under label noise, and some zero-shot results even surpass cross-dataset fine-tuning; (4) structural flaws in prevailing benchmarks, including single-relation-per-sample constraints and inconsistent negative-instance definitions, severely hinder generalisation. The work provides both theoretical insights and practical guidelines for evaluating and training robust RE models.
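The cross-dataset protocol the summary describes boils down to scoring a model trained (or prompted) on dataset A against the test split of dataset B. A minimal sketch of the usual RE scoring step is below; the `no_relation` label name and the convention of excluding the negative class from micro-F1 are assumptions on my part, since, as the paper notes, benchmarks define the negative class inconsistently.

```python
NEGATIVE = "no_relation"  # assumed negative-class label; benchmarks differ on this

def micro_f1(gold, pred, negative=NEGATIVE):
    """Micro-averaged F1 over positive relations only.

    Correct predictions of the negative class count neither as true
    positives nor as false positives, which is the common (but not
    universal) convention in RE evaluation.
    """
    tp = sum(1 for g, p in zip(gold, pred) if g == p and g != negative)
    pred_pos = sum(1 for p in pred if p != negative)  # predicted positives
    gold_pos = sum(1 for g in gold if g != negative)  # gold positives
    precision = tp / pred_pos if pred_pos else 0.0
    recall = tp / gold_pos if gold_pos else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

# Toy illustration of the in-distribution vs. cross-dataset comparison:
gold = ["born_in", "no_relation", "spouse", "born_in"]
pred = ["born_in", "spouse", "spouse", "no_relation"]
score = micro_f1(gold, pred)  # 2 true positives, 3 predicted, 3 gold
```

Running the same scorer on predictions from a model fine-tuned on A but evaluated on B's test set, versus B's own in-distribution predictions, gives the transfer gap the paper measures.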

📝 Abstract
Analysing the generalisation capabilities of relation extraction (RE) models is crucial for assessing whether they learn robust relational patterns or rely on spurious correlations. Our cross-dataset experiments find that RE models struggle with unseen data, even within similar domains. Notably, higher intra-dataset performance does not indicate better transferability, instead often signaling overfitting to dataset-specific artefacts. Our results also show that data quality, rather than lexical similarity, is key to robust transfer, and the choice of optimal adaptation strategy depends on the quality of data available: while fine-tuning yields the best cross-dataset performance with high-quality data, few-shot in-context learning (ICL) is more effective with noisier data. However, even in these cases, zero-shot baselines occasionally outperform all cross-dataset results. Structural issues in RE benchmarks, such as single-relation per sample constraints and non-standardised negative class definitions, further hinder model transferability.
Problem

Research questions and friction points this paper is trying to address.

Assessing generalisation limits of RE models on biographical data
Evaluating impact of data quality on model transferability
Identifying structural issues in RE benchmarks hindering performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-dataset experiments reveal RE models' generalisation limits
Data quality, not lexical similarity, drives robust transfer
Few-shot ICL outperforms fine-tuning with noisy data