🤖 AI Summary
This study addresses the limited generalization of joint entity and relation extraction (JERE) models caused by low-quality training data, a problem that existing data augmentation techniques often fail to solve because they distort the semantic structure of the text. The authors propose SSDAU, a structured semantic data augmentation framework that segments input text according to entity labels and uses a context-aware encoder to extract the semantic features from which augmented samples are reconstructed. SSDAU combines contextualized embeddings with conventional similarity scores to distinguish semantically similar entities, and applies the BERTopic model to filter out irrelevant topics, preserving both semantic consistency and topical coherence in the augmented data. Evaluated with five representative JERE models on datasets with different annotation types, SSDAU outperforms seven popular data augmentation baselines and is more robust to ambiguity: its F1 score declines by only 8.26%, compared with a 31.91% drop for the baselines.
📝 Abstract
Joint Entity and Relation Extraction (JERE) models often generalize poorly when trained on low-quality data.
Data augmentation is a common strategy to enhance model generalization across different domains.
However, existing data augmentation methods often overlook text relevance and may disrupt semantic structures and dependencies, making it difficult to generate effective augmented data for improving model generalization.
In this paper, we propose Structured Semantic Data Augmentation (SSDAU), a novel method designed to preserve the semantic structure of text during augmentation.
SSDAU segments text based on entity labels and employs a context-aware encoder to capture the semantic features of entities.
It then performs entity semantic restructuring to generate augmented data.
To distinguish semantically similar entities, SSDAU fuses contextualized embeddings with traditional similarity scores.
To mitigate potential topic ambiguity and information loss, we apply the BERTopic model to filter out irrelevant topics, ensuring topic consistency.
We evaluate SSDAU on datasets with different annotation types and compare its performance on five representative JERE models against seven popular data augmentation baselines.
Experiments demonstrate that SSDAU generates semantically consistent data with superior robustness against ambiguity (8.26% F1 decrease vs. 31.91% for baselines), significantly outperforming all existing methods across all metrics.
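
The abstract does not specify how contextualized embeddings are fused with traditional similarity scores, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes the fusion is a weighted sum of embedding cosine similarity and a character-level string similarity; the function names, the weight `alpha`, and the use of `difflib.SequenceMatcher` as the "traditional" score are all hypothetical choices made for illustration.

```python
import math
from difflib import SequenceMatcher


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def surface_similarity(a, b):
    """Case-insensitive string similarity in [0, 1]; stands in for a 'traditional' score."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fused_similarity(text_a, emb_a, text_b, emb_b, alpha=0.5):
    """Weighted sum of contextual and surface similarity.

    alpha is a hypothetical mixing weight (assumption, not from the paper):
    alpha=1.0 uses only the embeddings, alpha=0.0 only the surface form.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    contextual = cosine_similarity(emb_a, emb_b)
    surface = surface_similarity(text_a, text_b)
    return alpha * contextual + (1.0 - alpha) * surface


if __name__ == "__main__":
    # Toy, hand-made vectors standing in for contextual encoder outputs.
    apple_company = [0.9, 0.1, 0.0]
    apple_fruit = [0.1, 0.9, 0.2]
    score = fused_similarity("Apple", apple_company, "apple", apple_fruit, alpha=0.7)
    print(f"fused similarity: {score:.3f}")
```

In this sketch, two mentions with the same surface form but different contexts receive a lower fused score than a surface-only comparison would give them, which is the kind of disambiguation the abstract describes. In practice the embeddings would come from a context-aware encoder rather than hand-written vectors.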