🤖 AI Summary
This study addresses two cross-domain information extraction tasks: insomnia symptom detection in clinical texts and food safety incident extraction in news articles. Methodologically, we propose a domain-aware Transformer framework built upon RoBERTa-style encoders, integrating three key innovations: (1) GPT-4–driven large language model (LLM) data augmentation, (2) task-adaptive input construction, and (3) domain-adaptive fine-tuning. These components collectively enhance domain-specific semantic modeling and few-shot generalization over generic models. Evaluated on SMM4H-HeaRD 2025 Task 5 Subtask 1 (food safety incident extraction), our approach achieves an F1 score of 0.958—ranking first—and attains state-of-the-art performance on the insomnia detection subtask. Our work empirically validates the efficacy of synergistic LLM augmentation and domain-informed encoding, establishing a reusable technical paradigm for low-resource information extraction in healthcare and public safety domains.
📝 Abstract
This paper presents our system for the SMM4H-HeaRD 2025 shared tasks, specifically Task 4 (Subtasks 1, 2a, and 2b) and Task 5 (Subtasks 1 and 2). Task 4 focused on detecting mentions of insomnia in clinical notes, while Task 5 addressed the extraction of food safety events from news articles. We participated in all subtasks and report key findings across them, with particular emphasis on Task 5 Subtask 1, where our system achieved strong performance-securing first place with an F1 score of 0.958 on the test set. To attain this result, we employed encoder-based models (e.g., RoBERTa), alongside GPT-4 for data augmentation. This paper outlines our approach, including preprocessing, model architecture, and subtask-specific adaptations