🤖 AI Summary
Biomedical NLP suffers from a scarcity of high-quality annotated data, hindering accurate semantic relationship modeling between entities. Existing synthetic data augmentation methods rely on word-level substitution and often generate counterfactual samples that compromise semantic fidelity and relational consistency. To address this, we propose a biomedical relation-aware synthetic data augmentation framework. It introduces a relation-aware similarity metric and employs a multi-agent iterative debate mechanism to precisely identify substitution locations (WHERE) and candidate entities (WHICH); relational preservation is further ensured via reflective iterative refinement. The framework integrates biomedical ontology embeddings and domain knowledge into an end-to-end synthetic data pipeline. Evaluated on nine datasets from the BLURB and BigBIO benchmarks, it achieves average F1 improvements of 2.1–4.7 points across four tasks, including relation extraction and named entity recognition, outperforming state-of-the-art methods.
📝 Abstract
In Biomedical Natural Language Processing (BioNLP) tasks such as Relation Extraction, Named Entity Recognition, and Text Classification, the scarcity of high-quality data remains a significant challenge. This limitation prevents large language models from correctly understanding relationships between biological entities, such as molecule-disease associations or drug interactions, and can further lead to misinterpretation of biomedical documents. To address this issue, current approaches generally adopt synthetic data augmentation, which involves similarity computation followed by word replacement; however, these approaches often generate counterfactual data. As a result, they disrupt meaningful word sets or produce sentences whose meanings deviate substantially from the original context, rendering them ineffective at improving model performance. To this end, this paper proposes a rationale-based synthetic data augmentation method dedicated to the biomedical domain. Beyond naive lexical similarity, a specific bio-relation similarity is measured to ensure that each augmented instance retains a strong correlation with its biomedical relation, rather than simply increasing the diversity of the augmented data. Moreover, a multi-agent reflection mechanism helps the model iteratively distinguish different usages of similar entities, avoiding the trap of mis-replacement. We evaluate our method on the BLURB and BigBIO benchmarks, which include 9 common datasets spanning four major BioNLP tasks. Our experimental results demonstrate consistent performance improvements across all tasks, highlighting the effectiveness of our approach in addressing data scarcity and enhancing the overall performance of biomedical NLP models.
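The core idea of combining lexical similarity with a bio-relation similarity when choosing a substitution candidate can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the toy character-overlap similarity, the `relation_profile` ontology dictionary, and the mixing weight `alpha` are all assumptions standing in for the paper's embeddings, ontology, and learned metric.

```python
def lexical_similarity(a: str, b: str) -> float:
    """Toy character-overlap (Jaccard) similarity, standing in for lexicon embeddings."""
    sa, sb = set(a.lower()), set(b.lower())
    return len(sa & sb) / len(sa | sb)

def relation_similarity(entity: str, candidate: str,
                        relation_profile: dict) -> float:
    """Score how well `candidate` preserves the biomedical relations that
    `entity` participates in: entities sharing more relation types in the
    toy ontology are treated as more interchangeable."""
    r1 = relation_profile.get(entity, set())
    r2 = relation_profile.get(candidate, set())
    if not r1 or not r2:
        return 0.0
    return len(r1 & r2) / len(r1 | r2)

def pick_substitution(entity, candidates, relation_profile, alpha=0.5):
    """Choose WHICH entity to swap in by mixing lexical and relation-aware
    similarity, so the replacement keeps the original bio-relation intact."""
    def score(c):
        return (alpha * lexical_similarity(entity, c)
                + (1 - alpha) * relation_similarity(entity, c, relation_profile))
    return max(candidates, key=score)

# Toy ontology: entity -> set of relation types it participates in.
profile = {
    "aspirin":   {"treats", "inhibits"},
    "ibuprofen": {"treats", "inhibits"},
    "glucose":   {"metabolized_by"},
}
best = pick_substitution("aspirin", ["ibuprofen", "glucose"], profile)
print(best)  # ibuprofen: it shares aspirin's relation profile, glucose does not
```

A purely lexical metric could just as easily pick a surface-similar but relationally incompatible entity; weighting in the relation term is what steers the augmentation away from counterfactual substitutions.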