MedInjection-FR: Exploring the Role of Native, Synthetic, and Translated Data in Biomedical Instruction Tuning

📅 2026-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the scarcity of high-quality French-language medical instruction data, which hinders effective instruction tuning of large language models in this domain. To bridge this gap, the authors introduce MedInjection-FR, a French biomedical dataset comprising 571,000 instruction–response pairs, and present the first systematic integration and comparative analysis of native, synthetic, and translated data for instruction tuning. Comprehensive experiments based on the Qwen-4B-Instruct model demonstrate that native data yields the best performance, while combining native and translated data provides complementary gains. Although synthetic data alone shows limited efficacy, balancing it with native data further enhances model performance. Overall, the findings show that data authenticity and diversity jointly shape model capabilities.

📝 Abstract
Instruction tuning has become essential for adapting large language models (LLMs) to follow domain-specific prompts. Yet, in specialized fields such as medicine, the scarcity of high-quality French instruction data limits effective supervision. To address this gap, we introduce MedInjection-FR, a large-scale French biomedical instruction dataset comprising 571K instruction-response pairs drawn from three complementary sources: native, synthetic, and translated data. We design a controlled experimental framework to systematically assess how data provenance affects instruction tuning, using Qwen-4B-Instruct fine-tuned across seven configurations combining these sources. Results show that native data yield the strongest performance, while mixed setups, particularly native and translated, provide complementary benefits. Synthetic data alone remains less effective but contributes positively when balanced with native supervision. Evaluation on open-ended QA combines automatic metrics, LLM-as-a-judge assessment, and human expert review; although LLM-based judgments correlate best with human ratings, they show sensitivity to verbosity. These findings highlight that data authenticity and diversity jointly shape downstream adaptation and that heterogeneous supervision can mitigate the scarcity of native French medical instructions.
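The experimental design above can be sketched in code: the seven fine-tuning configurations correspond to all non-empty subsets of the three data sources. The following is a minimal illustrative sketch; the pool contents, names, and sizes are placeholders, not the paper's actual data or splits.

```python
from itertools import combinations

# Hypothetical per-source pools of instruction-response pairs;
# sizes are illustrative only.
sources = {
    "native": ["pair_n1", "pair_n2"],
    "synthetic": ["pair_s1", "pair_s2"],
    "translated": ["pair_t1", "pair_t2"],
}

# Seven configurations = all non-empty subsets of the three sources:
# 3 singletons + 3 pairs + 1 triple.
configs = [
    subset
    for size in range(1, len(sources) + 1)
    for subset in combinations(sorted(sources), size)
]
assert len(configs) == 7

def build_training_set(subset):
    """Concatenate the pools selected by one configuration."""
    return [pair for name in subset for pair in sources[name]]

for subset in configs:
    print("+".join(subset), len(build_training_set(subset)))
```

In the paper itself, each configuration is used to fine-tune Qwen-4B-Instruct; the sketch only shows how the mixtures enumerate.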
Problem

Research questions and friction points this paper is trying to address.

instruction tuning
biomedical
French
data scarcity
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

instruction tuning
biomedical LLMs
multilingual data
data provenance
French medical NLP
Ikram Belmadani
Aix-Marseille Univ., CNRS, LIS UMR 7020, 13000 Marseille, France
Oumaima El Khettari
Nantes Univ., École Centrale Nantes, CNRS, LS2N, UMR 6004, 44000 Nantes, France
Pacôme Constant dit Beaufils
Nantes Univ., CHU Nantes, PHU 11: Santé Publique, Clinique des données, INSERM, CIC 1413, 44000 Nantes, France; Nantes Univ., CNRS, INSERM, L’institut du thorax, 44000 Nantes, France
Benoit Favre
Professor (CNU Section 27), LIS UMR 7020, Aix-Marseille University
Natural Language Processing, Spoken Language Understanding, Parsing, Machine Learning
Richard Dufour
LS2N - TALN/NLP research group - Nantes University
Natural Language Processing, Biomedical Domain, Language Modeling, Spontaneous Speech