MedInjection-FR: Exploring the Role of Native, Synthetic, and Translated Data in Biomedical Instruction Tuning

📅 2026-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the scarcity of high-quality French-language medical instruction data, which hinders effective instruction tuning of large language models in this domain. To bridge this gap, the authors introduce MedInjection-FR, a French biomedical dataset comprising 571,000 instruction–response pairs, and present the first systematic integration and comparative analysis of native, synthetic, and translated data for instruction tuning. Comprehensive experiments based on the Qwen-4B-Instruct model demonstrate that native data yields the best performance, while combining native and translated data provides complementary gains. Although synthetic data alone shows limited efficacy, balancing it with native data further enhances model performance. Overall, the findings show that data authenticity and diversity jointly shape model capabilities.

📝 Abstract
Instruction tuning has become essential for adapting large language models (LLMs) to follow domain-specific prompts. Yet, in specialized fields such as medicine, the scarcity of high-quality French instruction data limits effective supervision. To address this gap, we introduce MedInjection-FR, a large-scale French biomedical instruction dataset comprising 571K instruction-response pairs drawn from three complementary sources: native, synthetic, and translated data. We design a controlled experimental framework to systematically assess how data provenance affects instruction tuning, using Qwen-4B-Instruct fine-tuned across seven configurations combining these sources. Results show that native data yield the strongest performance, while mixed setups, particularly native and translated, provide complementary benefits. Synthetic data alone remains less effective but contributes positively when balanced with native supervision. Evaluation on open-ended QA combines automatic metrics, LLM-as-a-judge assessment, and human expert review; although LLM-based judgments correlate best with human ratings, they show sensitivity to verbosity. These findings highlight that data authenticity and diversity jointly shape downstream adaptation and that heterogeneous supervision can mitigate the scarcity of native French medical instructions.
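The experimental design above can be sketched in code: the seven fine-tuning configurations correspond to all non-empty subsets of the three data sources. The following is a minimal illustrative sketch; the pool contents, names, and sizes are placeholders, not the paper's actual data or splits.

```python
from itertools import combinations

# Hypothetical per-source pools of instruction-response pairs;
# sizes are illustrative only.
sources = {
    "native": ["pair_n1", "pair_n2"],
    "synthetic": ["pair_s1", "pair_s2"],
    "translated": ["pair_t1", "pair_t2"],
}

# Seven configurations = all non-empty subsets of the three sources:
# 3 singletons + 3 pairs + 1 triple.
configs = [
    subset
    for size in range(1, len(sources) + 1)
    for subset in combinations(sorted(sources), size)
]
assert len(configs) == 7

def build_training_set(subset):
    """Concatenate the pools selected by one configuration."""
    return [pair for name in subset for pair in sources[name]]

for subset in configs:
    print("+".join(subset), len(build_training_set(subset)))
```

In the paper itself, each configuration is used to fine-tune Qwen-4B-Instruct; the sketch only shows how the mixtures enumerate.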
Problem

Research questions and friction points this paper is trying to address.

instruction tuning
biomedical
French
data scarcity
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

instruction tuning
biomedical LLMs
multilingual data
data provenance
French medical NLP
Ikram Belmadani
Aix-Marseille Univ., CNRS, LIS UMR 7020, 13000 Marseille, France
Oumaima El Khettari
Nantes Univ., École Centrale Nantes, CNRS, LS2N, UMR 6004, 44000 Nantes, France
Pacôme Constant dit Beaufils
Nantes Univ., CHU Nantes, PHU 11: Santé Publique, Clinique des données, INSERM, CIC 1413, 44000 Nantes, France; Nantes Univ., CNRS, INSERM, L’institut du thorax, 44000 Nantes, France
Benoit Favre
Professor (CNU Section 27), LIS UMR 7020, Aix-Marseille University
Natural Language Processing, Spoken Language Understanding, Parsing, Machine Learning
Richard Dufour
LS2N - TALN/NLP research group - Nantes University
Natural Language Processing, Biomedical Domain, Language Modeling, Spontaneous Speech