AI Summary
This study addresses the scarcity of high-quality German medical corpora and the performance gap between small domain-specific models and much larger general-purpose ones. To narrow this gap, the authors construct FineMed-de, a high-quality German medical corpus derived from FineWeb2, and apply continual pre-training followed by model merging to three mainstream large language models with 7B to 24B parameters, yielding the DeFineMed model family. The work establishes a compliant and reproducible domain adaptation pipeline and shows that compact specialized models can compete with much larger general models in German medical NLP. Experiments show that specialization substantially improves 7B model performance on German medical benchmarks; for the Qwen2.5-based models, domain adaptation raises the pairwise win rate against the much larger Mistral-Small-24B-Instruct by a factor of approximately 3.5, supporting the feasibility of resource-efficient specialized models. The study also finds that while model merging restores instruction-following capabilities, it introduces trade-offs such as language mixing and increased verbosity.
Abstract
This paper narrows the performance gap between small, specialized models and significantly larger general-purpose models through domain adaptation via continual pre-training and merging. We address the scarcity of specialized non-English data by constructing a high-quality German medical corpus (FineMed-de) from FineWeb2. This corpus is used to continually pre-train and merge three well-known LLMs (ranging from $7B$ to $24B$ parameters), creating the DeFineMed model family. A comprehensive evaluation confirms that specialization dramatically enhances $7B$ model performance on German medical benchmarks. Furthermore, the pairwise win-rate analysis of the Qwen2.5-based models demonstrates an approximately $3.5$-fold increase in the win-rate against the much larger Mistral-Small-24B-Instruct through domain adaptation. This evidence positions specialized $7B$ models as a competitive, resource-efficient solution for complex medical instruction-following tasks. While model merging successfully restores instruction-following abilities, a subsequent failure mode analysis reveals inherent trade-offs, including the introduction of language mixing and increased verbosity, highlighting the need for more targeted fine-tuning in future work. This research provides a robust, compliant methodology for developing specialized LLMs, serving as the foundation for practical use in German-speaking healthcare contexts.
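To make the win-rate claim concrete, the following is a minimal sketch of how a pairwise win rate and its fold increase could be computed. The function names, the convention of counting a tie as half a win, and all judgment counts are illustrative assumptions, not the paper's evaluation protocol or its data.

```python
from collections import Counter


def win_rate(outcomes):
    """Fraction of pairwise comparisons won by the candidate model.

    Assumption: a tie counts as half a win. The paper's actual tie
    handling is not stated in the abstract.
    """
    if not outcomes:
        raise ValueError("at least one comparison is required")
    counts = Counter(outcomes)
    return (counts["win"] + 0.5 * counts["tie"]) / len(outcomes)


def fold_increase(base_rate, adapted_rate):
    """Ratio of the adapted model's win rate to the base model's win rate."""
    if base_rate <= 0:
        raise ValueError("base win rate must be positive")
    return adapted_rate / base_rate


# Hypothetical judgments against a fixed larger opponent (not real results).
base_outcomes = ["win"] * 10 + ["loss"] * 90      # base 7B model
adapted_outcomes = ["win"] * 35 + ["loss"] * 65   # domain-adapted 7B model

base_rate = win_rate(base_outcomes)
adapted_rate = win_rate(adapted_outcomes)
print(f"base win rate:    {base_rate:.2f}")
print(f"adapted win rate: {adapted_rate:.2f}")
print(f"fold increase:    {fold_increase(base_rate, adapted_rate):.2f}")
```

With these made-up counts, the adapted model's win rate is 3.5 times the base model's yet still below 50%. A fold increase measures improvement relative to the unadapted model and does not by itself imply that the smaller model wins the majority of comparisons.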