🤖 AI Summary
Medical large language models (LLMs) often exhibit limited generalization and robustness when deployed on unseen clinical tasks. Method: We propose a structured reasoning trajectory–based data curation strategy that enables models to dynamically adapt inference path length to downstream tasks, facilitating unsupervised self-calibration of the reasoning process. The model is trained via supervised fine-tuning (SFT) on a large-scale multimodal medical dataset comprising 8 million samples and 6.8 billion response tokens, explicitly covering diverse reasoning lengths. Contribution/Results: The resulting multimodal medical reasoning model achieves state-of-the-art (SOTA) performance among open-source models on multiple out-of-distribution benchmarks. It demonstrates significantly improved cross-task generalization and enhanced clinical applicability, validating the efficacy of adaptive, length-variable reasoning trajectories in medical AI systems.
📝 Abstract
High-quality and carefully curated data is a cornerstone of training medical large language models, as it directly impacts both generalization and robustness to unseen clinical tasks. We investigate strategies for training and data curation to develop a robust multimodal reasoning model in the medical domain. Our work focuses on supervised fine-tuning (SFT) and explores data recipes that leverage structured reasoning traces. Using our proposed data recipe, we scale experiments to a dataset of over 8 million examples and 6.8 billion response tokens, achieving state-of-the-art performance among open-source models across diverse out-of-distribution medical benchmark tasks. Our results further indicate that curating a high-quality, diverse training dataset with varying structured reasoning trace lengths enables the fine-tuned model to self-calibrate its reasoning trajectory lengths based on the downstream task, without explicit supervision. We present key insights, describe the data curation strategy, and outline next steps toward developing robust medical vision-language reasoning systems.
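The abstract's central recipe is curating SFT data so that structured reasoning traces span diverse lengths. A minimal sketch of one way such length-bucketed curation could look (the field name `trace_tokens` and the bucket boundaries are illustrative assumptions, not the paper's actual schema or thresholds):

```python
from collections import defaultdict

def curate_by_trace_length(samples, boundaries=(128, 512, 2048)):
    """Bucket SFT samples by reasoning-trace token count so the final
    training mix explicitly covers short, medium, and long trajectories.

    `samples` is a list of dicts carrying a 'trace_tokens' count
    (hypothetical field; the paper does not specify its data schema).
    """
    buckets = defaultdict(list)
    for s in samples:
        n = s["trace_tokens"]
        # Assign the sample to the first bucket whose boundary it fits under.
        for i, b in enumerate(boundaries):
            if n <= b:
                buckets[i].append(s)
                break
        else:
            # Longer than every boundary: overflow (longest) bucket.
            buckets[len(boundaries)].append(s)
    return buckets

# Toy stand-ins for (question, reasoning trace, answer) triples.
toy = [{"id": i, "trace_tokens": n}
       for i, n in enumerate([50, 300, 900, 4000, 100])]
buckets = curate_by_trace_length(toy)
mix = {k: len(v) for k, v in sorted(buckets.items())}
print(mix)  # → {0: 2, 1: 1, 2: 1, 3: 1}
```

A curator could then rebalance or subsample per bucket so no single trace length dominates the 8M-example mix, which is the property the abstract credits for the model's self-calibrated reasoning lengths.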