Learning Domain Knowledge in Multimodal Large Language Models through Reinforcement Fine-Tuning

πŸ“… 2026-01-23
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the limited performance of multimodal large language models (MLLMs) in specialized domains such as remote sensing and medical imaging, where merely injecting domain knowledge through textual prompts proves insufficient. To overcome this limitation, the authors propose a reinforcement-based fine-tuning framework that, for the first time, encodes domain-specific knowledge directly into the model’s training objective as hierarchical constraints and reward signals, thereby enabling internalization of expert knowledge. This approach transcends the conventional paradigm of input-level knowledge injection and achieves state-of-the-art performance across multiple multimodal benchmarks in both remote sensing and medical domains, demonstrating significant improvements over existing methods.

πŸ“ Abstract
Multimodal large language models (MLLMs) have shown remarkable capabilities in multimodal perception and understanding tasks. However, their effectiveness in specialized domains, such as remote sensing and medical imaging, remains limited. A natural approach to domain adaptation is to inject domain knowledge through textual instructions, prompts, or auxiliary captions. Surprisingly, we find that such input-level domain knowledge injection yields little to no improvement on scientific multimodal tasks, even when the domain knowledge is explicitly provided. This observation suggests that current MLLMs fail to internalize domain-specific priors through language alone, and that domain knowledge must be integrated at the optimization level. Motivated by this insight, we propose a reinforcement fine-tuning framework that incorporates domain knowledge directly into the learning objective. Instead of treating domain knowledge as descriptive information, we encode it as domain-informed constraints and reward signals, shaping the model's behavior in the output space. Extensive experiments across multiple datasets in remote sensing and medical domains consistently demonstrate substantial performance gains, achieving state-of-the-art results on multimodal domain tasks. Our results highlight the necessity of optimization-level domain knowledge integration and reveal a fundamental limitation of textual domain conditioning in current MLLMs.
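The abstract describes encoding domain knowledge as constraints and reward signals in the fine-tuning objective rather than as input text. The paper does not publish its reward definition here, so the following is only a minimal illustrative sketch, under the assumption that the reward combines task correctness with expert-defined output constraints; all names, weights, and the example constraint are hypothetical.

```python
# Hypothetical sketch of a domain-informed reward for reinforcement
# fine-tuning. Names, weights, and constraints are illustrative
# assumptions, not taken from the paper.

def domain_informed_reward(prediction: str,
                           reference: str,
                           domain_constraints: list) -> float:
    """Score a model output on task correctness plus satisfaction of
    domain-knowledge constraints, so expert knowledge shapes the
    optimization objective instead of only the input prompt."""
    # Task-level reward: exact-match correctness on the answer.
    correct = prediction.strip().lower() == reference.strip().lower()
    correctness = 1.0 if correct else 0.0

    # Domain-level reward: fraction of expert constraints satisfied.
    # Each constraint is a predicate over the output string.
    if domain_constraints:
        satisfied = sum(1 for check in domain_constraints if check(prediction))
        constraint_score = satisfied / len(domain_constraints)
    else:
        constraint_score = 0.0

    # Weighted combination (the 0.5/0.5 split is an assumption).
    return 0.5 * correctness + 0.5 * constraint_score


# Example: a remote-sensing constraint requiring a land-cover answer
# drawn from a closed expert vocabulary (vocabulary is hypothetical).
valid_classes = {"urban", "forest", "water", "farmland"}
constraints = [lambda out: out.strip().lower() in valid_classes]

print(domain_informed_reward("Forest", "forest", constraints))  # -> 1.0
print(domain_informed_reward("desert", "forest", constraints))  # -> 0.0
```

In a policy-gradient fine-tuning loop, such a scalar reward would be computed per sampled output and used to weight the policy update, which is how output-space constraints differ from prepending the same knowledge as prompt text.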
Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models
Domain Knowledge
Reinforcement Fine-Tuning
Remote Sensing
Medical Imaging
Innovation

Methods, ideas, or system contributions that make the work stand out.

reinforcement fine-tuning
domain knowledge integration
multimodal large language models
optimization-level adaptation
domain-informed constraints
Qinglong Cao
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China

Yuntian Chen
Eastern Institute of Technology, Ningbo (EIT)
Knowledge Discovery · Fluid Mechanics · Energy · AI4S · Scientific Machine Learning

Chao Ma
Professor, Shanghai Jiao Tong University
Computer Vision · Machine Learning · Image Processing

Xiaokang Yang
MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China