MilChat: Introducing Chain of Thought Reasoning and GRPO to a Multimodal Small Language Model for Remote Sensing

📅 2025-05-12
🤖 AI Summary
Problem: Existing general-purpose multimodal large language models (MLLMs) show low accuracy, high false-positive rates, and heavy computational overhead on domain-specific remote sensing tasks, such as detecting missile launch sites in remote regions. Method: The paper proposes MilChat, a lightweight domain-specific MLLM. Using MilData, an expert-verified military remote sensing dataset with detailed captions, the authors combine chain-of-thought (CoT) reasoning annotations with Group Relative Policy Optimization (GRPO) when training a small (2B-parameter) MLLM, alongside text-image alignment modeling and supervised fine-tuning. Contributions/Results: (1) MilData, an annotated dataset and benchmark tailored to military remote sensing; (2) a training approach that pairs CoT supervision with GRPO; (3) over 80% recall and 98% precision on MilData, with MilChat outperforming larger general-purpose MLLMs and existing remote sensing-adapted models on both classification and open-ended captioning metrics.

📝 Abstract
Remarkable capabilities in understanding and generating text-image content have been demonstrated by recent advancements in multimodal large language models (MLLMs). However, their effectiveness in specialized domains, particularly those requiring resource-efficient and domain-specific adaptations, has remained limited. In this work, a lightweight multimodal language model termed MilChat is introduced, specifically adapted to analyze remote sensing imagery in secluded areas, including challenging missile launch sites. A new dataset, MilData, was compiled by verifying hundreds of aerial images through expert review, and subtle military installations were highlighted via detailed captions. Supervised fine-tuning on a 2B-parameter open-source MLLM with chain-of-thought (CoT) reasoning annotations was performed, enabling more accurate and interpretable explanations. Additionally, Group Relative Policy Optimization (GRPO) was leveraged to enhance the model's ability to detect critical domain-specific cues, such as defensive layouts and key military structures, while minimizing false positives on civilian scenes. Through empirical evaluations, it has been shown that MilChat significantly outperforms both larger, general-purpose multimodal models and existing remote sensing-adapted approaches on open-ended captioning and classification metrics. Over 80% recall and 98% precision were achieved on the newly proposed MilData benchmark, underscoring the potency of targeted fine-tuning and reinforcement learning in specialized real-world applications.
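
For readers less familiar with the two headline metrics, the following is a minimal sketch of how precision and recall are computed for a binary military/civilian classification. The function name, the label strings, and the idea of scoring per-image labels are illustrative assumptions; the abstract does not describe the evaluation code.

```python
def precision_recall(y_true, y_pred, positive="military"):
    """Return (precision, recall) for the given positive class.

    y_true and y_pred are equal-length sequences of labels.
    The label names are hypothetical, not taken from MilData.
    """
    pairs = list(zip(y_true, y_pred))
    # True positives: positive scenes correctly flagged as positive.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # False positives: non-positive scenes wrongly flagged as positive.
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # False negatives: positive scenes the model missed.
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted/present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

High precision with lower recall, as reported in the abstract, would correspond to few false alarms on civilian scenes at the cost of some missed installations.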
Problem

Research questions and friction points this paper is trying to address.

Enhancing domain-specific multimodal analysis for remote sensing
Improving accuracy in detecting military installations and structures
Reducing false positives in civilian scene classification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight multimodal model for remote sensing
Chain-of-thought reasoning for interpretable explanations
GRPO optimization for domain-specific cue detection
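
As a rough illustration of the GRPO idea named above: GRPO, as commonly described, samples a group of responses for the same prompt and scores each one relative to the group, normalizing its reward by the group mean and standard deviation instead of using a learned value function. The sketch below shows only that normalization step. The function name and the `eps` stabilizer are assumptions, and the reward design MilChat actually uses is not stated in this summary.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize per-response rewards within one sampled group.

    rewards: list of scalar rewards, one per response sampled for the
             same prompt (the reward function itself is hypothetical here).
    eps:     small constant (an assumption) to avoid division by zero
             when every response in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Responses that score above the group average receive a positive advantage and are reinforced; those below average receive a negative one. When all responses in a group earn the same reward, every advantage is zero and the group contributes no learning signal.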
Aybora Koksal
Center for Image Analysis and Department of Electrical and Electronics Engineering, Middle East Technical University (METU), Ankara, Turkey
A. Aydin Alatan
Dept. of EE Eng., Center for Image Analysis (OGAM), METU
Image Processing, Learning & Recognition, Vision