TinyRS-R1: Compact Multimodal Language Model for Remote Sensing

📅 2025-05-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Remote sensing applications deployed at the edge are constrained by limited computational resources, which makes hosting 7B-scale multimodal large language models (MLLMs) infeasible. To address this, we propose TinyRS, the first domain-specific multimodal small language model with 2B parameters, and its reasoning-enhanced variant, TinyRS-R1, which support image understanding, visual question answering (VQA), and spatial localization. Our method is the first to bring chain-of-thought (CoT) reasoning and Group Relative Policy Optimization (GRPO) into the training of a small remote sensing model. It builds on the Qwen2-VL-2B architecture and a four-stage training pipeline: satellite image pre-training → visual instruction tuning → CoT fine-tuning → GRPO alignment. Experiments show that TinyRS-R1 matches or surpasses recent 7B remote sensing MLLMs across classification, VQA, visual grounding, and open-ended QA tasks, while requiring approximately one-third of their memory footprint and inference latency.

📝 Abstract

Remote-sensing applications often run on edge hardware that cannot host today's 7B-parameter multimodal language models. This paper introduces TinyRS, the first 2B-parameter multimodal small language model (MSLM) optimized for remote sensing tasks, and TinyRS-R1, its reasoning-augmented variant. Built upon Qwen2-VL-2B, TinyRS is trained through a four-stage pipeline: pre-training on million-scale satellite imagery, instruction tuning on visual instruction examples, fine-tuning with Chain-of-Thought (CoT) annotations from the proposed reasoning dataset, and alignment via Group Relative Policy Optimization (GRPO). TinyRS-R1 matches or surpasses the performance of recent 7B-parameter remote sensing models across classification, VQA, visual grounding, and open-ended question answering, while requiring just one-third of the memory and latency. Our analysis shows that CoT reasoning substantially benefits spatial grounding and scene understanding, while the non-reasoning TinyRS excels in concise, latency-sensitive VQA tasks. TinyRS-R1 represents the first domain-specialized MSLM with GRPO-aligned CoT reasoning for general-purpose remote sensing.
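
The abstract names GRPO but does not spell out how it scores responses. As a rough illustration of the general technique (not the authors' training code), GRPO samples several responses per prompt and normalizes each response's reward against the mean and standard deviation of its own group. The function name, the `eps` term, the use of the population standard deviation, and the example rewards below are all assumptions made for this sketch.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own group's statistics.

    rewards: scalar rewards for the responses sampled for ONE prompt.
    eps: small constant (an assumption) to avoid division by zero
         when every response in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for four sampled answers to one prompt,
# e.g. 1.0 for a correct answer and 0.0 for an incorrect one.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that beat their group average receive a positive advantage and the rest a negative one, so no separate value network is needed to provide a baseline.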

Problem

Research questions and friction points this paper is trying to address.

Develops a compact 2B-parameter multimodal model for edge-based remote sensing

Enhances reasoning via Chain-of-Thought for spatial grounding and scene understanding

Reduces memory and latency to about one-third of 7B-parameter models while matching or surpassing their performance
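
A back-of-the-envelope check suggests why a figure near one-third is plausible for memory. The sketch below counts model weights only and assumes 16-bit storage and nominal parameter counts of 2B and 7B; activations, the KV cache, and the vision encoder's runtime buffers are ignored, so it is an illustration rather than a reproduction of the paper's measurements.

```python
BYTES_PER_PARAM_FP16 = 2  # assumption: weights stored in 16-bit precision

def weight_memory_gb(num_params, bytes_per_param=BYTES_PER_PARAM_FP16):
    """Approximate memory for model weights alone, in gigabytes (1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

small_gb = weight_memory_gb(2e9)  # nominal 2B-parameter model
large_gb = weight_memory_gb(7e9)  # nominal 7B-parameter model
ratio = small_gb / large_gb       # 2/7, a little under one-third
```

Under these assumptions the ratio is 2/7, which is consistent with the "one-third" figure quoted in the abstract.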

Innovation

Methods, ideas, or system contributions that make the work stand out.

2B-parameter multimodal small language model

Four-stage training pipeline with GRPO alignment

Reasoning-augmented variant for spatial grounding
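
The four-stage pipeline can be pictured as an ordered sequence in which each stage starts from the previous stage's checkpoint. The sketch below only encodes the stage order given in the abstract; the `Stage` class, the `objective` labels, and the `train_stage` callback are hypothetical names introduced for illustration, and no real training takes place.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    data: str       # kind of data named in the abstract
    objective: str  # "supervised" or "rl": a labeling assumption

# Stage order as described in the abstract.
PIPELINE = (
    Stage("pre-training", "satellite images", "supervised"),
    Stage("instruction tuning", "visual instruction examples", "supervised"),
    Stage("CoT fine-tuning", "Chain-of-Thought annotations", "supervised"),
    Stage("GRPO alignment", "sampled responses scored by a reward", "rl"),
)

def run_pipeline(checkpoint, train_stage):
    """Feed each stage's output checkpoint into the next stage."""
    for stage in PIPELINE:
        checkpoint = train_stage(checkpoint, stage)
    return checkpoint

def describe(checkpoint, stage):
    """Stand-in for a real trainer: just records the stage that ran."""
    return f"{checkpoint} -> {stage.name}"

final = run_pipeline("Qwen2-VL-2B", describe)
```

Swapping `describe` for a real training function would keep the same control flow; the point is only that every stage builds on the checkpoint produced before it.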
