RS-MoE: A Vision-Language Model with Mixture of Experts for Remote Sensing Image Captioning and Visual Question Answering

📅 2024-11-03

📈 Citations: 2

✨ Influential: 0

🤖 AI Summary

To address the limited caption diversity in remote sensing image captioning (RSIC), this paper introduces RS-MoE, which the authors present as the first Mixture-of-Experts (MoE) vision-language model tailored for remote sensing. RS-MoE features: (1) an MoE Block that integrates multiple lightweight LLMs as experts; (2) an Instruction Router that generates a specific prompt for each expert, so that experts focus on distinct aspects of the captioning task and sub-tasks can be processed in parallel; and (3) a two-stage training strategy intended to prevent performance degradation due to sparsity. Fine-tuned on RSICap, the model is reported to achieve state-of-the-art (SOTA) captioning performance on RSICap and on other traditional datasets without additional fine-tuning, with the RS-MoE-1B variant performing comparably to 13B VLMs. The model is also reported to achieve SOTA performance on remote sensing visual question answering (RSVQA), which the authors take as evidence of promising generalization.

📝 Abstract

Remote Sensing Image Captioning (RSIC) presents unique challenges and plays a critical role in remote sensing applications. Traditional RSIC methods often struggle to produce rich and diverse descriptions. Recently, with advances in vision-language models (VLMs), efforts have emerged to integrate these models into the remote sensing domain and to introduce descriptive datasets designed specifically to enhance VLM training. This paper proposes RS-MoE, the first Mixture-of-Experts (MoE) based VLM customized specifically for the remote sensing domain. Unlike traditional MoE models, RS-MoE is built around an MoE Block, which incorporates a novel Instruction Router and multiple lightweight Large Language Models (LLMs) as expert models. The Instruction Router generates a specific prompt tailored to each LLM, guiding the experts to focus on distinct aspects of the RSIC task. This design allows each expert LLM to concentrate on a specific subset of the task, which enhances the specificity and accuracy of the generated captions, and it improves the scalability of the model by facilitating parallel processing of sub-tasks. Additionally, we present a two-stage training strategy for tuning RS-MoE that prevents performance degradation due to sparsity. We fine-tuned our model on the RSICap dataset using this training strategy. Experimental results on the RSICap dataset, along with evaluations on other traditional datasets where no additional fine-tuning was applied, demonstrate that our model achieves state-of-the-art performance in generating precise and contextually relevant captions. Notably, our RS-MoE-1B variant achieves performance comparable to 13B VLMs, demonstrating the efficiency of our model design. Moreover, our model demonstrates promising generalization capabilities by consistently achieving state-of-the-art performance on the Remote Sensing Visual Question Answering (RSVQA) task.
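
The routing idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the aspect names, the prompt template, the stub experts, and the choice to join the expert outputs by simple concatenation are all assumptions made for illustration, since the abstract does not specify them. The sketch only shows the control flow: a router builds one prompt per expert, the experts run in parallel, and their outputs are combined into a caption.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical sub-task aspects; the abstract does not list the real ones.
ASPECTS = ["overall scene", "object attributes", "spatial relations"]


@dataclass
class Expert:
    """Stand-in for one lightweight LLM expert (a stub, not a real model)."""
    name: str
    generate: Callable[[str, str], str]  # (image_features, prompt) -> text


def instruction_router(instruction: str, aspects: List[str]) -> Dict[str, str]:
    """Build one aspect-specific prompt per expert (assumed template)."""
    return {a: f"{instruction} Focus only on: {a}." for a in aspects}


def stub_generate(aspect: str) -> Callable[[str, str], str]:
    """Return a fake generator that tags its output with its aspect."""
    def _gen(image_features: str, prompt: str) -> str:
        return f"[{aspect}] description of {image_features}"
    return _gen


def moe_block(image_features: str, instruction: str,
              experts: Dict[str, Expert]) -> str:
    """Route prompts to experts, run them in parallel, join the outputs."""
    prompts = instruction_router(instruction, list(experts))
    with ThreadPoolExecutor(max_workers=len(experts)) as pool:
        futures = {
            a: pool.submit(experts[a].generate, image_features, prompts[a])
            for a in experts
        }
        # Collect in the experts' insertion order so the result is stable.
        parts = [futures[a].result() for a in experts]
    # Assumption: outputs are concatenated; the paper may combine differently.
    return " ".join(parts)


experts = {
    a: Expert(name=f"expert_{i}", generate=stub_generate(a))
    for i, a in enumerate(ASPECTS)
}
caption = moe_block("img_001", "Describe this remote sensing image.", experts)
print(caption)
```

In a real system each `generate` callable would wrap a lightweight LLM conditioned on visual features, and the thread pool would be replaced by whatever batching or device placement the model uses. The sketch is only meant to show why giving each expert its own prompt makes the sub-tasks independent and therefore easy to run in parallel.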

Problem

Research questions and friction points this paper is trying to address.

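Traditional RSIC methods struggle to produce rich, diverse, and accurate captions.

No Mixture-of-Experts VLM had previously been customized for the remote sensing domain.

Sparse expert models risk performance degradation, while RSIC calls for task-specific focus and scalability.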

Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture of Experts for VLM

Instruction Router for LLMs

Two-stage training strategy
