MPerS: Dynamic MLLM MixExperts Perception-Guided Remote Sensing Scene Segmentation

📅 2026-05-11
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses a gap in remote sensing (RS) scene segmentation: existing methods concentrate on architectures for fusing text with visual features and largely neglect how high-quality RS captions are generated and how effective they are in multimodal semantic fusion. The authors design multiple prompts so that multimodal large language models (LLaVA, ChatGPT, and Qwen) describe each scene from diverse expert perspectives. DINOv3 supplies dense visual features, a Dynamic MixExperts module adaptively integrates the most effective textual semantics, and Linguistic Query Guided Attention uses that text to guide the visual features toward precise segmentation. The method is reported to achieve superior performance on three public RS semantic segmentation datasets.
📝 Abstract
The multimodal fusion of images and scene captions has been extensively explored and applied in various fields. However, when dealing with complex remote sensing (RS) scenes, existing studies have predominantly concentrated on architectural optimizations for integrating textual semantic information with visual features, while largely neglecting the generation of high-quality RS captions and the investigation of their effectiveness in multimodal semantic fusion. In this context, we propose the Dynamic MLLM Mixture-of-Experts Perception-Guided Remote Sensing Scene Segmentation, referred to as MPerS. We design multiple prompts for MLLMs to generate high-quality RS captions, enabling MLLMs to perceive RS scenes from diverse expert perspectives. DINOv3 is employed to extract dense visual representations of land-covers. We design a Dynamic MixExperts module that adaptively integrates the most effective textual semantics. Linguistic Query Guided Attention is constructed to utilize textual semantic information to guide visual features for precise segmentation. The MLLMs include LLaVA, ChatGPT, and Qwen. Our method achieves superior performance on three public semantic segmentation RS datasets.
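The abstract does not give equations for the Dynamic MixExperts module or for Linguistic Query Guided Attention, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes each MLLM caption has already been encoded into a single vector, that gating is a softmax over caption scores conditioned on mean-pooled visual features, and that segmentation logits come from text-conditioned queries cross-attending over patch features. The names `dynamic_mix_experts`, `linguistic_query_attention`, `w_gate`, and `class_embed` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def dynamic_mix_experts(caption_emb, visual_feats, w_gate):
    """Assumed gating scheme: weight each expert caption by its relevance to the image.

    caption_emb:  (E, D) one embedding per MLLM caption (e.g. LLaVA, ChatGPT, Qwen)
    visual_feats: (N, D) dense patch features (DINOv3 in the paper; random here)
    w_gate:       (D, D) hypothetical learned gating matrix
    Returns the fused text vector (D,) and the gate weights (E,).
    """
    context = visual_feats.mean(axis=0)          # (D,) global image context
    scores = caption_emb @ w_gate @ context      # (E,) one score per expert
    gates = softmax(scores)                      # weights sum to 1
    fused = gates @ caption_emb                  # (D,) weighted caption mix
    return fused, gates


def linguistic_query_attention(text_queries, visual_feats):
    """Assumed form: text-derived queries cross-attend over patches.

    text_queries: (Q, D) one query per class, conditioned on the fused text
    visual_feats: (N, D) dense patch features
    Returns per-class mask logits of shape (Q, N).
    """
    d = text_queries.shape[-1]
    attn = softmax(text_queries @ visual_feats.T / np.sqrt(d), axis=-1)  # (Q, N)
    updated = text_queries + attn @ visual_feats                         # (Q, D)
    return updated @ visual_feats.T                                      # (Q, N)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    E, D, H, W, Q = 3, 16, 8, 8, 4               # experts, dim, patch grid, classes
    caption_emb = rng.normal(size=(E, D))        # stand-in caption embeddings
    visual_feats = rng.normal(size=(H * W, D))   # stand-in patch features
    w_gate = rng.normal(size=(D, D)) / np.sqrt(D)
    class_embed = rng.normal(size=(Q, D))        # hypothetical class embeddings

    fused, gates = dynamic_mix_experts(caption_emb, visual_feats, w_gate)
    logits = linguistic_query_attention(class_embed + fused, visual_feats)
    seg = logits.argmax(axis=0).reshape(H, W)    # per-patch class labels
    print("gate weights:", gates)
    print("segmentation map shape:", seg.shape)
```

In this sketch the gate weights change per image because they depend on the pooled visual context, which is one plausible reading of "dynamic"; the paper may use a different gating signal or a token-level rather than caption-level mixture.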
Problem

Research questions and friction points this paper is trying to address.

remote sensing
scene segmentation
multimodal fusion
caption generation
semantic segmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic MixExperts
Perception-Guided Segmentation
Multimodal LLMs
Remote Sensing Captioning
Linguistic Query Guided Attention