GeoPix: Multi-Modal Large Language Model for Pixel-level Image Understanding in Remote Sensing

📅 2025-01-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current remote sensing multimodal large language models lack pixel-level instruction understanding, largely because no large-scale pixel-annotated dataset exists and existing architectures are not designed for fine-grained spatial reasoning. To address this, we propose GeoPix, a pixel-level multimodal large language model for remote sensing, together with GeoPixInstruct, a large-scale remote sensing pixel-level instruction dataset (65,463 images, 140,412 instances). We equip the model with a mask predictor and a class-wise learnable memory module to support segmentation of multi-scale objects. The architecture combines a vision encoder, a large language model, and the mask predictor, and is trained with a two-stage strategy that balances text generation and mask prediction. Experiments show improvements over baselines on pixel-level segmentation tasks while maintaining competitive performance on image-level and region-level benchmarks (e.g., VQA, captioning).

📝 Abstract
Multi-modal large language models (MLLMs) have achieved remarkable success in image- and region-level remote sensing (RS) image understanding tasks, such as image captioning, visual question answering, and visual grounding. However, existing RS MLLMs lack pixel-level dialogue capability, which involves responding to user instructions with segmentation masks for specific instances. In this paper, we propose GeoPix, an RS MLLM that extends image understanding capabilities to the pixel level. This is achieved by equipping the MLLM with a mask predictor, which transforms visual features from the vision encoder into masks conditioned on the LLM's segmentation token embeddings. To facilitate the segmentation of multi-scale objects in RS imagery, a class-wise learnable memory module is integrated into the mask predictor to capture and store class-wise geo-context at the instance level across the entire dataset. In addition, to address the absence of large-scale datasets for training pixel-level RS MLLMs, we construct the GeoPixInstruct dataset, comprising 65,463 images and 140,412 instances, with each instance annotated with text descriptions, bounding boxes, and masks. Furthermore, we develop a two-stage training strategy to balance the distinct requirements of text generation and mask prediction in multi-modal multi-task optimization. Extensive experiments verify the effectiveness and superiority of GeoPix in pixel-level segmentation tasks, while also maintaining competitive performance in image- and region-level benchmarks.
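
The idea of producing a mask from visual features conditioned on a segmentation token embedding can be illustrated with a minimal sketch. It assumes, hypothetically, that every pixel has a feature vector and that the mask logit is the dot product of that vector with the segmentation token embedding, followed by a sigmoid and a threshold. This is a simplified stand-in, not GeoPix's confirmed mask predictor: the class-wise memory module and the two-stage training are omitted, and all names (`predict_mask`, `features`, `seg_token`) are illustrative.

```python
import math


def predict_mask(pixel_features, seg_embedding, threshold=0.5):
    """Return a binary mask from per-pixel features and one query embedding.

    pixel_features: H x W x C nested lists (assumed output of a vision encoder).
    seg_embedding:  length-C list (assumed projection of the LLM's
                    segmentation token embedding into the feature space).
    """
    mask = []
    for row in pixel_features:
        mask_row = []
        for feat in row:
            # Similarity between this pixel and the segmentation query.
            logit = sum(f * q for f, q in zip(feat, seg_embedding))
            prob = 1.0 / (1.0 + math.exp(-logit))
            mask_row.append(1 if prob > threshold else 0)
        mask.append(mask_row)
    return mask


# Toy 2 x 2 feature map with 2 channels (made-up values).
features = [
    [[2.0, 0.0], [0.0, 2.0]],
    [[1.0, 2.0], [3.0, 1.0]],
]
seg_token = [1.0, -1.0]  # hypothetical segmentation token embedding

mask = predict_mask(features, seg_token)
print(mask)
```

In this toy setup, pixels whose features align with the query receive a positive logit and are marked as part of the instance; a real mask predictor would typically use a learned decoder rather than a raw dot product.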
Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models
Remote Sensing Imagery
Pixel-level Tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

GeoPix
Multi-modal Large Language Model
Pixel-level Task Execution
Authors

Ruizhe Ou
Pattern Recognition and Intelligent System Lab, School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China
Yuan Hu
Peking University
deep learning, computer vision, remote sensing
Fan Zhang
Institute of Remote Sensing and Geographic Information Systems, School of Earth and Space Sciences, Peking University, Beijing 100871, China
Jiaxin Chen
Pattern Recognition and Intelligent System Lab, School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China
Yu Liu
Institute of Remote Sensing and Geographic Information Systems, School of Earth and Space Sciences, Peking University, Beijing 100871, China; Peking University Ordos Research Institute of Energy, Ordos 017000, China