SkyNative: A Native Multimodal Framework for Remote Sensing Visual Evidence Reasoning

📅 2026-05-18
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses a limitation of existing remote sensing vision-language models: they rely on pretrained visual encoders, which can compress away fine-grained visual evidence and leave answers prone to language priors. The authors propose SkyNative, a native multimodal framework that drops the pretrained visual backbone and represents remote sensing images directly as raw patch tokens. Within a unified autoregressive architecture, a modality-aware decoupling mechanism with modality-specific parameters reconciles low-level visual patches with textual tokens. The work also introduces a visual reliance benchmark that tests whether answers are grounded in image evidence, using progressive visual degradation and misleading textual prompts. Across standard remote sensing understanding tasks and large-format spatial reasoning evaluations, the model shows stronger image-grounded perception and improved robustness against misleading textual prompts.
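To make the idea of representing an image as raw patch tokens concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function name `patchify`, the nested-list image layout, and the patch size are all assumptions chosen for illustration, and a real model would presumably map each flattened patch into the language-model token space with a learned projection, which is omitted here.

```python
from typing import List

# Hypothetical image layout for this sketch: height x width x channels.
Image = List[List[List[float]]]


def patchify(image: Image, patch: int) -> List[List[float]]:
    """Split an image into non-overlapping patch x patch tiles.

    Each tile is flattened row by row into one vector of length
    patch * patch * channels. Tiles are emitted in row-major order.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    flat.extend(image[y][x])  # append all channel values
            tokens.append(flat)
    return tokens


# Toy 4 x 4 single-channel image whose pixel value is its row-major index.
toy = [[[float(y * 4 + x)] for x in range(4)] for y in range(4)]
patch_tokens = patchify(toy, patch=2)
```

With a patch size of 2, the toy image yields four tokens of four values each. No pretrained encoder sits between the pixels and the token sequence, which is the property the summary emphasizes.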
📝 Abstract
Remote sensing vision-language models commonly rely on pretrained visual encoders to convert images into semantic features before language-model reasoning. While effective for scene-level understanding, this pipeline may prematurely compress local visual evidence, making fine-grained spatial reasoning vulnerable to language priors, especially in ultra-high-resolution remote sensing imagery. We present SkyNative, a native multimodal framework for remote sensing that adopts an encoder-free architecture, removing the pretrained visual backbone to directly represent images as raw patch tokens in the language-model token space. To reconcile low-level visual patches with textual tokens, SkyNative introduces a modality-aware decoupling mechanism that uses modality-specific parameters within a unified autoregressive backbone. We further introduce a visual reliance benchmark that diagnoses whether models ground their answers in image evidence through progressive visual degradation and misleading textual prompts. Across standard remote sensing understanding tasks and large-format spatial reasoning evaluations, SkyNative shows stronger image-grounded perception and improved robustness against prompt-induced language priors. These results suggest that native patch-level multimodal modeling is a promising direction for reliable remote sensing vision-language reasoning.
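The abstract describes modality-specific parameters inside one autoregressive backbone but does not say which parameters are separated. The sketch below assumes, purely for illustration, a single linear layer with one weight matrix per modality, where each token in a mixed sequence is routed to the matrix that matches its modality tag. The names `modality_routed_layer`, `weights`, and the `"text"`/`"image"` tags are hypothetical, and any components the real model shares across modalities are not shown.

```python
from typing import Dict, List, Sequence

Matrix = List[List[float]]


def matvec(weight: Matrix, vector: Sequence[float]) -> List[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weight]


def modality_routed_layer(
    tokens: Sequence[Sequence[float]],
    modalities: Sequence[str],
    weights: Dict[str, Matrix],
) -> List[List[float]]:
    """Apply a per-modality linear map to each token, preserving order.

    Assumption: one weight matrix per modality. The sequence stays
    interleaved, so a shared backbone could still process it as one stream.
    """
    if len(tokens) != len(modalities):
        raise ValueError("each token needs exactly one modality tag")
    return [matvec(weights[m], t) for t, m in zip(tokens, modalities)]


# Toy weights: text tokens pass through unchanged, image tokens are doubled.
weights = {
    "text": [[1.0, 0.0], [0.0, 1.0]],
    "image": [[2.0, 0.0], [0.0, 2.0]],
}
sequence = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
tags = ["image", "text", "image"]
routed = modality_routed_layer(sequence, tags, weights)
```

The point of the sketch is that low-level patch tokens and text tokens can receive different transformations without splitting the sequence into two separate models.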
Problem

Research questions and friction points this paper is trying to address.

remote sensing
vision-language models
visual evidence reasoning
language priors
spatial reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

native multimodal modeling
encoder-free architecture
modality-aware decoupling
visual reliance benchmark
remote sensing vision-language reasoning
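
The visual reliance benchmark is described as using progressive visual degradation, but the listing does not specify the degradation operator. As a stand-in, the sketch below uses block averaging at increasing block sizes to produce a ladder of progressively less informative images. The function names `block_average` and `degradation_ladder` and the choice of factors are assumptions for illustration only.

```python
from typing import List, Sequence

Gray = List[List[float]]  # single-channel image as nested lists


def block_average(image: Gray, factor: int) -> Gray:
    """Replace every factor x factor block with its mean, keeping image size."""
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for top in range(0, height, factor):
        for left in range(0, width, factor):
            ys = range(top, min(top + factor, height))
            xs = range(left, min(left + factor, width))
            mean = sum(image[y][x] for y in ys for x in xs) / (len(ys) * len(xs))
            for y in ys:
                for x in xs:
                    out[y][x] = mean
    return out


def degradation_ladder(image: Gray, factors: Sequence[int]) -> List[Gray]:
    """Return one degraded copy of the image per block size, in order."""
    return [block_average(image, f) for f in factors]


# Toy 4 x 4 checkerboard: fine detail that vanishes once blocks are averaged.
checker = [[float((y + x) % 2) for x in range(4)] for y in range(4)]
ladder = degradation_ladder(checker, factors=[1, 2, 4])
```

Under this assumed setup, a model whose answers do not change as the input moves down the ladder toward uniform gray is probably not drawing on image evidence, which is the kind of behavior the benchmark is meant to expose.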
Xiao Yang
College of Computer Science and Technology, Jilin University, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education
Ronghao Fu
College of Computer Science and Technology, Jilin University, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education
Zhiwen Lin
College of Computer Science and Technology, Jilin University, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education
Zhuoran Duan
College of Computer Science and Technology, Jilin University, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education
Jiashun Zhu
College of Computer Science and Technology, Jilin University, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education
Jiasen Hu
College of Computer Science and Technology, Jilin University, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education
Lang Sun
College of Computer Science and Technology, Jilin University, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education
Weipeng Zhang
College of Computer Science and Technology, Jilin University, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education
Jiaqi Liu
PhD student, Department of Computer Science, UNC Chapel Hill
Embodied Intelligence, Vision-Language Model, Reinforcement Learning, Autonomous Vehicle
Xu Na
College of Computer Science and Technology, Jilin University, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education
Haoran Liu
Ph.D. Student, Department of Computer Science & Engineering, Texas A&M University
LLMs, Graph/Geometric Learning, AI for Science, Generative Models
Weijie Zhang
University of Kansas Medical Center
Inverse planning, particle therapy
Bo Yang
College of Computer Science and Technology, Jilin University, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education