Beyond GSD-as-Token: Continuous Scale Conditioning for Remote Sensing VLMs

📅 2026-05-08
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the challenge that existing remote sensing vision-language models struggle to effectively represent the same geographic entity across orders-of-magnitude scale variations, primarily due to neglecting or discretizing ground sampling distance (GSD). To overcome this limitation, we propose ScaleEarth, a novel framework built upon Qwen3-VL that incorporates a continuous-scale conditioning mechanism, CS-HLoRA, which dynamically modulates the model's computational pathways using GSD as a continuous variable. The framework is further enhanced by an SSE-U heteroscedastic GSD prediction head and trained on GeoScale-VQA, the first large-scale, scale-aligned remote sensing visual question answering dataset, forming a closed-loop training system. Extensive experiments demonstrate that ScaleEarth achieves state-of-the-art performance on benchmarks such as XLRS-Bench and OmniEarth-Bench, significantly improving cross-scale generalization capabilities.
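A heteroscedastic head predicts both a value and its own uncertainty. One common way to train such a head is a Gaussian negative log-likelihood with a predicted log-variance. The sketch below illustrates that general technique only; it is not the SSE-U loss, which this summary does not specify. The function name `heteroscedastic_nll` and the choice to regress log10(GSD) are assumptions made for illustration.

```python
import math

def heteroscedastic_nll(mu, log_var, target):
    """Per-sample Gaussian negative log-likelihood (constant term dropped).

    mu       predicted log10(GSD)  (assumed regression target)
    log_var  predicted log-variance, i.e. the head's uncertainty estimate
    target   reference log10(GSD), e.g. from sensor metadata
    """
    return 0.5 * (log_var + (target - mu) ** 2 / math.exp(log_var))

# A 0.3 m/pixel image whose GSD the head misjudges as 3 m/pixel.
target = math.log10(0.3)
mu = math.log10(3.0)

confident = heteroscedastic_nll(mu, log_var=-2.0, target=target)
uncertain = heteroscedastic_nll(mu, log_var=0.0, target=target)

# With the same wrong prediction, a larger predicted variance lowers the loss.
print(uncertain < confident)
```

Under this loss, a head that is wrong pays less when it also reports high variance, which is what lets the predicted uncertainty carry information at deployment time.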
📝 Abstract
Remote sensing vision-language models (RS-VLMs) face a fundamental mismatch with natural-image counterparts: the same geographic object exhibits radically different visual evidence across ground sampling distances (GSDs) spanning multiple orders of magnitude. Yet existing RS-VLMs often discard GSD or inject it as a discrete text token, forcing a single static parameter set to absorb the entire scale spectrum. We introduce ScaleEarth, a parameter-efficient fine-tuning framework built on Qwen3-VL that treats GSD as a continuous conditioning variable governing the model's computation path. At its core, CS-HLoRA (Continuous Scale-Conditioned Hyper-LoRA) modulates the LoRA low-rank subspace through a GSD-driven gate, enabling the model to dynamically route computation by physical scale. To remove reliance on sensor metadata at deployment, we pair CS-HLoRA with SSE-U, a lightweight heteroscedastic sub-head that predicts GSD and its uncertainty from visual features. To provide matching supervision, we construct GeoScale-VQA, a 1.5M-sample scale-layered RS-VQA corpus whose question-answer generation is conditioned on the same physical scalar that drives CS-HLoRA, forming a closed method-data loop. Trained with QLoRA on an 8B backbone, ScaleEarth achieves state-of-the-art results on remote-sensing benchmarks covering diverse Earth-system tasks, including XLRS-Bench and OmniEarth-Bench.
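The idea of modulating a LoRA low-rank subspace with a GSD-driven gate can be sketched in a few lines. This is a minimal pure-Python illustration under stated assumptions, not the CS-HLoRA implementation: the abstract does not give the gate's form, so the Gaussian bumps over log10(GSD), the per-rank gating, and every name here (`gsd_gate`, `gated_lora_forward`, `centers`, `width`) are hypothetical.

```python
import math
import random

def gsd_gate(gsd_m, centers, width=0.5):
    """Map a GSD (metres/pixel) to one smooth gate value per LoRA rank component.

    Assumption: each rank component prefers a scale, expressed as a centre in
    log10(GSD) space, and responds with a Gaussian bump around that centre.
    """
    s = math.log10(gsd_m)
    return [math.exp(-0.5 * ((s - c) / width) ** 2) for c in centers]

def matvec(M, v):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def gated_lora_forward(x, W, A, B, gate, alpha=1.0):
    """Compute y = W x + alpha * B (gate * (A x)), gating each rank component."""
    base = matvec(W, x)                          # frozen backbone path
    low = matvec(A, x)                           # project into the rank-r subspace
    low = [g * h for g, h in zip(gate, low)]     # scale-dependent modulation
    delta = matvec(B, low)                       # project back to output width
    return [b + alpha * d for b, d in zip(base, delta)]

rng = random.Random(0)
d_in, d_out, rank = 4, 3, 2
W = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(rank)]
B = [[rng.uniform(-1, 1) for _ in range(rank)] for _ in range(d_out)]
x = [rng.uniform(-1, 1) for _ in range(d_in)]

centers = [-0.5, 1.5]   # roughly 0.3 m and 30 m per pixel (illustrative)
gate_fine = gsd_gate(0.3, centers)
gate_coarse = gsd_gate(30.0, centers)

y_fine = gated_lora_forward(x, W, A, B, gate_fine)
y_coarse = gated_lora_forward(x, W, A, B, gate_coarse)
print(gate_fine, gate_coarse)
```

Because the gate is a continuous function of GSD, the same input and the same adapter weights yield different updates at different physical scales, in contrast to passing GSD as a discrete text token.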
Problem

Research questions and friction points this paper is trying to address.

Remote sensing
Vision-language models
Ground sampling distance
Scale variation
Physical scale
Innovation

Methods, ideas, or system contributions that make the work stand out.

Continuous Scale Conditioning
CS-HLoRA
Remote Sensing VLMs
GSD Prediction
Parameter-Efficient Fine-Tuning