🤖 AI Summary
This work addresses "perceptual inertia" in remote sensing vision-language models post-trained with reinforcement learning: reward maximization pushes models to rely on local salient cues, so they explore too little visual evidence and shift attention inflexibly across tasks. The study names this bias and proposes RS-HyRe-R1, a hybrid reward framework that combines three rewards (spatial reasoning activation, perception correctness, and visual-semantic path evolution) to encourage deeper and more diverse visual reasoning. With only 3B parameters, the model achieves state-of-the-art performance on referring expression comprehension (REC), open-vocabulary detection (OVD), and visual question answering (VQA), outperforming existing models with up to 7B parameters. In zero-shot settings, it surpasses the second-best model by 3.16%, 3.97%, and 2.72% on VQA, OVD, and REC, respectively.
📝 Abstract
Reinforcement learning (RL) post-training substantially improves remote sensing vision-language models (RS-VLMs). However, when handling complex remote sensing imagery (RSI) that requires exhaustive visual scanning, models tend to rely on localized salient cues for rapid inference. We term this RL-induced bias "perceptual inertia". Driven by reward maximization, models favor quick outcome fitting, which leads to two limitations: cognitively, overreliance on specific features impedes complete evidence construction; operationally, models struggle to flexibly shift visual focus across tasks. To address this bias and encourage comprehensive visual evidence mining, we propose RS-HyRe-R1, a hybrid reward framework for RSI understanding. It introduces: (1) a spatial reasoning activation reward that enforces structured visual reasoning; (2) a perception correctness reward that provides adaptive quality anchors across RS tasks, ensuring accurate geometric and semantic alignment; and (3) a visual-semantic path evolution reward that penalizes repetitive reasoning and promotes exploration of complementary cues to build richer evidence chains. Experiments show that RS-HyRe-R1 effectively mitigates "perceptual inertia", encouraging deeper, more diverse reasoning. With only 3B parameters, it achieves state-of-the-art performance on REC, OVD, and VQA tasks, outperforming models with up to 7B parameters. It also demonstrates strong zero-shot generalization, surpassing the second-best model by 3.16%, 3.97%, and 2.72% on VQA, OVD, and REC, respectively. Code and datasets are available at https://github.com/geox-lab/RS-HyRe-R1.
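To make the three-part reward concrete, the following is a minimal illustrative sketch of how a hybrid reward of this shape could be composed. It is not the authors' implementation: the abstract gives no formulas, so everything here is an assumption, including the `<think>`/`<answer>` tag format, the use of box IoU as the correctness signal for a grounding task, the distinct-sentence ratio as a stand-in for the path evolution reward, and the weights. All function names are hypothetical.

```python
import re


def format_reward(response: str) -> float:
    """Assumed stand-in for the spatial reasoning activation reward:
    1.0 if the response is a <think> block followed by an <answer> block."""
    pattern = r"\s*<think>.+?</think>\s*<answer>.+?</answer>\s*"
    return 1.0 if re.fullmatch(pattern, response, flags=re.DOTALL) else 0.0


def iou(box_a, box_b) -> float:
    """Intersection over union of two (x1, y1, x2, y2) boxes; an assumed
    stand-in for the perception correctness reward on a grounding task."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def diversity_reward(reasoning: str) -> float:
    """Assumed stand-in for the path evolution reward: the fraction of
    distinct reasoning steps, so verbatim repetition lowers the score."""
    steps = [s.strip().lower() for s in re.split(r"[.\n]+", reasoning) if s.strip()]
    if not steps:
        return 0.0
    return len(set(steps)) / len(steps)


def hybrid_reward(response, pred_box, gt_box, weights=(0.2, 0.6, 0.2)) -> float:
    """Weighted sum of the three components; the weights are illustrative only."""
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    reasoning = match.group(1) if match else ""
    w_format, w_correct, w_diverse = weights
    return (
        w_format * format_reward(response)
        + w_correct * iou(pred_box, gt_box)
        + w_diverse * diversity_reward(reasoning)
    )
```

Under these assumptions, a response that repeats the same reasoning step scores lower than one that cites complementary cues, even when both predict the same box. That is the behavior the abstract attributes to the path evolution reward, though the paper's actual formulation may differ substantially.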