RS-HyRe-R1: A Hybrid Reward Mechanism to Overcome Perceptual Inertia for Remote Sensing Images Understanding

📅 2026-04-19
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses "perceptual inertia" in remote sensing vision-language models post-trained with reinforcement learning: reward maximization pushes models to over-rely on locally salient cues, so they explore too little visual evidence and shift attention inflexibly across tasks. The study introduces the concept of perceptual inertia and proposes RS-HyRe-R1, a hybrid reward mechanism that combines a spatial reasoning activation reward, a perception correctness reward, and a visual-semantic path evolution reward to encourage deep and diverse visual reasoning. Implemented as reinforcement learning post-training on a lightweight 3B-parameter model, the method achieves state-of-the-art performance on referring expression comprehension (REC), open-vocabulary detection (OVD), and visual question answering (VQA), outperforming existing models with up to 7B parameters. In zero-shot settings, it surpasses the second-best model by 3.16%, 3.97%, and 2.72% on VQA, OVD, and REC, respectively.

📝 Abstract
Reinforcement learning (RL) post-training substantially improves remote sensing vision-language models (RS-VLMs). However, when handling complex remote sensing imagery (RSI) requiring exhaustive visual scanning, models tend to rely on localized salient cues for rapid inference. We term this RL-induced bias "perceptual inertia". Driven by reward maximization, models favor quick outcome fitting, leading to two limitations: cognitively, overreliance on specific features impedes complete evidence construction; operationally, models struggle to flexibly shift visual focus across tasks. To address this bias and encourage comprehensive visual evidence mining, we propose RS-HyRe-R1, a hybrid reward framework for RSI understanding. It introduces: (1) a spatial reasoning activation reward that enforces structured visual reasoning; (2) a perception correctness reward that provides adaptive quality anchors across RS tasks, ensuring accurate geometric and semantic alignment; and (3) a visual-semantic path evolution reward that penalizes repetitive reasoning and promotes exploration of complementary cues to build richer evidence chains. Experiments show RS-HyRe-R1 effectively mitigates "perceptual inertia", encouraging deeper, more diverse reasoning. With only 3B parameters, it achieves state-of-the-art performance on REC, OVD, and VQA tasks, outperforming models up to 7B parameters. It also demonstrates strong zero-shot generalization, surpassing the second-best model by 3.16%, 3.97%, and 2.72% on VQA, OVD, and REC, respectively. Code and datasets are available at https://github.com/geox-lab/RS-HyRe-R1.
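The three reward terms named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `<think>`/`<answer>` tag format, the use of box IoU as the correctness signal, the n-gram diversity measure, the weights, and all function names below are assumptions chosen only to show how a structure term, a correctness term, and an anti-repetition term might be combined into one scalar reward.

```python
import re

def format_reward(response: str) -> float:
    """Assumed stand-in for a structure reward: 1.0 only if the response
    is a <think>...</think> block followed by an <answer>...</answer> block."""
    pattern = r"\s*<think>.+?</think>\s*<answer>.+?</answer>\s*"
    return 1.0 if re.fullmatch(pattern, response, flags=re.DOTALL) else 0.0

def box_iou(a, b) -> float:
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def diversity_reward(reasoning: str, n: int = 3) -> float:
    """Assumed stand-in for an anti-repetition reward: the fraction of
    word n-grams that are unique. Empty or very short reasoning scores 0.0."""
    words = reasoning.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)

def hybrid_reward(response, pred_box, gt_box, weights=(0.2, 0.6, 0.2)) -> float:
    """Weighted sum of the three terms; the weights are illustrative only."""
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    reasoning = match.group(1) if match else ""
    w_format, w_correct, w_diverse = weights
    return (w_format * format_reward(response)
            + w_correct * box_iou(pred_box, gt_box)
            + w_diverse * diversity_reward(reasoning))

if __name__ == "__main__":
    resp = ("<think>roof shape differs from nearby tanks</think>"
            "<answer>[0, 0, 2, 2]</answer>")
    print(hybrid_reward(resp, (0, 0, 2, 2), (0, 0, 2, 2)))
    print(hybrid_reward(resp, (1, 1, 3, 3), (0, 0, 2, 2)))
```

In this sketch, a well-formed response with non-repetitive reasoning and an exact box match scores close to the maximum, while a shifted box lowers only the correctness term. A task-adaptive variant would swap `box_iou` for an answer-match score on VQA; how the paper actually adapts its quality anchors across tasks is not specified here.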
Problem

Research questions and friction points this paper is trying to address.

perceptual inertia
remote sensing image understanding
reinforcement learning bias
visual evidence mining
visual focus shifting
Innovation

Methods, ideas, or system contributions that make the work stand out.

perceptual inertia
hybrid reward mechanism
spatial reasoning
visual-semantic alignment
reinforcement learning
Authors
Gaozhi Zhou
School of Mechanical and Electrical Engineering, Central South University, Changsha 410083, China
Hu He
School of Mechanical and Electrical Engineering, Central South University, Changsha 410083, China
Peng Shen
School of Geosciences and Info-Physics, Central South University, Changsha 410083, China
Jipeng Zhang
Hong Kong University of Science and Technology
natural language processing, question answering
Liujue Zhang
School of Geosciences and Info-Physics, Central South University, Changsha 410083, China
Linrui Xu
School of Geosciences and Info-Physics, Central South University, Changsha 410083, China
Zeyuan Wang
PhD, The University of Sydney
NLP, Medical Informatics
Ziyu Li
Philips I&D Data & AI
Knowledge Extraction, Query Optimization, Machine Learning, Graph
Xuezhi Cui
School of Geosciences and Info-Physics, Central South University, Changsha 410083, China
Wang Guo
School of Geosciences and Info-Physics, Central South University, Changsha 410083, China
Haifeng Li
Central South University
GIS, Remote sensing, Machine learning, Sparse representation, Brain Theory