Asking like Socrates: Socrates helps VLMs understand remote sensing images

📅 2025-11-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Remote sensing (RS) multimodal reasoning suffers from widespread “pseudo reasoning”: models narrate a reasoning process and produce answers from linguistic self-consistency rather than from visual evidence. This is attributed to the “Glance Effect”, in which a single, coarse pass over large-scale RS imagery leaves the model with an incomplete understanding of the scene. To address this, we propose RS-EoT (Remote Sensing Evidence-of-Thought), a language-driven, iterative visual evidence-seeking paradigm, and introduce SocraticAgent, a self-play multi-agent system that synthesizes reasoning traces through alternating cycles of reasoning and visual inspection. A two-stage progressive RL strategy then strengthens and generalizes these patterns: RL on fine-grained grounding tasks first, followed by RL on RS VQA. RS-EoT achieves state-of-the-art performance on multiple RS VQA and grounding benchmarks. Analyses show clear iterative cycles of reasoning and evidence seeking, indicating that RS-EoT mitigates the Glance Effect and supports evidence-grounded reasoning.

📝 Abstract

Recent multimodal reasoning models, inspired by DeepSeek-R1, have significantly advanced vision-language systems. However, in remote sensing (RS) tasks, we observe widespread pseudo reasoning: models narrate the process of reasoning rather than genuinely reason toward the correct answer based on visual evidence. We attribute this to the Glance Effect, where a single, coarse perception of large-scale RS imagery results in incomplete understanding and reasoning based on linguistic self-consistency instead of visual evidence. To address this, we propose RS-EoT (Remote Sensing Evidence-of-Thought), a language-driven, iterative visual evidence-seeking paradigm. To instill this paradigm, we propose SocraticAgent, a self-play multi-agent system that synthesizes reasoning traces via alternating cycles of reasoning and visual inspection. To enhance and generalize these patterns, we propose a two-stage progressive RL strategy: first, RL on fine-grained Grounding tasks to enhance RS-EoT capabilities, followed by RL on RS VQA to generalize to broader understanding scenarios. Experiments show RS-EoT achieves state-of-the-art performance on multiple RS VQA and grounding benchmarks. Analyses reveal clear iterative cycles of reasoning and evidence seeking, confirming RS-EoT mitigates the Glance Effect and enables genuine evidence-grounded reasoning. Our code, data, and models are available at https://geox-lab.github.io/Asking_like_Socrates

Problem

Research questions and friction points this paper is trying to address.

Addresses pseudo reasoning in remote sensing vision-language models

Mitigates the Glance Effect for evidence-based image understanding

Enables iterative visual evidence-seeking in remote sensing tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

Language-driven iterative visual evidence-seeking paradigm

Self-play multi-agent system synthesizing reasoning traces

Two-stage progressive reinforcement learning strategy
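The alternating cycle of reasoning and visual inspection described in the abstract can be pictured with a toy loop. This is only an illustrative sketch, not the authors' implementation: `evidence_loop`, `toy_reasoner`, `toy_perceiver`, and the `SCENE` dictionary are all hypothetical names, and the "perceiver" is a plain dictionary lookup standing in for a model that actually inspects the image.

```python
# Toy sketch of a language-driven, iterative evidence-seeking loop.
# All names here are hypothetical; no real VLM or RS-EoT code is used.

# Stand-in for an image: what a perceiver would report when asked about a target.
SCENE = {
    "runway": "two parallel runways in the northern half",
    "aircraft": "three aircraft parked on an apron",
}


def toy_perceiver(query):
    """Pretend to inspect the image for `query` and report what is visible."""
    return SCENE.get(query, "nothing found")


def toy_reasoner(question, evidence):
    """Ask for missing evidence first; answer only once both targets were inspected."""
    seen = {query for query, _ in evidence}
    for target in ("runway", "aircraft"):
        if target not in seen:
            return ("ask", target)
    return ("answer", "airport")


def evidence_loop(question, reasoner, perceiver, max_rounds=4):
    """Alternate reasoning and inspection until the reasoner commits to an answer.

    Returns (answer, trace). `answer` is None if no answer is produced
    within `max_rounds` reasoning steps.
    """
    evidence = []  # (query, finding) pairs gathered so far
    trace = []     # readable record of each reasoning / inspection step
    for _ in range(max_rounds):
        kind, content = reasoner(question, evidence)
        trace.append((kind, content))
        if kind == "answer":
            return content, trace
        finding = perceiver(content)  # look at the image again, guided by language
        trace.append(("see", finding))
        evidence.append((content, finding))
    return None, trace


answer, trace = evidence_loop("What facility is shown?", toy_reasoner, toy_perceiver)
for kind, content in trace:
    print(f"{kind}: {content}")
```

The point of the sketch is the control flow: the answer is produced only after the reasoner has requested and received visual findings, in contrast to a single-pass "glance" that answers immediately.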
