🤖 AI Summary
Existing vision systems lack training-free robust reasoning capabilities for high-stakes domains such as remote sensing and medical diagnosis.
Method: We propose a training-free intelligent visual reasoning framework featuring a novel Think–Critique–Act agent-style reasoning loop. It seamlessly integrates vision-language models with pure vision models to dynamically construct verifiable reasoning chains and scale computational resources at test time—without fine-tuning or additional training.
Contribution/Results: By augmenting only the inference process, our method significantly improves system trustworthiness and robustness. On multiple challenging visual reasoning benchmarks, it achieves up to a 40 percentage-point absolute accuracy gain. This work provides the first systematic empirical validation of the critical role of “amplified test-time computation” in enhancing robustness for complex visual decision-making.
📝 Abstract
Developing trustworthy intelligent vision systems for high-stakes domains, emph{e.g.}, remote sensing and medical diagnosis, demands broad robustness without costly retraining. We propose extbf{Visual Reasoning Agent (VRA)}, a training-free, agentic reasoning framework that wraps off-the-shelf vision-language models emph{and} pure vision systems in a emph{Think--Critique--Act} loop. While VRA incurs significant additional test-time computation, it achieves up to 40% absolute accuracy gains on challenging visual reasoning benchmarks. Future work will optimize query routing and early stopping to reduce inference overhead while preserving reliability in vision tasks.