Perception in Reflection

📅 2025-04-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Current large vision-language models (LVLMs) suffer from inaccurate initial visual perception, distorted image descriptions, and frequent hallucinations. To address these issues, we propose the "Reflective Perception" (RePer) paradigm, a dual-model reflective framework in which a policy model and a critic model alternately collaborate. Our approach integrates Reflective Perceptual Learning (RPL), reflective unlikelihood training, and a methodically constructed visual reflection dataset to achieve fine-grained preference alignment and strong consistency with human attention patterns. Methodologically, it unifies multi-stage collaborative reasoning, attention interpretability modeling, and preference-aligned optimization. Experiments demonstrate significant improvements in image-understanding accuracy and description fidelity, alongside a substantial reduction in hallucination rate. Moreover, the model's attention distributions align closely with human eye-tracking data, and the framework shows superior robustness on complex reasoning and multi-step tasks.

📝 Abstract
We present a perception in reflection paradigm designed to transcend the limitations of current large vision-language models (LVLMs), which are expected yet often fail to achieve perfect perception initially. Specifically, we propose Reflective Perception (RePer), a dual-model reflection mechanism that systematically alternates between policy and critic models, enabling iterative refinement of visual perception. This framework is powered by Reflective Perceptual Learning (RPL), which reinforces intrinsic reflective capabilities through a methodically constructed visual reflection dataset and reflective unlikelihood training. Comprehensive experimental evaluation demonstrates RePer's quantifiable improvements in image understanding, captioning precision, and hallucination reduction. Notably, RePer achieves strong alignment between model attention patterns and human visual focus, while RPL optimizes fine-grained and free-form preference alignment. These advancements establish perception in reflection as a robust paradigm for future multimodal agents, particularly in tasks requiring complex reasoning and multi-step manipulation.
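The policy-critic alternation described in the abstract can be sketched as a simple loop: the policy model drafts a description, the critic model returns feedback (or accepts), and the policy revises until the critic is satisfied or a round budget runs out. This is an illustrative sketch only; `policy_model` and `critic_model` are hypothetical stand-in callables, not the paper's actual interfaces.

```python
def reflective_perception(image, policy_model, critic_model, max_rounds=3):
    """Iteratively refine a visual description via policy-critic alternation.

    policy_model(image, feedback) -> description (feedback=None on first pass)
    critic_model(image, description) -> feedback string, or None to accept
    """
    # Initial perception pass by the policy model.
    description = policy_model(image, feedback=None)
    for _ in range(max_rounds):
        feedback = critic_model(image, description)
        if feedback is None:
            break  # critic accepts the current description
        # Revise the description in light of the critic's feedback.
        description = policy_model(image, feedback=feedback)
    return description
```

The round budget bounds cost while still allowing multiple reflection steps on hard images; a critic that accepts immediately reduces the loop to a single forward pass.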
Problem

Research questions and friction points this paper is trying to address.

Overcoming limitations of large vision-language models in perception
Enhancing visual perception through dual-model reflection mechanism
Improving image understanding and reducing hallucination in models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-model reflection mechanism for perception refinement
Reflective Perceptual Learning with visual reflection dataset
Iterative refinement improves image understanding and alignment
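The reflective unlikelihood training mentioned in the abstract builds on the general unlikelihood idea: preferred tokens are trained with the usual negative log-likelihood, while tokens from rejected (e.g. hallucinated) descriptions are pushed down via a -log(1 - p) term. The sketch below shows that generic objective in plain Python; the paper's exact loss formulation may differ.

```python
import math

def unlikelihood_loss(token_probs, negative_mask, eps=1e-8):
    """Generic unlikelihood-style objective over a token sequence.

    token_probs[t]   -- model probability assigned to token t
    negative_mask[t] -- True if token t comes from a rejected description
    """
    loss = 0.0
    for p, is_negative in zip(token_probs, negative_mask):
        if is_negative:
            # Penalize assigning high probability to rejected tokens.
            loss += -math.log(1.0 - p + eps)
        else:
            # Standard likelihood term for preferred tokens.
            loss += -math.log(p + eps)
    return loss / len(token_probs)
```

Confident preferred tokens and suppressed negative tokens both drive the loss toward zero, so the same objective simultaneously reinforces grounded descriptions and discourages hallucinated ones.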