V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators

📅 2026-03-30
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the susceptibility of multimodal large language models to perceptual hallucinations in fine-grained tasks, which stems from static visual prefixes and the absence of dynamic verification mechanisms. The authors propose V-Reflection, a framework that introduces a "think-then-look" visual reflection mechanism, enabling the model to actively query the visual feature space during reasoning, using its hidden states as probes, so that each inference step is anchored to visual evidence. This shifts the model from passive observation to active interrogation of visual inputs, internalizing critical evidence localization through a two-stage distillation process without incurring additional inference overhead. Specifically, a Box-Guided Compression (BCM) module provides spatially explicit pixel compression, while a Dynamic Autoregressive Compression (DAC) module performs hidden-state-driven dynamic querying; knowledge distillation then transfers BCM's spatial grounding capability to DAC. Experiments show that V-Reflection significantly narrows the fine-grained perception gap across six perception-intensive benchmarks, and visualizations confirm that it autonomously locates task-relevant evidence while preserving the efficiency of end-to-end autoregressive decoding.
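The hidden-state-as-probe idea in the summary can be illustrated as a single-query cross-attention over the visual feature map: the current latent state is projected into a query, attended against patch features, and the resulting attention map localizes evidence. This is a conceptual sketch only; the function name, shapes, and single-query form are assumptions, not the paper's implementation.

```python
import numpy as np

def latent_probe(hidden_state, visual_feats, Wq, Wk, Wv):
    """Hypothetical hidden-state probe: one cross-attention query per step.

    hidden_state: (B, d_h)     current decoding step's latent state
    visual_feats: (B, N, d_v)  patch-level visual feature map
    Wq, Wk, Wv:   projection matrices (assumed learned elsewhere)
    """
    q = hidden_state @ Wq                           # (B, d_p) probe query
    k = visual_feats @ Wk                           # (B, N, d_p) patch keys
    v = visual_feats @ Wv                           # (B, N, d_h) patch values
    scores = np.einsum('bd,bnd->bn', q, k) / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)        # (B, N) evidence map
    evidence = np.einsum('bn,bnd->bd', attn, v)     # (B, d_h) grounded context
    return evidence, attn
```

The returned attention map is what visualizations of "autonomous evidence localization" would plot; the grounded context is what would feed back into the reasoning stream.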
πŸ“ Abstract
Multimodal Large Language Models (MLLMs) have achieved remarkable success, yet they remain prone to perception-related hallucinations in fine-grained tasks. This vulnerability arises from a fundamental limitation: their reasoning is largely restricted to the language domain, treating visual input as a static, reasoning-agnostic preamble rather than a dynamic participant. Consequently, current models act as passive observers, unable to re-examine visual details to ground their evolving reasoning states. To overcome this, we propose V-Reflection, a framework that transforms the MLLM into an active interrogator through a "think-then-look" visual reflection mechanism. During reasoning, latent states function as dynamic probes that actively interrogate the visual feature space, grounding each reasoning step in task-critical evidence. Our approach employs a two-stage distillation strategy. First, the Box-Guided Compression (BCM) module establishes stable pixel-to-latent targets through explicit spatial grounding. Next, a Dynamic Autoregressive Compression (DAC) module maps the model's hidden states into dynamic probes that interrogate the global visual feature map. By distilling the spatial expertise of the BCM teacher into the DAC student, V-Reflection internalizes the ability to localize task-critical evidence. During inference, both modules remain entirely inactive, maintaining purely end-to-end autoregressive decoding in the latent space with optimal efficiency. Extensive experiments demonstrate the effectiveness of V-Reflection across six perception-intensive benchmarks, significantly narrowing the fine-grained perception gap. Visualizations confirm that latent reasoning autonomously localizes task-critical visual evidence.
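The teacher-student distillation described in the abstract can be sketched loosely: a box-supervised teacher (BCM-style) supplies a target attention distribution over image patches, and a divergence objective pulls the student's probe attention toward it. This is a minimal sketch under assumed shapes and an assumed KL loss; the paper's actual modules and objective are not reproduced here.

```python
import numpy as np

def box_teacher_map(n_patches_side, box):
    """Hypothetical BCM-style teacher target: near-uniform attention over the
    patches inside a ground-truth box, near-zero elsewhere.

    box: (x0, y0, x1, y1) in patch-grid coordinates (an assumed convention).
    """
    x0, y0, x1, y1 = box
    m = np.full((n_patches_side, n_patches_side), 1e-6)
    m[y0:y1, x0:x1] = 1.0
    m = m.flatten()
    return m / m.sum()                         # (N,) normalized target

def distill_loss(student_attn, teacher_attn, eps=1e-8):
    """KL(teacher || student) over the patch dimension, averaged over batch.
    A common distillation objective; the paper's exact loss is not specified here."""
    kl = np.sum(teacher_attn * np.log((teacher_attn + eps) / (student_attn + eps)),
                axis=-1)
    return float(np.mean(kl))
```

At convergence the student's hidden-state probe reproduces the box-grounded evidence map without needing boxes at test time, which is why both compression modules can stay inactive during inference.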
Problem

Research questions and friction points this paper is trying to address.

Multimodal Large Language Models
perception hallucination
fine-grained tasks
visual reasoning
passive observer
Innovation

Methods, ideas, or system contributions that make the work stand out.

V-Reflection
Multimodal Large Language Models
Visual Reflection
Dynamic Autoregressive Compression
Box-Guided Compression