Self-Prophetic Decoding to Unlock Visual Search in LVLMs

📅 2026-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of large vision-language models (LVLMs) in visual search, which stem from incompatibility among intrinsic capabilities after post-training and from interference in long multi-step reasoning contexts. To overcome these challenges, the authors propose Self-Prophetic Decoding (SeProD), a plug-and-play framework that requires no additional training. SeProD combines self-regulation between the pre-trained and post-trained models with probability-based prophetic sampling: the pre-trained model proposes tokens, and the post-trained model selectively accepts them under its own output distribution. This lets the model's intrinsic single-step capabilities support coherent multi-step visual search. Because acceptance runs in parallel, the authors report no added computational overhead, and SeProD consistently improves performance across all 12 splits of 4 visual search benchmarks as well as general visual question answering (VQA) benchmarks.
📝 Abstract
Large Vision-Language Models (LVLMs) are rapidly evolving toward true multimodal reasoning, with visual search representing a concrete instantiation of the thinking-with-images paradigm. However, LVLM visual search faces two key challenges: incompatibility among intrinsic capabilities after post-training, and interference in long multi-step reasoning contexts. To address these, we identify two novel insights. First, self-regulation between pre- and post-training LVLMs leverages the intrinsic single-step capabilities of the pre-training model to mitigate capability deterioration and long-context interference. Second, probability-based prophetic sampling, replacing naive prompting, provides a probabilistic interface where the pre-training model acts as a prophet and the post-training model selectively accepts prophetic tokens under its output distribution, preserving coherent multi-step reasoning. Building on these insights, we introduce SeProD, a self-prophetic decoding framework that leverages intrinsic single-step capabilities to enable coherent multi-step reasoning in a training-free, plug-and-play manner. Experiments show that SeProD consistently improves multiple visual-search LVLMs across all 12 splits of 4 visual search benchmarks, as well as across general VQA benchmarks, without added computational overhead, thanks to its parallel prophetic acceptance mechanism.
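The abstract does not spell out the acceptance rule, so the following is only a minimal sketch of how a "prophet proposes, post-trained model accepts" interface could look. It assumes the standard speculative-sampling rule (accept a proposed token with probability min(1, p_post / p_prophet), stop at the first rejection), which may differ from what SeProD actually does. The function name `accept_prophetic_tokens` and all probabilities are hypothetical, and the model calls are replaced by precomputed per-token probabilities.

```python
import random


def accept_prophetic_tokens(tokens, prophet_probs, post_probs, rng):
    """Return the prefix of `tokens` accepted by the post-trained model.

    tokens        -- tokens proposed by the prophet (pre-trained model)
    prophet_probs -- probability the prophet assigned to each proposed token
    post_probs    -- probability the post-trained model assigns to the same
                     token at the same position (computable in one parallel pass)
    rng           -- a random.Random instance

    Assumption: speculative-sampling acceptance, not the paper's confirmed rule.
    """
    accepted = []
    for tok, q, p in zip(tokens, prophet_probs, post_probs):
        if q <= 0.0:
            raise ValueError("prophet probability must be positive")
        # Accept with probability min(1, p / q); stop at the first rejection
        # so the accepted tokens always form a contiguous prefix.
        if rng.random() < min(1.0, p / q):
            accepted.append(tok)
        else:
            break
    return accepted


# Hypothetical example: the post-trained model agrees with the first two
# proposed tokens and assigns zero probability to the third.
tokens = ["<box>", "12", ",", "40"]
prophet_probs = [0.5, 0.4, 0.9, 0.3]
post_probs = [0.6, 0.4, 0.0, 0.8]
accepted = accept_prophetic_tokens(tokens, prophet_probs, post_probs, random.Random(0))
print(accepted)  # ['<box>', '12']
```

In this toy case the result does not depend on the random seed: the first two ratios are at least 1, so those tokens are always accepted, and the third ratio is 0, so it is always rejected. After a rejection, the post-trained model would resume decoding from its own distribution.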
Problem

Research questions and friction points this paper is trying to address.

visual search
Large Vision-Language Models
multi-step reasoning
capability incompatibility
reasoning interference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-Prophetic Decoding
Visual Search
Multimodal Reasoning
Training-Free Framework
Prophetic Sampling