🤖 AI Summary
Existing training algorithms for chain-of-thought reasoning in large vision-language models (LVLMs) can generalize poorly to unseen reasoning tasks and depend heavily on a biased reward model. To address this, the paper reformulates visual reasoning as posterior inference over latent chains of thought and trains the model with a scalable algorithm based on amortized variational inference. Using diversity-seeking reinforcement learning, it introduces a sparse reward function that supplies token-level learning signals, encouraging diverse, high-likelihood latent rationales while overcoming the limitations of deterministic sampling and avoiding reward hacking. Key contributions include: (i) a Bayesian inference-scaling strategy that ranks rationales and answers by marginal likelihood; and (ii) decoding that avoids costly Best-of-N and Beam Search. On seven reasoning benchmarks, the method improves state-of-the-art LVLMs in effectiveness, generalization, and interpretability.
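As a rough guide to what "variational inference over latent chains of thought" means, the textbook latent-variable identities can be written as below. This is a generic sketch, not the paper's exact objective: the symbols $x$ (image and question), $z$ (latent rationale), $y$ (answer), $p_\theta$ (the LVLM), and $q_\phi$ (an amortized approximate posterior) are notation assumed here for illustration.

```latex
% Posterior over latent rationales (Bayes' rule)
p_\theta(z \mid x, y) \;\propto\; p_\theta(z \mid x)\, p_\theta(y \mid x, z)

% Marginal likelihood of the answer and its standard evidence lower bound
\log p_\theta(y \mid x)
  = \log \sum_{z} p_\theta(z \mid x)\, p_\theta(y \mid x, z)
  \;\ge\; \mathbb{E}_{q_\phi(z \mid x, y)}\!\left[
      \log p_\theta(z \mid x) + \log p_\theta(y \mid x, z) - \log q_\phi(z \mid x, y)
    \right]
```

The bound is tight when $q_\phi(z \mid x, y)$ equals the true posterior, which is why training an amortized sampler for $z$ can be viewed as posterior inference.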
📝 Abstract
Chain-of-thought (CoT) reasoning is critical for improving the interpretability and reliability of Large Vision-Language Models (LVLMs). However, existing training algorithms such as SFT, PPO, and GRPO may generalize poorly to unseen reasoning tasks and rely heavily on a biased reward model. To address this challenge, we reformulate reasoning in LVLMs as posterior inference and propose a scalable training algorithm based on amortized variational inference. By leveraging diversity-seeking reinforcement learning algorithms, we introduce a novel sparse reward function that provides token-level learning signals and encourages diverse, high-likelihood latent CoTs, overcoming the limitations of deterministic sampling and avoiding reward hacking. Additionally, we implement a Bayesian inference-scaling strategy that replaces costly Best-of-N and Beam Search with a marginal likelihood to rank rationales and answers efficiently. We empirically demonstrate that the proposed method improves state-of-the-art LVLMs on seven reasoning benchmarks in terms of effectiveness, generalization, and interpretability.
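To make the marginal-likelihood ranking idea concrete, here is a minimal sketch, assuming the answer marginal is estimated by simple Monte Carlo averaging over sampled rationales. The function names (`logsumexp`, `rank_answers_by_marginal`), the input layout, and the numbers are hypothetical and chosen for illustration; the abstract does not specify the paper's actual estimator.

```python
import math
from collections import defaultdict


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list of floats."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))


def rank_answers_by_marginal(answer_logliks):
    """Rank candidate answers by an estimated log marginal likelihood.

    answer_logliks: one dict per sampled rationale z_i, mapping each
    candidate answer y to log p(y | x, z_i). With z_i drawn from the
    rationale sampler, p(y | x) is approximated by the average of
    p(y | x, z_i) over the N samples (an assumed estimator).
    """
    n = len(answer_logliks)
    per_answer = defaultdict(list)
    for scores in answer_logliks:
        for answer, loglik in scores.items():
            per_answer[answer].append(loglik)
    # An answer missing from a sample contributes probability 0 for it,
    # so dividing by n (not by the number of hits) keeps the average honest.
    marginals = {
        answer: logsumexp(logliks) - math.log(n)
        for answer, logliks in per_answer.items()
    }
    return sorted(marginals.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical example: three sampled rationales, two candidate answers.
samples = [
    {"A": math.log(0.9), "B": math.log(0.1)},
    {"A": math.log(0.2), "B": math.log(0.8)},
    {"A": math.log(0.7), "B": math.log(0.3)},
]
ranked = rank_answers_by_marginal(samples)
best_answer, best_log_marginal = ranked[0]
```

Unlike Best-of-N with an external reward model, this ranking uses only likelihoods from the model itself, and every sampled rationale contributes to the score of each answer rather than a single best path deciding the outcome.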