Latent Chain-of-Thought for Visual Reasoning

📅 2025-10-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address weak generalization, reliance on biased reward models, and poor interpretability in visual reasoning with large vision-language models (LVLMs), this paper proposes a latent-variable framework for visual reasoning. It treats the chain of thought as a latent variable and casts reasoning as posterior inference over it, trained with amortized variational inference and diversity-seeking reinforcement learning. A sparse reward function supplies token-level learning signals that favor diverse, high-likelihood rationales, which addresses the limitations of deterministic sampling and reduces reward hacking. Key contributions include: (i) a Bayesian inference-scaling strategy that ranks rationales and answers by marginal likelihood; and (ii) decoding that avoids costly Best-of-N and Beam Search. Evaluated on seven reasoning benchmarks, the method improves state-of-the-art LVLMs in effectiveness, generalization, and interpretability.

📝 Abstract

Chain-of-thought (CoT) reasoning is critical for improving the interpretability and reliability of Large Vision-Language Models (LVLMs). However, existing training algorithms such as SFT, PPO, and GRPO may not generalize well to unseen reasoning tasks and rely heavily on a biased reward model. To address this challenge, we reformulate reasoning in LVLMs as posterior inference and propose a scalable training algorithm based on amortized variational inference. By leveraging diversity-seeking reinforcement learning algorithms, we introduce a novel sparse reward function for token-level learning signals that encourage diverse, high-likelihood latent CoT, overcoming deterministic sampling limitations and avoiding reward hacking. Additionally, we implement a Bayesian inference-scaling strategy that replaces costly Best-of-N and Beam Search with a marginal likelihood to efficiently rank optimal rationales and answers. We empirically demonstrate that the proposed method improves state-of-the-art LVLMs on seven reasoning benchmarks in terms of effectiveness, generalization, and interpretability.
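
The posterior-inference view in the abstract can be written out with a standard latent-variable formulation. The notation below is an assumption made for illustration, not taken from the paper: x is the image and question, z is the latent chain of thought, y is the answer, and q_φ is an amortized approximate posterior. The last line is the usual evidence lower bound that amortized variational inference optimizes; the paper's actual training objective may differ.

```latex
% Assumed notation: x = image and question, z = latent chain of thought, y = answer.
\begin{align}
% Marginal likelihood of the answer, summing over all rationales
p_\theta(y \mid x) &= \sum_{z} p_\theta(z \mid x)\, p_\theta(y \mid x, z) \\
% Posterior over rationales given the observed answer (Bayes' rule)
p_\theta(z \mid x, y) &= \frac{p_\theta(z \mid x)\, p_\theta(y \mid x, z)}{p_\theta(y \mid x)} \\
% Evidence lower bound with an amortized approximate posterior q_\phi
\log p_\theta(y \mid x) &\ge \mathbb{E}_{q_\phi(z \mid x, y)}\big[\log p_\theta(z \mid x) + \log p_\theta(y \mid x, z) - \log q_\phi(z \mid x, y)\big]
\end{align}
```

The bound is tight when q_φ(z | x, y) equals the true posterior, which is why sampling diverse, high-likelihood rationales matters more than finding a single best one.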

Problem

Research questions and friction points this paper is trying to address.

Existing training algorithms (SFT, PPO, GRPO) may not generalize well to unseen reasoning tasks

Reasoning training relies heavily on a biased reward model

Interpretability of the reasoning process is limited, motivating latent chain-of-thought inference

Innovation

Methods, ideas, or system contributions that make the work stand out.

Reformulates reasoning as posterior inference

Uses diversity-seeking RL with sparse rewards

Implements Bayesian scaling via marginal likelihood
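
As a rough illustration of ranking by marginal likelihood, the sketch below estimates p(y | x) for each candidate answer by averaging p(y | x, z_i) over sampled rationales z_i, then sorts the answers. This is a minimal Monte Carlo sketch under stated assumptions, not the paper's implementation: the function names and the input layout (one dict of answer log-probabilities per sampled rationale) are hypothetical, and the log-probabilities are assumed to come from the model.

```python
import math
from collections import defaultdict


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list of floats."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))


def rank_answers_by_marginal(answer_logps):
    """Rank candidate answers by an estimated log marginal likelihood.

    answer_logps: hypothetical input, one dict per sampled rationale z_i,
    mapping answer -> log p(answer | x, z_i). An answer missing from a
    dict is treated as having probability zero under that rationale.
    Returns a list of (answer, log_marginal) sorted from best to worst.
    """
    n = len(answer_logps)
    per_answer = defaultdict(list)
    for scores in answer_logps:
        for answer, logp in scores.items():
            per_answer[answer].append(logp)
    # log of (1/n) * sum_i p(answer | x, z_i)
    marginals = {
        answer: logsumexp(logps) - math.log(n)
        for answer, logps in per_answer.items()
    }
    return sorted(marginals.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Three sampled rationales, two candidate answers (made-up numbers).
    samples = [
        {"A": math.log(0.9), "B": math.log(0.1)},
        {"A": math.log(0.2), "B": math.log(0.8)},
        {"A": math.log(0.7), "B": math.log(0.3)},
    ]
    for answer, log_marginal in rank_answers_by_marginal(samples):
        print(answer, round(math.exp(log_marginal), 3))
```

Unlike Best-of-N with an external reward model, this ranking uses only the model's own likelihoods, averaged over rationales rather than taken from the single highest-scoring one.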
