Latent Chain-of-Thought for Visual Reasoning

📅 2025-10-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address weak generalization, reliance on biased reward models, and poor interpretability in visual reasoning with large vision-language models, this paper proposes a latent-variable-based visual reasoning framework. Methodologically, it formalizes reasoning as variational inference over implicit chains of thought, integrating amortized variational inference, diversity-aware reinforcement learning, and Bayesian posterior modeling, while introducing a sparse token-level reward function to eliminate dependence on deterministic sampling and costly search. Key contributions include: (i) the first integration of marginal likelihood evaluation with latent-variable decoding to enable reward-robust, diverse path ranking; and (ii) efficient, search-free decoding of high-quality reasoning chains. Evaluated on seven visual reasoning benchmarks, the method achieves new state-of-the-art performance, significantly improving accuracy, out-of-distribution generalization, and interpretability of the reasoning process.
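The "reasoning as variational inference" framing described above is, presumably, the standard variational treatment of a latent chain of thought; the sketch below is a generic reconstruction with assumed symbols ($x$: image–question input, $z$: latent CoT, $y$: answer), not the paper's exact objective.

```latex
\log p_\theta(y \mid x)
  = \log \sum_{z} p_\theta(y \mid z, x)\, p_\theta(z \mid x)
  \;\geq\; \mathbb{E}_{q_\phi(z \mid x, y)}\!\left[\log p_\theta(y \mid z, x)\right]
  - \mathrm{KL}\!\left(q_\phi(z \mid x, y) \,\|\, p_\theta(z \mid x)\right)
```

Under this reading, amortized variational inference trains the rationale sampler $q_\phi$ to maximize the evidence lower bound, so that sampled chains are both diverse (via the KL term) and predictive of the answer (via the likelihood term).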

📝 Abstract
Chain-of-thought (CoT) reasoning is critical for improving the interpretability and reliability of Large Vision-Language Models (LVLMs). However, existing training algorithms such as SFT, PPO, and GRPO may not generalize well across unseen reasoning tasks and heavily rely on a biased reward model. To address this challenge, we reformulate reasoning in LVLMs as posterior inference and propose a scalable training algorithm based on amortized variational inference. By leveraging diversity-seeking reinforcement learning algorithms, we introduce a novel sparse reward function for token-level learning signals that encourage diverse, high-likelihood latent CoT, overcoming deterministic sampling limitations and avoiding reward hacking. Additionally, we implement a Bayesian inference-scaling strategy that replaces costly Best-of-N and Beam Search with a marginal likelihood to efficiently rank optimal rationales and answers. We empirically demonstrate that the proposed method enhances the state-of-the-art LVLMs on seven reasoning benchmarks, in terms of effectiveness, generalization, and interpretability.
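The abstract's marginal-likelihood ranking, which replaces Best-of-N and Beam Search, can be sketched as a Monte-Carlo estimate over sampled latent chains. All function names and numbers below are illustrative assumptions, not the paper's implementation: given `K` sampled chains per candidate answer and the model's log-probability of the answer under each chain, the answer's marginal likelihood is approximated by averaging over chains.

```python
import math

def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def marginal_log_likelihood(answer_logps):
    """Approximate log p(answer | x) by marginalizing over K sampled
    latent chains z_k:  log (1/K) * sum_k p(answer | z_k, x).
    `answer_logps` holds log p(answer | z_k, x) for each sampled chain."""
    k = len(answer_logps)
    return logsumexp(answer_logps) - math.log(k)

def rank_answers(candidates):
    """candidates: {answer: [log p(answer | z_k, x) per sampled chain]}.
    Returns answers sorted by approximate marginal likelihood, best first."""
    return sorted(candidates,
                  key=lambda a: marginal_log_likelihood(candidates[a]),
                  reverse=True)
```

Ranking by this marginal rewards answers that are likely under many sampled rationales, rather than answers backed by a single high-scoring beam, which is one plausible reading of how the method avoids costly search.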
Problem

Research questions and friction points this paper is trying to address.

Improves reasoning generalization in vision-language models
Addresses biased reward dependency in reasoning training
Enhances interpretability through latent chain-of-thought inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reformulates reasoning as posterior inference
Uses diversity-seeking RL with sparse rewards
Implements Bayesian scaling via marginal likelihood
👥 Authors

Guohao Sun
Rochester Institute of Technology

Hang Hua
University of Rochester
Computer Vision, Natural Language Processing, Machine Learning

Jian Wang
Snap Inc.

Jiebo Luo
University of Rochester

Sohail Dianat
Rochester Institute of Technology

Majid Rabbani
Fellow, Kodak
Image and Video Processing & Analysis

Raghuveer Rao
DEVCOM Army Research Laboratory

Zhiqiang Tao
Assistant Professor, Rochester Institute of Technology
Machine Learning, Data Mining, Deep Learning, Computer Vision