Knowing the Answer Isn't Enough: Fixing Reasoning Path Failures in LVLMs

📅 2025-12-05
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Large Vision-Language Models (LVLMs) suffer from reasoning path selection bias: even when they possess the correct knowledge, they tend to sample unstable and logically inconsistent reasoning paths, yielding unreliable answers. To address this, the authors propose Path-Select Optimization (PSO), a two-stage post-training framework that enhances reasoning stability and accuracy. PSO integrates Group Relative Policy Optimization (GRPO) with joint template-and-answer reward modeling, a self-evaluation mechanism over self-generated reasoning paths, and a negative-sample replay buffer to suppress recurrent errors. Evaluated on multi-task visual reasoning benchmarks, PSO achieves an average accuracy gain of 7.4%, significantly reduces the proportion of invalid reasoning, and improves the consistency and robustness of chain-of-thought generation. This work is the first to systematically identify and mitigate the "know-but-misreason" path bias in LVLMs, establishing a new paradigm for trustworthy multimodal reasoning.

๐Ÿ“ Abstract
We reveal a critical yet underexplored flaw in Large Vision-Language Models (LVLMs): even when these models know the correct answer, they frequently arrive there through incorrect reasoning paths. The core issue is not a lack of knowledge, but a path selection bias within the vast reasoning search space. Although LVLMs are often capable of sampling correct solution trajectories, they disproportionately favor unstable or logically inconsistent ones, leading to erratic and unreliable outcomes. The substantial disparity between Pass@K (with large K) and Pass@1 across numerous models provides compelling evidence that such failures primarily stem from misreasoning rather than ignorance. To systematically investigate and address this issue, we propose PSO (Path-Select Optimization), a two-stage post-training framework designed to enhance both the reasoning performance and stability of existing LVLMs. In the first stage, we employ Group Relative Policy Optimization (GRPO) with template and answer-based rewards to cultivate structured, step-by-step reasoning. In the second stage, we conduct online preference optimization, where the model samples reasoning paths from GRPO-generated data, self-evaluates them, and aligns itself toward the preferred trajectories. Incorrect or suboptimal paths are concurrently stored in a Negative Replay Memory (NRM) as hard negatives, which are periodically revisited to prevent the model from repeating prior mistakes and to facilitate continual reasoning refinement. Extensive experiments show that PSO effectively prunes invalid reasoning paths, substantially enhances reasoning accuracy (with 7.4% improvements on average), and yields more stable and consistent chains of thought. Our code will be available at https://github.com/aiming-lab/PSO.
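The Pass@K-vs-Pass@1 disparity cited above is the paper's key diagnostic: if a model solves a task within K samples but rarely on the first, the failure is path selection, not missing knowledge. The paper does not spell out its estimator here, but the standard unbiased Pass@k formula (compute the chance that k draws from n generations, c of them correct, miss every correct one) can be sketched as follows; the function name and parameters are illustrative:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator.

    n: total generations sampled per problem
    c: how many of those n generations were correct
    k: budget of attempts

    Returns 1 - C(n - c, k) / C(n, k): the probability that at least
    one of k samples drawn without replacement is correct.
    """
    if n - c < k:
        # Fewer than k incorrect generations exist, so any k-subset
        # must contain a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# A model that is right on half its samples: Pass@1 = 0.5,
# but Pass@8 is already near 1 - evidence of "know-but-misreason".
print(pass_at_k(16, 8, 1))
print(pass_at_k(16, 8, 8))
```

A large gap between the two printed values for the same problem set is exactly the signature the abstract describes: the correct trajectory is in the model's search space, but it is sampled unreliably.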
Problem

Research questions and friction points this paper is trying to address.

Addresses reasoning path failures in Large Vision-Language Models.
Mitigates path selection bias causing unstable or inconsistent reasoning.
Enhances reasoning accuracy and stability through post-training optimization.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage post-training framework for reasoning enhancement
Group Relative Policy Optimization with structured reward signals
Online preference optimization with Negative Replay Memory
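The mechanics behind these bullets can be sketched in a few lines. This is not the authors' implementation; it assumes the standard GRPO recipe (advantages computed by normalizing rewards within a sampled group, here with a combined template + answer reward) and models the Negative Replay Memory as a simple bounded buffer of hard-negative paths that is periodically resampled. All names, weights, and the buffer policy are illustrative:

```python
import random
import statistics
from collections import deque

def combined_reward(follows_template: bool, answer_correct: bool,
                    w_template: float = 0.3, w_answer: float = 0.7) -> float:
    """Joint template/answer reward: pays partly for structured,
    step-by-step formatting and mostly for the final answer."""
    return w_template * follows_template + w_answer * answer_correct

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: z-score each reward against its own
    group of sampled reasoning paths (no learned value function)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0] * len(rewards)  # all paths tied: no signal
    return [(r - mean) / std for r in rewards]

class NegativeReplayMemory:
    """Bounded FIFO store of incorrect/suboptimal reasoning paths,
    revisited periodically as hard negatives."""
    def __init__(self, capacity: int = 256):
        self.buffer: deque[str] = deque(maxlen=capacity)

    def add(self, path: str) -> None:
        self.buffer.append(path)

    def sample(self, k: int) -> list[str]:
        return random.sample(list(self.buffer), min(k, len(self.buffer)))

# One GRPO group of 4 sampled paths for the same prompt:
rewards = [combined_reward(True, True),    # structured and correct
           combined_reward(True, False),   # well-formatted but wrong
           combined_reward(False, True),   # correct but unstructured
           combined_reward(False, False)]  # neither
advantages = group_relative_advantages(rewards)
```

Paths with below-group-mean reward get negative advantages (pushed down by the policy update) and would also be added to the replay memory, so the model keeps training against its own recurring failure modes rather than only against fresh samples.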