🤖 AI Summary
Large language models (LLMs) often show low reasoning accuracy, weak self-reflection, and limited decision transparency on complex reasoning tasks. To address this, the paper proposes a dual-model collaborative verbal reflection framework that operates at inference time. It decouples reasoning from critique in a “reasoning–critique” architecture and combines contrastive verbal reflection synthesis with dynamic collaboration between the two models, so that reasoning is critiqued and refined in real time under a verbal reinforcement learning paradigm. By explicitly contrasting the specialized reasoning model's outputs with the generated critiques, the framework aims to improve both reasoning accuracy and process interpretability. Experiments show that the approach outperforms conventional preference optimization methods across all evaluation metrics, supporting the claim that dual-model collaboration improves both reasoning performance and transparency.
📝 Abstract
Large Language Models (LLMs) often struggle with complex reasoning scenarios. While preference optimization methods enhance reasoning performance through training, they often lack transparency about why one reasoning outcome is preferred over another. Verbal reflection techniques improve explainability but are limited by LLMs' capacity for critique and refinement. To address these challenges, we introduce a contrastive reflection synthesis pipeline that enhances the accuracy and depth of LLM-generated reflections. We further propose a dual-model reasoning framework within a verbal reinforcement learning paradigm, decoupling inference-time self-reflection into specialized, trained models for reasoning critique and refinement. Extensive experiments show that our framework outperforms traditional preference optimization methods across all evaluation metrics. Our findings also show that "two heads are better than one": a collaborative Reasoner-Critic model achieves superior reasoning performance and transparency compared to single-model approaches.
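To make the Reasoner-Critic split concrete, the following is a minimal sketch of one possible inference-time loop, not the authors' implementation. The abstract does not specify any interface, so `reason_with_critic`, the `Reasoner` and `Critic` signatures, the `Trace` record, and the `max_rounds` cap are all hypothetical names introduced for illustration. The two toy functions stand in for the trained models.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not taken from the paper):
# a Reasoner maps (question, critiques received so far) to an answer;
# a Critic maps (question, answer) to (accepted?, verbal critique).
Reasoner = Callable[[str, List[str]], str]
Critic = Callable[[str, str], Tuple[bool, str]]


@dataclass
class Trace:
    answer: str           # last answer produced by the Reasoner
    critiques: List[str]  # verbal critiques, kept for transparency
    rounds: int           # number of Critic calls made


def reason_with_critic(question: str, reasoner: Reasoner, critic: Critic,
                       max_rounds: int = 3) -> Trace:
    """Alternate Reasoner and Critic until acceptance or the round cap."""
    critiques: List[str] = []
    answer = reasoner(question, critiques)
    for round_idx in range(1, max_rounds + 1):
        accepted, critique = critic(question, answer)
        if accepted:
            return Trace(answer, critiques, round_idx)
        # Feed the verbal critique back so the Reasoner can refine.
        critiques.append(critique)
        answer = reasoner(question, critiques)
    # Cap reached: the final refinement is returned without another check.
    return Trace(answer, critiques, max_rounds)


# Toy stand-ins for the two models, for illustration only.
def toy_reasoner(question: str, critiques: List[str]) -> str:
    # Answers wrongly until it has received at least one critique.
    return "4" if critiques else "5"


def toy_critic(question: str, answer: str) -> Tuple[bool, str]:
    if answer == "4":
        return True, ""
    return False, "2 + 2 is not 5; recheck the addition."


trace = reason_with_critic("What is 2 + 2?", toy_reasoner, toy_critic)
```

In this sketch the stored critiques double as the transparency record: each rejected answer leaves behind a verbal explanation of why it was rejected, which is the kind of signal a scalar preference score would not carry.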