🤖 AI Summary
Large language models (LLMs) suffer from limited reasoning diversity and rigid, single-path inference under the prevailing "one problem, one solution" training paradigm. Method: This paper proposes a "one problem, multiple solutions" training framework. Its core innovation is the first formal definition and quantification of Reasoning Path Divergence (RPD), a metric measuring semantic divergence among reasoning paths; via step-level semantic alignment, RPD makes long chain-of-thought trajectories comparable and evaluable. The paper further constructs a high-diversity, multi-path reasoning dataset and fine-tunes Qwen3-4B-Base on it. Contribution/Results: Experiments demonstrate substantial improvements in both reasoning diversity and accuracy: average pass@16 increases by 2.80%, and performance on AIME24 improves by 4.99%. The framework establishes a measurable, optimization-friendly paradigm for enhancing reasoning diversity, which is critical for test-time scaling.
📝 Abstract
While Test-Time Scaling (TTS) has proven effective in improving the reasoning ability of large language models (LLMs), low diversity in model outputs often becomes a bottleneck; this is partly caused by the common "one problem, one solution" (1P1S) training practice, which provides a single canonical answer and can push models toward a narrow set of reasoning paths. To address this, we propose a "one problem, multiple solutions" (1PNS) training paradigm that exposes the model to a variety of valid reasoning trajectories and thus increases inference diversity. A core challenge for 1PNS is reliably measuring semantic differences between multi-step chains of thought, so we introduce Reasoning Path Divergence (RPD), a step-level metric that aligns and scores Long Chain-of-Thought solutions to capture differences in intermediate reasoning. Using RPD, we curate maximally diverse solution sets per problem and fine-tune Qwen3-4B-Base. Experiments show that RPD-selected training yields more varied outputs and higher pass@k, with an average +2.80% gain in pass@16 over a strong 1P1S baseline and a +4.99% gain on AIME24, demonstrating that 1PNS further amplifies the effectiveness of TTS. Our code is available at https://github.com/fengjujf/Reasoning-Path-Divergence.
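The curation step described above, selecting a maximally diverse subset of solutions per problem under a pairwise divergence metric, can be sketched with a simple greedy max-min heuristic. This is an illustrative sketch only: the `rpd` function is a hypothetical stand-in for the paper's step-level alignment-and-scoring metric, and the paper's actual selection procedure may differ.

```python
from itertools import combinations

def select_diverse_solutions(solutions, rpd, k):
    """Greedily pick k solutions that are maximally mutually divergent.

    `rpd(a, b)` is assumed to return a Reasoning Path Divergence score
    between two chain-of-thought solutions (higher = more semantically
    different); the paper's exact metric is not reproduced here.
    """
    if len(solutions) <= k:
        return list(solutions)
    # Seed the selection with the single most divergent pair.
    seed = max(combinations(range(len(solutions)), 2),
               key=lambda p: rpd(solutions[p[0]], solutions[p[1]]))
    chosen = list(seed)
    while len(chosen) < k:
        # Add the candidate whose minimum divergence to the already
        # chosen set is largest (max-min diversity criterion).
        best = max((i for i in range(len(solutions)) if i not in chosen),
                   key=lambda i: min(rpd(solutions[i], solutions[j])
                                     for j in chosen))
        chosen.append(best)
    return [solutions[i] for i in chosen]
```

With a toy divergence such as absolute distance on numbers, the heuristic keeps the most spread-out items, mirroring how an RPD-style metric would keep the most semantically distinct reasoning paths.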