Reasoning Path Divergence: A New Metric and Curation Strategy to Unlock LLM Diverse Thinking

📅 2025-10-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) suffer from limited reasoning diversity and rigid, single-path inference due to the prevailing "one problem, one solution" training paradigm. Method: This paper proposes a "one problem, multiple solutions" training framework. Its core innovation is the first formal definition and quantification of Reasoning Path Divergence (RPD), a metric measuring semantic divergence among reasoning paths; RPD makes long chain-of-thought reasoning trajectories comparable and evaluable via step-level semantic alignment. The authors further construct a high-diversity, multi-path reasoning dataset and fine-tune Qwen3-4B-Base on it. Contribution/Results: Experiments demonstrate substantial improvements in both reasoning diversity and accuracy: average pass@16 increases by 2.80%, and performance on AIME24 improves by 4.99%. The framework establishes a measurable, optimization-friendly paradigm for enhancing reasoning diversity, which is critical for test-time reasoning scaling.

📝 Abstract
While Test-Time Scaling (TTS) has proven effective in improving the reasoning ability of large language models (LLMs), low diversity in model outputs often becomes a bottleneck; this is partly caused by the common "one problem, one solution" (1P1S) training practice, which provides a single canonical answer and can push models toward a narrow set of reasoning paths. To address this, we propose a "one problem, multiple solutions" (1PNS) training paradigm that exposes the model to a variety of valid reasoning trajectories and thus increases inference diversity. A core challenge for 1PNS is reliably measuring semantic differences between multi-step chains of thought, so we introduce Reasoning Path Divergence (RPD), a step-level metric that aligns and scores Long Chain-of-Thought solutions to capture differences in intermediate reasoning. Using RPD, we curate maximally diverse solution sets per problem and fine-tune Qwen3-4B-Base. Experiments show that RPD-selected training yields more varied outputs and higher pass@k, with an average +2.80% gain in pass@16 over a strong 1P1S baseline and a +4.99% gain on AIME24, demonstrating that 1PNS further amplifies the effectiveness of TTS. Our code is available at https://github.com/fengjujf/Reasoning-Path-Divergence.
Problem

Research questions and friction points this paper is trying to address.

Measuring semantic differences in multi-step reasoning chains
Addressing low diversity in large language model outputs
Enhancing reasoning diversity through varied training trajectories
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes one problem multiple solutions training paradigm
Introduces Reasoning Path Divergence step-level metric
Curates diverse solution sets using RPD metric
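The RPD metric and curation step above can be sketched in a few lines. This is a hypothetical illustration, not the authors' implementation: it uses token-set Jaccard distance as a stand-in for the paper's semantic step-alignment score, and the function names (`rpd`, `curate`) and the greedy max-min selection strategy are assumptions.

```python
# Hypothetical sketch of an RPD-style step-level divergence metric and a
# greedy curation routine. Jaccard distance over step token sets stands in
# for the semantic similarity model the paper actually uses.

def steps(solution: str) -> list:
    """Split a chain-of-thought into steps (one per line) as token sets."""
    return [set(line.lower().split())
            for line in solution.strip().splitlines() if line.strip()]

def step_distance(a: set, b: set) -> float:
    """Jaccard distance between two steps (proxy for semantic distance)."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0

def rpd(sol_a: str, sol_b: str) -> float:
    """Step-level divergence: match each step of A to its nearest step in B
    (and vice versa), then average the matched distances symmetrically."""
    sa, sb = steps(sol_a), steps(sol_b)
    if not sa or not sb:
        return 1.0
    d_ab = sum(min(step_distance(x, y) for y in sb) for x in sa) / len(sa)
    d_ba = sum(min(step_distance(y, x) for x in sa) for y in sb) / len(sb)
    return (d_ab + d_ba) / 2

def curate(solutions: list, k: int) -> list:
    """Greedily pick k mutually divergent solutions: each new pick maximizes
    its minimum RPD to the solutions already chosen (max-min selection)."""
    chosen = [solutions[0]]
    while len(chosen) < min(k, len(solutions)):
        best = max((s for s in solutions if s not in chosen),
                   key=lambda s: min(rpd(s, c) for c in chosen))
        chosen.append(best)
    return chosen
```

With verified solutions to one problem, `curate(solutions, k)` would yield a maximally diverse subset for 1PNS fine-tuning; identical chains score an RPD of 0, while chains taking different intermediate steps score higher.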
🔎 Similar Papers
2024-09-10 · North American Chapter of the Association for Computational Linguistics · Citations: 1