STOP: Structured On-Policy Pruning of Long-Form Reasoning in Low-Data Regimes

📅 2026-05-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses overthinking in long chain-of-thought reasoning, which raises inference cost and latency and is especially hard to correct in low-data fine-tuning, where large-scale teacher distillation is unavailable. The authors propose STOP (Structured On-policy Pruning), an on-policy method that prunes reasoning traces without teacher supervision. STOP generates self-distilled traces from the model itself, then structures each trace through node segmentation, taxonomy annotation, and reasoning-tree construction. Its Earliest Correct Node (ECN) strategy keeps the shortest prefix ending at the earliest node that both acts as an answering conclusion and yields the correct final answer, which removes redundant post-solution reasoning while preserving semantic continuity. Evaluated with DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-LLaMA-3-8B on GSM8K, Math 500, and AIME 2024, STOP reduces generated tokens by 19.4%–42.4% while largely preserving accuracy. The authors' analyses also show that it induces much smaller distributional shift than teacher-guided pruning and shifts reasoning effort away from redundant verification and backtracking toward more productive exploration.

📝 Abstract

Long chain-of-thought (Long CoT) reasoning improves performance on multi-step problems, but it also induces overthinking: models often generate low-yield reasoning that increases inference cost and latency. This inefficiency is especially problematic in low-data fine-tuning regimes, where real applications adapt reasoning models with limited supervision and cannot rely on large-scale teacher distillation or heavy test-time control. To address this, we propose STOP (Structured On-policy Pruning), an on-policy algorithm for analyzing and pruning long-form reasoning traces. STOP constructs self-distilled traces from the model. Then it maps each trace into a structured reasoning interface through node segmentation, taxonomy annotation, and reasoning-tree construction. On top of this interface, we introduce ECN (Earliest Correct Node), which retains the shortest prefix ending at the earliest node that both functions as an answering conclusion and yields the correct final answer, removing redundant post-solution reasoning while preserving semantic continuity. Experiments on DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-LLaMA-3-8B across GSM8K, Math 500, and AIME 2024 show that STOP reduces generated tokens by 19.4-42.4% while largely preserving accuracy in low-data fine-tuning. Beyond efficiency, our analyses show that STOP induces much smaller distributional shift than teacher-guided pruning, improves the structural efficiency of generated reasoning, and reallocates reasoning effort away from redundant verification and backtracking toward more productive exploration.
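
The ECN rule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Node` class, the label names ("exploration", "verification", "conclusion"), the exact-match answer check, and the flat list standing in for a reasoning tree are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """One segmented reasoning step (hypothetical structure)."""
    text: str
    label: str                    # taxonomy label; names here are assumed
    answer: Optional[str] = None  # answer stated by this node, if any


def ecn_prune(nodes: List[Node], gold: str) -> Optional[List[Node]]:
    """Return the shortest prefix ending at the earliest node that is an
    answering conclusion AND states the correct final answer.
    Returns None when no node qualifies."""
    for i, node in enumerate(nodes):
        if node.label == "conclusion" and node.answer == gold:
            return nodes[: i + 1]
    return None


# Toy trace: a wrong conclusion, a correction, then post-solution checking.
trace = [
    Node("Set up 2x + 3 = 13.", "exploration"),
    Node("So x = 4.", "conclusion", "4"),
    Node("Wait, 2*4 + 3 = 11, not 13.", "verification"),
    Node("Then 2x = 10, so x = 5.", "conclusion", "5"),
    Node("Let me double-check: 2*5 + 3 = 13.", "verification"),
    Node("The answer is 5.", "conclusion", "5"),
]

pruned = ecn_prune(trace, gold="5")
kept = len(pruned) if pruned is not None else 0
print(f"kept {kept} of {len(trace)} nodes")
```

In this toy trace the wrong conclusion at the second node is skipped because its answer does not match, and the two post-solution nodes after the first correct conclusion are dropped. Because the kept span is a contiguous prefix of the model's own output, the result stays on-policy and keeps its continuity.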

Problem

Research questions and friction points this paper is trying to address.

Long chain-of-thought reasoning

overthinking

low-data fine-tuning

reasoning efficiency

redundant reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Structured On-Policy Pruning

Long Chain-of-Thought Reasoning

Earliest Correct Node