SSPO: Self-traced Step-wise Preference Optimization for Process Supervision and Reasoning Compression

📅 2025-08-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address error accumulation and overthinking in large language models (LLMs) caused by excessively lengthy chain-of-thought (CoT) reasoning steps, this paper proposes a plug-and-play, fine-grained reasoning optimization framework. The method introduces, for the first time, **self-generated stepwise preference signals**, enabling reinforcement learning–based stepwise supervision and compression of reasoning paths—without auxiliary models or human annotations. Its core contributions are: (1) dynamic generation of step-level preference signals directly from the model’s own reasoning trace; (2) joint optimization of answer accuracy and reasoning path conciseness; and (3) end-to-end CoT compression and self-correction. Experiments demonstrate substantial reductions in reasoning length and consistent improvements in answer accuracy across diverse domains and multilingual benchmarks, confirming strong robustness and generalization.
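The summary describes a DPO-style objective applied per reasoning step rather than per full response. As an illustration only (the paper's exact loss is not given here), a minimal sketch of such a step-wise preference loss might look like the following; `beta`, the tuple layout, and the averaging scheme are all assumptions:

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def stepwise_dpo_loss(chosen, rejected, beta=0.1):
    """DPO-style loss over step-level preference pairs (illustrative sketch).

    `chosen` and `rejected` are lists of (policy_logp, ref_logp) tuples,
    one tuple per reasoning step. Each pair contributes
    -log sigmoid(beta * (chosen margin - rejected margin)),
    where the margin is the policy log-prob minus the reference log-prob.
    This is a generic step-level DPO objective, not SSPO's exact formula.
    """
    losses = []
    for (c_lp, c_ref), (r_lp, r_ref) in zip(chosen, rejected):
        margin = beta * ((c_lp - c_ref) - (r_lp - r_ref))
        losses.append(-math.log(sigmoid(margin)))
    return sum(losses) / len(losses)
```

With a zero margin the loss is log 2, and it shrinks as the policy assigns relatively more probability to the preferred step, which is the usual behavior of a Bradley–Terry-style preference objective.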

📝 Abstract
Test-time scaling has proven effective in further enhancing the performance of pretrained Large Language Models (LLMs). However, mainstream post-training methods (i.e., reinforcement learning (RL) with chain-of-thought (CoT) reasoning) often incur substantial computational overhead due to auxiliary models and overthinking. In this paper, we empirically reveal that incorrect answers partly stem from verbose reasoning processes that lack proper self-correction, where errors accumulate across multiple reasoning steps. To this end, we propose Self-traced Step-wise Preference Optimization (SSPO), a pluggable RL process supervision framework that enables fine-grained optimization of each reasoning step. Specifically, SSPO requires neither auxiliary models nor stepwise manual annotations. Instead, it leverages step-wise preference signals generated by the model itself to guide the optimization process for reasoning compression. Experiments demonstrate that the reasoning sequences generated by SSPO are both accurate and succinct, effectively mitigating overthinking behaviors without compromising model performance across diverse domains and languages.
Problem

Research questions and friction points this paper is trying to address.

Reduces computational overhead in LLM post-training methods
Addresses error accumulation in verbose reasoning processes
Optimizes reasoning steps without auxiliary models or annotations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-traced step-wise preference optimization
No auxiliary models or manual annotations
Step-wise preference signals guide optimization
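The bullets above say the model's own trace supplies the preference signal, trading off answer accuracy against reasoning-path length. One plausible way to sketch that self-scoring, with `lam` and the exact scoring rule being assumptions rather than the paper's formula:

```python
def score_step(step_tokens, answer_correct, lam=0.01):
    """Score one candidate reasoning step from the model's own trace.

    Illustrative sketch: a terminal correctness signal combined with a
    per-token length penalty, so concise steps on correct traces score
    highest. Neither `lam` nor this rule comes from the paper itself.
    """
    return (1.0 if answer_correct else -1.0) - lam * len(step_tokens)


def preference_pairs(candidates):
    """Turn scored candidate steps into (chosen, rejected) pairs by
    pairing the best-scored candidate against every other candidate.
    Each candidate is a dict with at least a "score" key."""
    ranked = sorted(candidates, key=lambda c: c["score"], reverse=True)
    best = ranked[0]
    return [(best, other) for other in ranked[1:]]
```

Pairs built this way could then feed a step-level preference loss, which would match the stated goal of jointly optimizing correctness and conciseness without auxiliary models or manual step labels.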
Yuyang Xu
State Key Laboratory of Transvascular Implantation Devices, The Second Affiliated Hospital Zhejiang University School of Medicine; College of Computer Science and Technology, Zhejiang University; Transvascular Implantation Devices Research Institute; Zhejiang Key Laboratory of Medical Imaging Artificial Intelligence
Yi Cheng
Alibaba Cloud Computing
Haochao Ying
State Key Laboratory of Transvascular Implantation Devices, The Second Affiliated Hospital Zhejiang University School of Medicine; School of Public Health, Zhejiang University; Transvascular Implantation Devices Research Institute
Zhuoyun Du
State Key Lab of CAD & CG, Zhejiang University; Polytechnic Institute, Zhejiang University
Renjun Hu
East China Normal University
Robust ML/AI; LLMs; graph mining
Xing Shi
Alibaba Cloud Computing
Wei Lin
Alibaba Cloud Computing
Jian Wu
State Key Laboratory of Transvascular Implantation Devices, The Second Affiliated Hospital Zhejiang University School of Medicine; School of Public Health, Zhejiang University; Transvascular Implantation Devices Research Institute; Zhejiang Key Laboratory of Medical Imaging Artificial Intelligence