CLORE: Content-Level Optimization for Reasoning Efficiency

📅 2026-05-21
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the tendency of large language models to produce verbose, repetitive, or semantically opaque reasoning traces after reinforcement learning-based post-training. Existing efficient-reasoning approaches mostly constrain response length and leave intermediate reasoning content weakly supervised. The authors propose CLORE, a framework for content-level optimization that locally edits correct on-policy trajectories: an external augmentation model deletes repetitive segments, illegible or task-irrelevant content, and reasoning that continues after the solution is established, while preserving the final answer. The edited and original rollouts form pairs that are trained with an auxiliary reference-free DPO objective alongside standard policy-gradient training. Restricting edits to correct trajectories and to local deletions keeps the edited rollouts close to the policy distribution, which mitigates off-policy mismatch. Experiments on five mathematical reasoning benchmarks show that CLORE improves the accuracy-efficiency trade-off and remains compatible with GRPO, DAPO, Training Efficient, and ThinkPrune, and content-level analyses show reductions in repetitive reasoning, illegible content, and post-answer exploration.
📝 Abstract
Reinforcement learning post-training has improved the reasoning ability of large language models, but often produces unnecessarily long, repetitive, or semantically opaque reasoning traces. Existing efficient reasoning methods mainly regulate response length through explicit budgets or length-aware rewards, leaving intermediate reasoning content weakly supervised. We propose CLORE, a content-level optimization framework that improves reasoning efficiency by editing correct on-policy rollouts. CLORE uses an external augmentation model to delete repetitive segments, illegible or task-irrelevant content, and superfluous reasoning after the solution is established, while preserving the final answer. The resulting augmented-original pairs are optimized with an auxiliary reference-free DPO objective alongside standard policy-gradient training. By restricting augmentation to correct trajectories and performing local deletion, CLORE keeps edited rollouts close to the policy distribution and mitigates off-policy mismatch. Experiments on DeepSeek-R1-Distill-Qwen-7B and Qwen2.5-Math-7B across five mathematical reasoning benchmarks show that CLORE improves the accuracy-efficiency trade-off and remains compatible with GRPO, DAPO, Training Efficient, and ThinkPrune. Content-level analyses further show that CLORE reduces repetitive reasoning, illegible content, and post-answer exploration, supporting content-level supervision as a complementary direction to length-level control.
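The pairing and objective described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `beta` and `aux_weight` parameters, and the exact loss form (a standard DPO log-sigmoid margin with the reference-model term dropped, on summed sequence log-probabilities) are assumptions made for illustration, and the abstract does not say whether the paper length-normalizes the log-probabilities.

```python
import math


def is_deletion_only(original, edited):
    """Check that `edited` is obtainable from `original` by deleting tokens only.

    Hypothetical sanity check for the "local deletion" constraint: the edited
    rollout must be an in-order subsequence of the original rollout.
    """
    remaining = iter(original)
    # `tok in remaining` advances the iterator until it finds `tok`,
    # so token order is enforced.
    return all(tok in remaining for tok in edited)


def reference_free_dpo_loss(logp_chosen, logp_rejected, beta=0.1):
    """-log(sigmoid(beta * (logp_chosen - logp_rejected))), computed stably.

    Assumed form: `logp_chosen` is the policy log-probability of the edited
    (augmented) rollout, `logp_rejected` that of the original rollout.
    """
    margin = beta * (logp_chosen - logp_rejected)
    # softplus(-margin) written to avoid overflow for large |margin|
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def combined_loss(pg_loss, logp_chosen, logp_rejected, aux_weight=1.0, beta=0.1):
    """Policy-gradient loss plus the auxiliary preference term (assumed additive)."""
    return pg_loss + aux_weight * reference_free_dpo_loss(
        logp_chosen, logp_rejected, beta
    )


if __name__ == "__main__":
    original = ["step1", "step1", "answer", "recheck"]
    edited = ["step1", "answer"]
    print(is_deletion_only(original, edited))
    print(reference_free_dpo_loss(-12.0, -20.0))
```

In this sketch the preference term shrinks as the policy assigns more probability to the edited rollout than to the original one, which is the direction the abstract describes. How the auxiliary term is weighted against the policy-gradient loss is left as the hypothetical `aux_weight`.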
Problem

Research questions and friction points this paper is trying to address.

reasoning efficiency
reasoning traces
content-level supervision
repetitive reasoning
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

content-level optimization
reasoning efficiency
trajectory editing
reference-free DPO
reinforcement learning post-training