🤖 AI Summary
This work addresses the inefficiency of language reasoning models that, during inference, often generate excessively long reasoning traces due to "overthinking," leading to increased latency and memory overhead. Existing uniform length-penalization strategies tend to truncate crucial early reasoning steps and fail to account for varying problem difficulty. To overcome these limitations, we propose PACE, a novel framework that introduces prefix preservation at the sequence level to retain effective reasoning paths and incorporates difficulty-aware length penalization at the group level to dynamically modulate compression intensity. PACE is the first approach to jointly leverage prefix protection and difficulty awareness, enabling hierarchical supervision across two compression levels. Evaluated on the DeepSeek-R1-Distill-Qwen models, PACE reduces token usage by up to 55.7% on mathematical benchmarks while improving accuracy by as much as 4.1%, and demonstrates strong generalization across code, scientific, and general-domain tasks.
📝 Abstract
Language Reasoning Models (LRMs) achieve strong performance by scaling test-time computation but often suffer from "overthinking", producing excessively long reasoning traces that increase latency and memory usage. Existing LRMs typically enforce conciseness with uniform length penalties, which over-compress crucial early deduction steps at the sequence level and indiscriminately penalize all queries at the group level. To address these limitations, we propose PACE, a dual-level framework for prefix-protected and difficulty-aware compression under hierarchical supervision. At the sequence level, prefix-protected optimization employs decaying mixed rollouts to maintain valid reasoning paths while promoting conciseness. At the group level, a difficulty-aware penalty dynamically scales length constraints based on query complexity, preserving exploration for harder questions while curbing redundancy on easier ones. Extensive experiments on DeepSeek-R1-Distill-Qwen (1.5B/7B) demonstrate that PACE achieves a substantial reduction in token usage (up to 55.7%) while simultaneously improving accuracy (up to 4.1%) on math benchmarks, and generalizes to code, science, and general domains.
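The group-level idea above can be sketched in code. The snippet below is a minimal, illustrative reward-shaping sketch, not PACE's actual formulation: it assumes the group's rollout pass rate is used as an easiness proxy, so that easy queries (high pass rate) receive a stronger length penalty and hard queries (low pass rate) are left free to explore. All function names, the penalty form, and the `alpha` coefficient are hypothetical.

```python
def group_length_penalty(correct, lengths, alpha=0.1):
    """Difficulty-aware length penalty for one group of rollouts.

    `correct`: list of 0/1 correctness flags for each rollout.
    `lengths`: token counts of the corresponding rollouts.
    The penalty is scaled by the group pass rate (easiness proxy):
    harder queries -> weaker penalty, preserving exploration.
    """
    n = len(lengths)
    pass_rate = sum(correct) / n          # easiness proxy for this query
    scale = alpha * pass_rate             # easy queries penalized more
    mean_len = sum(lengths) / n
    # Penalize only rollouts longer than the group mean, relative to it.
    return [scale * max(0.0, (L - mean_len) / mean_len) for L in lengths]

def shaped_rewards(correct, lengths, alpha=0.1):
    """Correctness reward minus the difficulty-scaled length penalty."""
    penalties = group_length_penalty(correct, lengths, alpha)
    return [float(c) - p for c, p in zip(correct, penalties)]

# An easy query (all rollouts correct): longer rollouts lose reward.
print(shaped_rewards([1, 1, 1, 1], [100, 200, 300, 400]))
# A hard query (all rollouts wrong): no length pressure at all.
print(shaped_rewards([0, 0, 0, 0], [100, 200, 300, 400]))
```

The key design point mirrored here is that the penalty strength is a function of the group, not a global constant, so conciseness pressure is applied only where the model already solves the query reliably.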