🤖 AI Summary
To address the excessive computational overhead and low efficiency that lengthy chain-of-thought (CoT) generation causes in complex reasoning tasks, this paper proposes DR.SAF, a Dynamic Reasoning-Boundary Self-Awareness Framework for large language models (LLMs). DR.SAF abandons the conventional paradigm of relying on handcrafted difficulty priors and instead introduces three core mechanisms — Boundary Self-Awareness Alignment, Adaptive Reward Management, and a Boundary Preservation Mechanism — that enable LLMs to dynamically perceive and regulate their own reasoning depth. Experiments demonstrate a 49.27% reduction in total response tokens with minimal accuracy loss, a 6.59× improvement in token efficiency, a 5× decrease in training time, and an accuracy gain of more than 16% under extreme training conditions. The key contribution is endogenous, adaptive perception of reasoning difficulty by the LLM itself, coupled with controllable optimization of its reasoning boundaries.
📝 Abstract
Recent advancements in large language models (LLMs) have greatly improved their capabilities on complex reasoning tasks through Long Chain-of-Thought (CoT). However, this approach often produces substantial redundancy, impairing computational efficiency and causing significant delays in real-time applications. To improve efficiency, current methods often rely on human-defined difficulty priors, which do not align with the LLM's self-perceived difficulty, leading to inefficiencies. In this paper, we introduce the Dynamic Reasoning-Boundary Self-Awareness Framework (DR.SAF), which enables models to dynamically assess and adjust their reasoning depth in response to problem complexity. DR.SAF integrates three key components: Boundary Self-Awareness Alignment, Adaptive Reward Management, and a Boundary Preservation Mechanism. These components allow models to optimize their reasoning processes, balancing efficiency and accuracy without compromising performance. Our experimental results demonstrate that DR.SAF achieves a 49.27% reduction in total response tokens with minimal loss in accuracy. The framework also delivers a 6.59× gain in token efficiency and a 5× reduction in training time, making it well-suited to resource-limited settings. Under extreme training, DR.SAF can even surpass traditional instruction-based models in token efficiency while improving accuracy by more than 16%.
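The abstract does not specify the reward formulation behind Adaptive Reward Management. As a rough, hedged illustration of the general idea — rewarding correct answers while penalizing reasoning that overshoots a per-problem token budget — the sketch below shows one plausible shape such a reward could take. The function name, the linear penalty, and the `alpha` coefficient are all assumptions for illustration, not the paper's actual method:

```python
def length_adaptive_reward(correct: bool, n_tokens: int,
                           budget: int, alpha: float = 0.001) -> float:
    """Toy length-aware reward: full credit for a correct answer,
    minus a linear penalty for tokens spent beyond the budget.

    This is an illustrative sketch, NOT DR.SAF's reward function.
    """
    base = 1.0 if correct else 0.0          # correctness term
    overflow = max(0, n_tokens - budget)    # tokens past the budget
    return base - alpha * overflow          # penalize only the excess


# A concise correct answer keeps its full reward; a verbose one is docked.
print(length_adaptive_reward(True, 150, budget=200))   # within budget
print(length_adaptive_reward(True, 700, budget=200))   # 500 tokens over
```

In an RL fine-tuning loop, a reward of this shape pressures the policy toward shorter chains of thought on easy problems while leaving headroom (via a larger budget) for genuinely hard ones, which matches the paper's stated goal of difficulty-adaptive reasoning depth.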