🤖 AI Summary
This work addresses the credit-assignment difficulty and slow convergence of chain-of-thought reasoning in large language models, which arise from reliance solely on sparse final-reward signals. To this end, we propose Backward Adaptive Reward Shaping (BARS), a human-annotation-free framework that automatically transforms outcome-level rewards into robust, stepwise supervision signals. Methodologically, we establish the first no-regret learning theory tailored to sparse rewards—integrating chain-wise analysis, nonlinear Feynman–Kac bounds, and continuous-scale limits—to derive a dynamic regret bound of $O(\log T)$ over $T$ rounds. By combining backward Euler integration, terminal-prior modeling, and $(\Delta,\varepsilon)$-gap reward design, BARS reaches $\varepsilon$-accuracy within an iteration complexity of $O((R_{\max}/\Delta)\log(1/\varepsilon))$. This work provides the first rigorous theoretical foundation guaranteeing both convergence and computational efficiency for systems such as DeepSeek R1.
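A rough way to see where the iteration complexity comes from is a standard contraction argument. The sketch below is illustrative, not the paper's exact derivation: it assumes the backward Euler update contracts with modulus $\gamma = 1 - \Delta/R_{\max}$ and that the initial value error is at most $R_{\max}$.

```latex
% Assumed: contraction modulus gamma = 1 - Delta/R_max, initial error <= R_max
\|V_k - V^*\| \;\le\; \gamma^{k}\,\|V_0 - V^*\|
\;\le\; \Bigl(1 - \tfrac{\Delta}{R_{\max}}\Bigr)^{k} R_{\max}
\;\le\; \varepsilon
\quad\Longleftarrow\quad
k \;\ge\; \frac{R_{\max}}{\Delta}\,\log\!\frac{R_{\max}}{\varepsilon},
```

using $\log\bigl(1/(1-x)\bigr) \ge x$ for $x \in (0,1)$. Under these assumptions, $k = O\bigl((R_{\max}/\Delta)\log(1/\varepsilon)\bigr)$ iterations suffice.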
📝 Abstract
Chain-of-thought reasoning enables large language models to solve multi-step tasks by framing problem solving as sequential decision problems. Outcome-based rewards, which provide feedback only on final answers, show impressive success, but face challenges with credit assignment and slow convergence. In contrast, procedure-based rewards offer efficient step-level feedback, but typically require costly human supervision. We introduce \emph{Backward Adaptive Reward Shaping} (BARS), a no-regret framework that converts sparse outcome-based rewards into effective procedure-based signals. BARS generates intermediate rewards from terminal-state priors and uses cover trees to scale them while preventing reward exploitation. With Bellman contraction and $(\Delta, \epsilon)$-gap rewards, our backward Euler solver achieves $\epsilon$-accuracy in $O\left((R_{\max}/\Delta)\log(1/\epsilon)\right)$ iterations with $O(\log T)$ dynamic regret over $T$ rounds. Our analysis, based on generic chaining, continuous scaling limits, and nonlinear Feynman–Kac bounds, connects recent outcome-based methods' empirical successes with the benefits of intermediate supervision. Together, these results yield the first rigorous no-regret algorithm for outcome reward shaping, providing a theoretical foundation for the empirical success of DeepSeek's R1.
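The geometric convergence that such iteration bounds rest on can be seen on a toy example. The sketch below is our own illustration, not the BARS algorithm: plain value iteration on a small chain MDP with a sparse terminal reward, where the number of Bellman backups needed for $\epsilon$-accuracy grows like $\log(1/\epsilon)/\log(1/\gamma)$.

```python
import numpy as np

def value_iteration(P, r, gamma, eps, max_iter=10_000):
    """Iterate the Bellman backup until successive iterates differ by < eps.

    P[a] is the transition matrix for action a; r[a, s] is the reward for
    taking action a in state s. The backup is a gamma-contraction in the
    sup norm, so the error shrinks geometrically.
    """
    V = np.zeros(r.shape[1])
    for k in range(1, max_iter + 1):
        # Bellman backup: best action's reward + discounted expected value
        V_new = np.max(r + gamma * (P @ V), axis=0)
        if np.max(np.abs(V_new - V)) < eps:
            return V_new, k
        V = V_new
    return V, max_iter

# Toy 3-state chain, 2 actions; reward is sparse (only in the final state).
P = np.array([
    [[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.0, 0.0, 1.0]],  # action 0: "stay"
    [[0.1, 0.9, 0.0], [0.1, 0.0, 0.9], [0.0, 0.1, 0.9]],  # action 1: "advance"
])
r = np.array([[0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0]])

V, iters = value_iteration(P, r, gamma=0.9, eps=1e-6)
print(V, iters)  # iteration count scales like log(1/eps) / log(1/gamma)
```

In the absorbing terminal state the optimal value is exactly $1/(1-\gamma) = 10$, and the solver reaches it to within the stopping tolerance after on the order of $\log(1/\epsilon)/\log(1/\gamma) \approx 131$ backups.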