🤖 AI Summary
This work addresses the credit-assignment difficulty and slow convergence of chain-of-thought reasoning in large language models, which arise from reliance solely on sparse final-reward signals. To this end, we propose Backward Adaptive Reward Shaping (BARS), a human-annotation-free framework that automatically transforms outcome-level rewards into robust, stepwise supervision signals. Methodologically, we establish the first no-regret learning theory tailored to sparse rewards—integrating chain-wise analysis, nonlinear Feynman–Kac bounds, and continuous-scale limits—to derive a dynamic regret bound of $O(\log T)$ over $T$ rounds. By combining backward Euler integration, terminal-prior modeling, and $(\Delta,\varepsilon)$-gap reward design, BARS reaches $\varepsilon$-accuracy within an iteration complexity of $O((R_{\max}/\Delta)\log(1/\varepsilon))$. This work provides the first rigorous theoretical foundation guaranteeing both convergence and computational efficiency for systems such as DeepSeek R1.
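A rough way to see where the iteration complexity comes from is a standard contraction argument. The sketch below is illustrative, not the paper's exact derivation: it assumes the backward Euler update contracts with modulus $\gamma = 1 - \Delta/R_{\max}$ and that the initial value error is at most $R_{\max}$.

```latex
% Assumed: contraction modulus gamma = 1 - Delta/R_max, initial error <= R_max
\|V_k - V^*\| \;\le\; \gamma^{k}\,\|V_0 - V^*\|
\;\le\; \Bigl(1 - \tfrac{\Delta}{R_{\max}}\Bigr)^{k} R_{\max}
\;\le\; \varepsilon
\quad\Longleftarrow\quad
k \;\ge\; \frac{R_{\max}}{\Delta}\,\log\!\frac{R_{\max}}{\varepsilon},
```

using $\log\bigl(1/(1-x)\bigr) \ge x$ for $x \in (0,1)$. Under these assumptions, $k = O\bigl((R_{\max}/\Delta)\log(1/\varepsilon)\bigr)$ iterations suffice.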
📝 Abstract
Chain-of-thought reasoning enables large language models to solve multi-step tasks by framing problem solving as sequential decision problems. Outcome-based rewards, which provide feedback only on final answers, show impressive success, but face challenges with credit assignment and slow convergence. In contrast, procedure-based rewards offer efficient step-level feedback, but typically require costly human supervision. We introduce \emph{Backward Adaptive Reward Shaping} (BARS), a no-regret framework that converts sparse outcome-based rewards into effective procedure-based signals. BARS generates intermediate rewards from terminal-state priors and uses cover trees to scale them while preventing reward exploitation. With Bellman contraction and $(\Delta, \epsilon)$-gap rewards, our backward Euler solver achieves $\epsilon$-accuracy in $O\left((R_{\max}/\Delta)\log(1/\epsilon)\right)$ iterations with $O(\log T)$ dynamic regret over $T$ rounds. Our analysis, based on generic chaining, continuous scaling limits, and nonlinear Feynman–Kac bounds, connects recent outcome-based methods' empirical successes with the benefits of intermediate supervision. Together, these results yield the first rigorous no-regret algorithm for outcome reward shaping, providing a theoretical foundation for the empirical success of DeepSeek's R1.
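The geometric convergence that such iteration bounds rest on can be seen on a toy example. The sketch below is our own illustration, not the BARS algorithm: plain value iteration on a small chain MDP with a sparse terminal reward, where the number of Bellman backups needed for $\epsilon$-accuracy grows like $\log(1/\epsilon)/\log(1/\gamma)$.

```python
import numpy as np

def value_iteration(P, r, gamma, eps, max_iter=10_000):
    """Iterate the Bellman backup until successive iterates differ by < eps.

    P[a] is the transition matrix for action a; r[a, s] is the reward for
    taking action a in state s. The backup is a gamma-contraction in the
    sup norm, so the error shrinks geometrically.
    """
    V = np.zeros(r.shape[1])
    for k in range(1, max_iter + 1):
        # Bellman backup: best action's reward + discounted expected value
        V_new = np.max(r + gamma * (P @ V), axis=0)
        if np.max(np.abs(V_new - V)) < eps:
            return V_new, k
        V = V_new
    return V, max_iter

# Toy 3-state chain, 2 actions; reward is sparse (only in the final state).
P = np.array([
    [[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.0, 0.0, 1.0]],  # action 0: "stay"
    [[0.1, 0.9, 0.0], [0.1, 0.0, 0.9], [0.0, 0.1, 0.9]],  # action 1: "advance"
])
r = np.array([[0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0]])

V, iters = value_iteration(P, r, gamma=0.9, eps=1e-6)
print(V, iters)  # iteration count scales like log(1/eps) / log(1/gamma)
```

In the absorbing terminal state the optimal value is exactly $1/(1-\gamma) = 10$, and the solver reaches it to within the stopping tolerance after on the order of $\log(1/\epsilon)/\log(1/\gamma) \approx 131$ backups.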