Learning to Reason as Action Abstractions with Scalable Mid-Training RL

📅 2025-09-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses action-space redundancy and slow convergence in the mid-training phase of reinforcement learning (RL) for large language models (LLMs). It proposes Reasoning as Action Abstractions (RA3), a framework that models reasoning as learnable action abstractions, constructing compact, temporally consistent decision representations in a latent space. Theoretically, RA3 provides the first characterization of how mid-training dynamics shape post-training performance. Methodologically, it derives a sequential variational lower bound and optimizes it by iteratively discovering latent structural priors via RL and fine-tuning on the bootstrapped data, which makes mid-training scalable. On the code generation benchmarks HumanEval and MBPP, RA3 improves average performance by 8 points over the base model and 4 points over the next-token prediction baseline. Under the RLVR setting, it also converges significantly faster while reaching higher final task performance.
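The "sequential variational lower bound" is not written out in this summary. As a hedged sketch only, a standard evidence lower bound over latent abstraction sequences $z_{1:T}$ (with prompt $x$ and output $y$) would take the generic form below; the paper's exact bound and factorization may differ.

```latex
\log p_\theta(y \mid x)
\;\ge\;
\mathbb{E}_{q_\phi(z_{1:T} \mid x, y)}
  \big[ \log p_\theta(y \mid z_{1:T}, x) \big]
\;-\;
\mathrm{KL}\!\big( q_\phi(z_{1:T} \mid x, y) \,\big\|\, p_\theta(z_{1:T} \mid x) \big)
```

Optimizing the first term encourages abstractions that explain good outputs, while the KL term keeps the discovered latent structure close to a learnable prior, which is consistent with the summary's "latent structural priors."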

📝 Abstract
Large language models excel with reinforcement learning (RL), but fully unlocking this potential requires a mid-training stage. An effective mid-training phase should identify a compact set of useful actions and enable fast selection among them through online RL. We formalize this intuition by presenting the first theoretical result on how mid-training shapes post-training: it characterizes an action subspace that minimizes both the value approximation error from pruning and the RL error during subsequent planning. Our analysis reveals two key determinants of mid-training effectiveness: pruning efficiency, which shapes the prior of the initial RL policy, and its impact on RL convergence, which governs the extent to which that policy can be improved via online interactions. These results suggest that mid-training is most effective when the decision space is compact and the effective horizon is short, highlighting the importance of operating in the space of action abstractions rather than primitive actions. Building on these insights, we propose Reasoning as Action Abstractions (RA3), a scalable mid-training algorithm. Specifically, we derive a sequential variational lower bound and optimize it by iteratively discovering temporally-consistent latent structures via RL, followed by fine-tuning on the bootstrapped data. Experiments on code generation tasks demonstrate the effectiveness of our approach. Across multiple base models, RA3 improves the average performance on HumanEval and MBPP by 8 and 4 points over the base model and the next-token prediction baseline. Furthermore, RA3 achieves faster convergence and higher asymptotic performance in RLVR on HumanEval+, MBPP+, LiveCodeBench, and Codeforces.
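The abstract's optimization loop, iteratively discovering temporally consistent latent structures via RL and then fine-tuning on the bootstrapped data, can be sketched as a toy skeleton. This is a minimal illustration under assumed names (`ra3_midtraining`, `reward_fn`, a scalar `policy_bias` standing in for policy parameters), not the paper's actual implementation.

```python
import random

def ra3_midtraining(prompts, reward_fn, n_iters=3, n_samples=8, seed=0):
    """Toy sketch of the RA3-style iterate:
    (1) discovery: sample candidate latent abstraction sequences and
        keep the highest-reward one per prompt (RL-style selection);
    (2) fine-tuning: update the policy on the bootstrapped pairs.
    All names are illustrative, not the paper's API."""
    rng = random.Random(seed)
    dataset = []          # bootstrapped (prompt, trajectory) pairs
    policy_bias = 0.0     # stand-in for the policy's parameters
    for _ in range(n_iters):
        for prompt in prompts:
            # Discovery step: sample latent abstraction sequences from
            # the current policy and rank them by reward.
            candidates = [[rng.random() + policy_bias for _ in range(4)]
                          for _ in range(n_samples)]
            best = max(candidates, key=lambda z: reward_fn(prompt, z))
            dataset.append((prompt, best))
        # Fine-tuning step: shift the policy toward the bootstrapped
        # trajectories (here, a trivial mean update).
        policy_bias = sum(sum(z) / len(z) for _, z in dataset) / len(dataset)
    return policy_bias, dataset
```

The key structural point the sketch preserves is the alternation: discovery produces data under the current policy, and fine-tuning on that data changes what the next round of discovery samples.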
Problem

Research questions and friction points this paper is trying to address.

Developing scalable mid-training reinforcement learning for action abstractions
Optimizing action subspace to minimize pruning and planning errors
Improving code generation performance through temporally-consistent latent structures
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mid-training RL discovers action abstractions for reasoning
Sequential variational bound optimizes latent structures iteratively
Scalable algorithm improves code generation across multiple benchmarks