BASIS: Batchwise Advantage Estimation from Single-Rollout Information Sharing for LLM Reasoning

2026-05-26
AI Summary
This work addresses the tradeoff between computational efficiency and sample efficiency in reinforcement learning for improving the reasoning of large language models. It proposes BASIS, a critic-free post-training algorithm that samples a single rollout per prompt at each training step and improves value estimation by sharing information across the prompts in a batch. Experiments show a 69% reduction in the mean squared error of value estimates relative to REINFORCE++, and a lower error with one rollout than group-mean estimators obtain with 8 rollouts. With substantially less training time, BASIS performs close to multi-rollout GRPO-type baselines and often outperforms single-rollout REINFORCE-type baselines.
Abstract
Reinforcement learning with verifiable rewards has become a standard recipe for improving the reasoning abilities of large language models. Existing algorithms face a tradeoff between computational efficiency and sample efficiency in value estimation and policy learning. We introduce BASIS, a critic-free post-training algorithm designed to address this tradeoff. At each online training step, BASIS samples only one rollout per prompt, but leverages rich information across prompts in the entire batch to improve value function estimation. Our experiments demonstrate that BASIS reduces MSE in value function estimation by 69% compared to REINFORCE++, a representative single-rollout baseline, and achieves lower MSE with one rollout than group mean estimators with 8 rollouts. This improvement in value estimation translates to better policy optimization: using substantially less training time, BASIS achieves performance close to multi-rollout GRPO-type baselines and often outperforms single-rollout REINFORCE-type baselines.
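To make the value-estimation comparison concrete, the sketch below measures the MSE of two simple baselines on synthetic data: a batch-mean baseline built from one rollout per prompt (in the spirit of single-rollout REINFORCE-type methods) and a per-prompt group mean over 8 rollouts (as in GRPO-type methods). It does not implement BASIS, whose estimator is not specified in the abstract. The Beta-distributed success probabilities, the binary reward model, and all function names are assumptions made purely for illustration.

```python
import random


def value_mse(estimates, true_values):
    """Mean squared error between estimated and true per-prompt values."""
    return sum((e - v) ** 2 for e, v in zip(estimates, true_values)) / len(true_values)


def simulate(num_prompts=512, group_size=8, seed=0):
    rng = random.Random(seed)

    # Assumption: each prompt has a hidden success probability, which is also
    # its true value under a binary verifiable reward.
    true_values = [rng.betavariate(2.0, 2.0) for _ in range(num_prompts)]

    # rewards[i][k] is the 0/1 reward of rollout k for prompt i.
    rewards = [
        [1.0 if rng.random() < p else 0.0 for _ in range(group_size)]
        for p in true_values
    ]

    # Single-rollout regime: only the first rollout of each prompt is used,
    # and every prompt shares one baseline, the batch mean reward.
    single = [row[0] for row in rewards]
    batch_mean = sum(single) / num_prompts
    batch_baseline = [batch_mean] * num_prompts

    # Multi-rollout regime: each prompt gets its own group-mean baseline.
    group_baseline = [sum(row) / group_size for row in rewards]

    # Advantages in the single-rollout regime: reward minus shared baseline.
    advantages = [r - batch_mean for r in single]

    return {
        "batch_mean_1_rollout": value_mse(batch_baseline, true_values),
        "group_mean_8_rollouts": value_mse(group_baseline, true_values),
        "advantages": advantages,
    }


if __name__ == "__main__":
    result = simulate()
    print("MSE, batch mean (1 rollout):  ", result["batch_mean_1_rollout"])
    print("MSE, group mean (8 rollouts): ", result["group_mean_8_rollouts"])
```

In this toy setting the shared batch mean ignores differences between prompts, so its error is dominated by the spread of the true values, while the group mean pays 8 times the sampling cost to reduce per-prompt noise. The abstract's claim is that sharing information across prompts can close this gap while still sampling one rollout per prompt.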
Problem

Research questions and friction points this paper is trying to address.

reinforcement learning
value estimation
sample efficiency
computational efficiency
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

BASIS
single-rollout
batchwise advantage estimation
value function estimation
LLM reasoning