EVPO: Explained Variance Policy Optimization for Adaptive Critic Utilization in LLM Post-Training

📅 2026-04-21
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses a trade-off in large language model post-training: a learned critic used as a reinforcement learning baseline can reduce variance, but under sparse rewards it may inject more noise than a simple batch-mean baseline would. The authors formulate baseline selection as a Kalman filtering problem, unifying PPO and GRPO as the two extremes of the Kalman gain, and establish the first theoretical link between explained variance (EV) and advantage estimation variance. Using an EV metric computable from a single training batch, they propose an adaptive strategy that switches between critic-based and batch-mean baselines, with per-step variance provably no greater than the better of the two and a theoretically derived switching threshold of zero that further analysis finds empirically optimal. Experiments show that the resulting method, EVPO, consistently outperforms both PPO and GRPO across four tasks spanning classical control, agentic interaction, and mathematical reasoning, with its adaptive gating tracking critic maturation during training.

📝 Abstract
Reinforcement learning (RL) for LLM post-training faces a fundamental design choice: whether to use a learned critic as a baseline for policy optimization. Classical theory favors critic-based methods such as PPO for variance reduction, yet critic-free alternatives like GRPO have gained widespread adoption due to their simplicity and competitive performance. We show that in sparse-reward settings, a learned critic can inject estimation noise that exceeds the state signal it captures, increasing rather than reducing advantage variance. By casting baseline selection as a Kalman filtering problem, we unify PPO and GRPO as two extremes of the Kalman gain and prove that explained variance (EV), computable from a single training batch, identifies the exact boundary: positive EV indicates the critic reduces variance, while zero or negative EV signals that it inflates variance. Building on this insight, we propose Explained Variance Policy Optimization (EVPO), which monitors batch-level EV at each training step and adaptively switches between critic-based and batch-mean advantage estimation, provably achieving no greater variance than the better of the two at every step. Across four tasks spanning classical control, agentic interaction, and mathematical reasoning, EVPO consistently outperforms both PPO and GRPO regardless of which fixed baseline is stronger on a given task. Further analysis confirms that the adaptive gating tracks critic maturation over training and that the theoretically derived zero threshold is empirically optimal.
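The gating rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the standard definition EV = 1 - Var(R - V) / Var(R) over one batch, uses plain returns minus a baseline as the advantage (no GAE, no group normalization), and the names `explained_variance`, `gated_advantages`, and `threshold` are hypothetical.

```python
from statistics import fmean, pvariance
from typing import List, Sequence, Tuple


def explained_variance(returns: Sequence[float], values: Sequence[float]) -> float:
    """Batch-level EV, assumed here to be 1 - Var(returns - values) / Var(returns)."""
    var_returns = pvariance(returns)
    if var_returns == 0.0:
        # Degenerate batch (all returns equal): treat the critic as uninformative.
        return 0.0
    residuals = [r - v for r, v in zip(returns, values)]
    return 1.0 - pvariance(residuals) / var_returns


def gated_advantages(
    returns: Sequence[float],
    values: Sequence[float],
    threshold: float = 0.0,  # zero threshold, as stated in the abstract
) -> Tuple[List[float], bool]:
    """Return (advantages, used_critic) for one batch.

    If EV exceeds the threshold, subtract the critic's value predictions;
    otherwise fall back to subtracting the batch-mean return.
    """
    use_critic = explained_variance(returns, values) > threshold
    if use_critic:
        advantages = [r - v for r, v in zip(returns, values)]
    else:
        mean_return = fmean(returns)
        advantages = [r - mean_return for r in returns]
    return advantages, use_critic


if __name__ == "__main__":
    sparse_returns = [1.0, 0.0, 1.0, 0.0]
    # A critic that tracks the returns has positive EV, so it is used.
    print(gated_advantages(sparse_returns, [0.9, 0.1, 0.9, 0.1]))
    # A critic that is anti-correlated has negative EV, so the batch mean is used.
    print(gated_advantages(sparse_returns, [0.0, 1.0, 0.0, 1.0]))
```

In this simplified setting the rule is easy to check by hand: the critic-based advantage has variance Var(R - V) and the batch-mean advantage has variance Var(R), so EV > 0 holds exactly when the critic-based estimate has the lower variance.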
Problem

Research questions and friction points this paper is trying to address.

reinforcement learning
LLM post-training
critic utilization
variance reduction
sparse-reward
Innovation

Methods, ideas, or system contributions that make the work stand out.

Explained Variance
Policy Optimization
Adaptive Critic
Kalman Filtering
Reinforcement Learning
Chengjun Pan
Peking University
Shichun Liu
Fudan University
NLP
Jiahang Lin
Fudan University
Dingwei Zhu
Fudan University
Jiazheng Zhang
Fudan University
Large Language Model, Natural Language Processing, Data Mining
Shihan Dou
Fudan University
LLMs, Code LMs, RL, Alignment
Songyang Gao
Shanghai AI Lab
Zhenhua Han
Shanghai Qiji Zhifeng Co., Ltd.
Binghai Wang
Fudan University
Rui Zheng
Shanghai Qiji Zhifeng Co., Ltd.
Xuanjing Huang
Fudan University
Tao Gui
Fudan University
Yansong Feng
Peking University
Natural Language Processing, Pattern Recognition