Kalman Filter Enhanced GRPO for Reinforcement Learning-Based Language Model Reasoning

📅 2025-05-12

📈 Citations: 0

✨ Influential: 0

career value

189K/year

🤖 AI Summary

In reinforcement learning for language models, high-noise rewards induce bias in GRPO’s advantage estimation. To address this, we propose Kalman-GRPO—the first method to incorporate lightweight Kalman filtering into advantage estimation. It dynamically and recursively estimates the intra-batch reward mean and variance, replacing the static batch-mean baseline with adaptive advantage normalization. Crucially, it introduces no additional learnable parameters, preserving computational efficiency. Evaluated across diverse reasoning tasks, Kalman-GRPO significantly improves training stability and policy optimization accuracy: advantage estimation variance decreases by 32%, average reasoning accuracy increases by 4.7%, and convergence accelerates by 21%. Our core contribution lies in adapting classical state estimation—specifically Kalman filtering—to advantage modeling in LLM-based RL, establishing a novel, noise-robust paradigm for policy optimization.

Technology Category

Application Category

📝 Abstract

Reward baseline is important for Reinforcement Learning (RL) algorithms to reduce variance in policy gradient estimates. Recently, for language modeling, Group Relative Policy Optimization (GRPO) is proposed to compute the advantage for each output by subtracting the mean reward, as the baseline, for all outputs in the group. However, it can lead to inaccurate advantage estimates in environments with highly noisy rewards, potentially introducing bias. In this work, we propose a model, called Kalman Filter Enhanced Group Relative Policy Optimization (KRPO), by using lightweight Kalman filtering to dynamically estimate the latent reward mean and variance. This filtering technique replaces the naive batch mean baseline, enabling more adaptive advantage normalization. Our method does not require additional learned parameters over GRPO. This approach offers a simple yet effective way to incorporate multiple outputs of GRPO into advantage estimation, improving policy optimization in settings where highly dynamic reward signals are difficult to model for language models. Through experiments and analyses, we show that using a more adaptive advantage estimation model, KRPO can improve the stability and performance of GRPO. The code is available at https://github.com/billhhh/KRPO_LLMs_RL

Problem

Research questions and friction points this paper is trying to address.

Improving advantage estimation in noisy reward environments

Enhancing policy optimization for language models

Reducing bias in GRPO with dynamic reward estimation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Kalman Filter dynamically estimates reward mean

Lightweight filtering replaces naive batch mean

No additional parameters over GRPO required

🔎 Similar Papers

No similar papers found.