Freshness-Aware Prioritized Experience Replay for LLM/VLM Reinforcement Learning

📅 2026-04-18

🤖 AI Summary
This work addresses the challenge of applying prioritized experience replay (PER) to reinforcement learning with large language models (LLMs) and vision-language models (VLMs), where rapidly evolving policies render stored priorities stale, so that old high-priority trajectories keep dominating sampling after they have stopped being informative. The authors propose an experience replay method that multiplies any PER-based priority by an exponential age decay, keeping the priorities of historical trajectories in step with fast policy updates. Compatible with policy gradient algorithms such as PPO and GRPO, the approach is evaluated on 0.5B, 3B, and 7B models and outperforms on-policy baselines across eight multi-step agentic, reasoning, and math competition tasks, with gains of 46% on NQ Search, 367% on Sokoban, and 133% on VLM FrozenLake, whereas standard PER without age decay consistently degrades performance.

📝 Abstract
Reinforcement Learning (RL) has achieved impressive success in post-training Large Language Models (LLMs) and Vision-Language Models (VLMs), with on-policy algorithms such as PPO, GRPO, and REINFORCE++ serving as the dominant paradigm. However, these methods discard all collected trajectories after a single gradient update, resulting in poor sample efficiency, which is particularly wasteful for agentic tasks where multi-turn environment interactions are expensive. While Experience Replay drives sample efficiency in classic RL by allowing agents to reuse past trajectories and prioritize informative ones, directly applying Prioritized Experience Replay (PER) to LLMs fails. The rapid policy evolution of billion-parameter models renders stored priorities stale, causing old high-priority trajectories to dominate sampling long after they have become uninformative. We propose Freshness-Aware PER, which addresses this priority staleness problem by augmenting any PER-based priority with a multiplicative exponential age decay grounded in effective sample size analysis. To the best of our knowledge, Freshness-Aware PER is the first work to successfully apply PER to LLM/VLM reinforcement learning. We evaluate on eight multi-step agentic, reasoning, and math competition tasks with 0.5B, 3B, and 7B models. Freshness-Aware PER significantly outperforms on-policy baselines, achieving +46% on NQ Search, +367% on Sokoban, and +133% on VLM FrozenLake, while standard PER without age decay consistently degrades performance. Our code is publicly available at https://github.com/Vision-CAIR/Freshness-Aware-PER.
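
The multiplicative age decay described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the names `ReplayEntry`, `fresh_priority`, `sampling_probs`, and `sample_batch`, the decay rate of 0.1, and the choice to measure age in policy-update steps are all assumptions made for illustration. The abstract only states that a PER-based priority is multiplied by an exponential function of age.

```python
import math
import random
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class ReplayEntry:
    trajectory: Any       # stored rollout (hypothetical container)
    base_priority: float  # any PER-style priority; its definition is an assumption
    step_added: int       # policy-update step when the trajectory was collected


def fresh_priority(entry: ReplayEntry, current_step: int, decay_rate: float = 0.1) -> float:
    """Base priority multiplied by an exponential decay in the entry's age."""
    age = current_step - entry.step_added
    return entry.base_priority * math.exp(-decay_rate * age)


def sampling_probs(buffer: List[ReplayEntry], current_step: int,
                   decay_rate: float = 0.1) -> List[float]:
    """Normalize age-decayed priorities into sampling probabilities."""
    if not buffer:
        return []
    weights = [fresh_priority(e, current_step, decay_rate) for e in buffer]
    total = sum(weights)
    if total <= 0.0:
        # All priorities are zero: fall back to uniform sampling.
        return [1.0 / len(buffer)] * len(buffer)
    return [w / total for w in weights]


def sample_batch(buffer: List[ReplayEntry], current_step: int, batch_size: int,
                 decay_rate: float = 0.1,
                 rng: Optional[random.Random] = None) -> List[ReplayEntry]:
    """Draw a batch (with replacement) according to the decayed priorities."""
    if not buffer:
        return []
    rng = rng or random.Random()
    probs = sampling_probs(buffer, current_step, decay_rate)
    return rng.choices(buffer, weights=probs, k=batch_size)
```

Under these assumed settings, a trajectory with base priority 10 collected 50 steps ago receives weight 10 · exp(−5) ≈ 0.067, which is below a freshly collected trajectory with base priority 1. Without the decay factor the old trajectory would be sampled ten times as often, which is the staleness problem the abstract describes.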
Problem

Research questions and friction points this paper is trying to address.

Prioritized Experience Replay
Large Language Models
Vision-Language Models
Reinforcement Learning
Sample Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Freshness-Aware PER
Prioritized Experience Replay
Sample Efficiency
LLM Reinforcement Learning
Priority Staleness