🤖 AI Summary
This work addresses the challenge that large language model (LLM) agents, whose weights are frozen after deployment, struggle to continuously adapt to new tasks, while conventional reinforcement learning methods are computationally expensive and prone to catastrophic forgetting. To this end, the authors propose JitRL, a just-in-time reinforcement learning framework that is the first to achieve test-time policy optimization without gradient updates. JitRL employs a non-parametric memory to dynamically store experiences and retrieves relevant trajectories during inference to estimate action advantages, which directly modulate the LLM's logits. Theoretically, this additive update is shown to be the exact closed-form solution of policy optimization under a KL constraint. JitRL establishes a new state of the art among training-free methods on the WebArena and Jericho benchmarks, outperforming fine-tuning approaches such as WebRL while cutting monetary costs by over 30×.
📝 Abstract
While Large Language Model (LLM) agents excel at general tasks, they inherently struggle with continual adaptation because their weights are frozen after deployment. Conventional reinforcement learning (RL) offers a solution but incurs prohibitive computational costs and risks catastrophic forgetting. We introduce Just-In-Time Reinforcement Learning (JitRL), a training-free framework that enables test-time policy optimization without any gradient updates. JitRL maintains a dynamic, non-parametric memory of experiences and retrieves relevant trajectories to estimate action advantages on the fly. These estimates are then used to directly modulate the LLM's output logits. We theoretically prove that this additive update rule is the exact closed-form solution to the KL-constrained policy optimization objective. Extensive experiments on WebArena and Jericho demonstrate that JitRL establishes a new state of the art among training-free methods. Crucially, JitRL outperforms computationally expensive fine-tuning methods (e.g., WebRL) while reducing monetary costs by over 30 times, offering a scalable path toward continual learning agents. The code is available at https://github.com/liushiliushi/JitRL.
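The additive logit update described in the abstract can be sketched numerically. The snippet below is a minimal illustration with made-up logits, made-up advantage values, and a hypothetical KL coefficient `beta`; JitRL's actual memory retrieval and advantage estimation are not shown. It checks the stated equivalence: adding `A/beta` to the frozen policy's logits recovers the closed-form optimum `pi*(a) ∝ pi_ref(a) * exp(A(a)/beta)` of the KL-constrained objective `max_pi E_pi[A] - beta * KL(pi || pi_ref)`.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D logit vector."""
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical values for illustration only.
logits = np.array([2.0, 0.5, -1.0])      # frozen LLM's logits (pi_ref)
advantages = np.array([0.3, -0.2, 0.1])  # memory-derived advantage estimates
beta = 0.5                               # KL-penalty strength

pi_ref = softmax(logits)

# Closed-form optimum of the KL-constrained objective:
# pi*(a) proportional to pi_ref(a) * exp(A(a) / beta)
pi_star = pi_ref * np.exp(advantages / beta)
pi_star /= pi_star.sum()

# Equivalent training-free view: additive update in logit space.
pi_additive = softmax(logits + advantages / beta)

assert np.allclose(pi_star, pi_additive)
```

The assertion holds because softmax is invariant to the shared normalizer: multiplying probabilities by `exp(A/beta)` is the same as adding `A/beta` to logits, which is why the method needs no gradient step.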