OLIVIA: Online Learning via Inference-time Action Adaptation for Decision Making in LLM ReAct Agents

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the loss of efficiency, added latency, and reduced reliability that deployed LLM-based ReAct agents suffer when small action-selection errors accumulate. Existing inference-time adaptation methods rely mainly on prompting or retrieval and expose no explicit, updatable decision mechanism. To remedy this, the paper introduces OLIVIA, an online decision layer for ReAct that models action selection as a contextual linear bandit: frozen LLM hidden states serve as contextual features, and upper confidence bound (UCB) exploration with action-level feedback enables lightweight, uncertainty-aware online updates. The approach preserves the original reasoning pipeline while supporting fine-grained, traceable action-level adaptation, without relying on prompt engineering or retrieval. Experiments on four benchmarks show consistent improvements over static ReAct and prompt-based inference-time baselines, with sample-efficient policy improvement and minimal computational overhead.

📝 Abstract

Large language model agents interleave reasoning, action selection, and observation to solve sequential decision-making tasks. In deployed settings where agents repeatedly handle related multi-step tasks, small action-selection errors can accumulate into wasted tool calls, latency, and reduced reliability. Despite this need for deployment-time improvement, existing inference-time adaptation methods for LLM agents mainly rely on prompting or retrieval, which influence behavior indirectly through context manipulation. For ReAct-style agents, such approaches do not expose an explicit decision layer that can score candidate actions, represent uncertainty, or be updated online from action-level feedback. As a result, they provide limited support for trackable, fine-grained, and uncertainty-aware adaptation during deployment. We propose OLIVIA, an inference-time action adaptation framework for ReAct-style agents. OLIVIA models the LLM's final action-selection layer as a contextual linear bandit over candidate actions, with frozen hidden states as decision contexts. This choice is particularly suitable for deployment because it adapts behavior directly at the action-selection interface, preserves the underlying reasoning process, and provides explicit uncertainty estimates and lightweight online updates from action-level feedback. With upper-confidence-bound exploration, OLIVIA improves the policy sample-efficiently with minimal computational overhead. We instantiate OLIVIA on four benchmarks and show that it consistently improves task performance over static ReAct and prompt-based inference-time baselines. Our results suggest that explicit online decision layers provide an effective alternative to purely prompt- or retrieval-based adaptation for LLM agents during deployment.
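To make the idea of an explicit, updatable decision layer concrete, here is a minimal sketch of a textbook disjoint LinUCB scorer over a fixed set of candidate actions. It is an illustration under stated assumptions, not the paper's implementation: the class name `LinUCBActionLayer`, the hyperparameters `alpha` and `ridge`, the per-action parameterization, and the use of a plain NumPy vector as a stand-in for a frozen LLM hidden state are all hypothetical choices made here, since the abstract does not specify them.

```python
import numpy as np


class LinUCBActionLayer:
    """Disjoint LinUCB over candidate actions (illustrative sketch only).

    Assumption: each action keeps its own ridge-regression statistics and the
    context h is a fixed-size vector standing in for a frozen hidden state.
    """

    def __init__(self, n_actions: int, dim: int, alpha: float = 1.0, ridge: float = 1.0):
        self.alpha = alpha  # exploration weight (hypothetical hyperparameter)
        # A[a] = ridge * I + sum of h h^T over rounds where action a was taken
        self.A = np.stack([ridge * np.eye(dim) for _ in range(n_actions)])
        # b[a] = sum of reward * h over rounds where action a was taken
        self.b = np.zeros((n_actions, dim))

    def scores(self, h: np.ndarray) -> np.ndarray:
        """Return one upper-confidence-bound score per action for context h."""
        out = np.empty(len(self.A))
        for a, (A_a, b_a) in enumerate(zip(self.A, self.b)):
            theta = np.linalg.solve(A_a, b_a)             # estimated reward weights
            width = np.sqrt(h @ np.linalg.solve(A_a, h))  # uncertainty of the estimate
            out[a] = theta @ h + self.alpha * width
        return out

    def select(self, h: np.ndarray) -> int:
        """Pick the action with the highest UCB score."""
        return int(np.argmax(self.scores(h)))

    def update(self, h: np.ndarray, action: int, reward: float) -> None:
        """Online update from action-level feedback; the LLM itself stays frozen."""
        self.A[action] += np.outer(h, h)
        self.b[action] += reward * h


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    layer = LinUCBActionLayer(n_actions=3, dim=8)
    h = rng.normal(size=8)  # placeholder for a frozen hidden state
    a = layer.select(h)
    layer.update(h, a, reward=1.0)  # the reward signal here is made up
    print("chosen action:", a)
```

Each update touches only one action's statistics and costs a rank-one matrix update plus a vector addition, which is one way to read the abstract's claim of lightweight online updates. The confidence width also gives an explicit per-action uncertainty estimate that prompt- or retrieval-based adaptation does not expose.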

Problem

Research questions and friction points this paper is trying to address.

LLM agents

online adaptation

action selection

sequential decision-making

uncertainty-aware

Innovation

Methods, ideas, or system contributions that make the work stand out.

online learning

action adaptation

linear bandit