Online Episodic Convex Reinforcement Learning

📅 2025-05-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper studies Concave Utility Reinforcement Learning (CURL): online optimization of a convex objective over the state-action distributions induced by the agent's policy in finite-horizon Markov decision processes, relaxing the standard linear-loss assumption. Because the objective is nonlinear and depends on the global state-action distribution, the Bellman equation fails and classical RL methods do not apply. To address this, the authors propose a model-free algorithm combining online mirror descent with varying constraint sets and a carefully designed exploration bonus. They also introduce, for the first time in CURL, a bandit feedback setting in which only the value of the objective function is observed. The algorithm achieves near-optimal regret under full information and sub-linear regret under bandit feedback, in both cases without prior knowledge of the transition dynamics, making this the first general framework for online CURL with such guarantees.

📝 Abstract
We study online learning in episodic finite-horizon Markov decision processes (MDPs) with convex objective functions, known as the concave utility reinforcement learning (CURL) problem. This setting generalizes RL from linear to convex losses on the state-action distribution induced by the agent's policy. The non-linearity of CURL invalidates classical Bellman equations and requires new algorithmic approaches. We introduce the first algorithm achieving near-optimal regret bounds for online CURL without any prior knowledge on the transition function. To achieve this, we use an online mirror descent algorithm with varying constraint sets and a carefully designed exploration bonus. We then address for the first time a bandit version of CURL, where the only feedback is the value of the objective function on the state-action distribution induced by the agent's policy. We achieve a sub-linear regret bound for this more challenging problem by adapting techniques from bandit convex optimization to the MDP setting.
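To illustrate why CURL breaks the additive-reward structure, the toy sketch below (the MDP dimensions, transition kernel, and entropy objective are all illustrative choices, not taken from the paper) computes the occupancy measure of a fixed policy in a tiny finite-horizon MDP and evaluates a convex utility on that distribution as a whole, rather than as a sum of per-step rewards:

```python
import numpy as np

# Toy 2-state, 2-action, horizon-3 MDP. The occupancy measure mu[h, s, a]
# is the probability the policy visits (s, a) at step h. A CURL objective
# F(mu) (here: negative entropy, which is convex) is evaluated on mu as a
# whole, so it cannot be decomposed into per-step rewards.
S, A, H = 2, 2, 3
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s'] transition kernel
              [[0.5, 0.5], [0.7, 0.3]]])
pi = np.full((H, S, A), 0.5)              # uniform policy pi_h(a | s)
d = np.array([1.0, 0.0])                  # initial state distribution

mu = np.zeros((H, S, A))
for h in range(H):
    mu[h] = d[:, None] * pi[h]            # mu_h(s, a) = d_h(s) * pi_h(a | s)
    d = np.einsum('sa,sap->p', mu[h], P)  # propagate to the next step

F = np.sum(mu * np.log(mu + 1e-12))       # convex utility of the whole mu
```

Evaluating `F` requires the full occupancy measure `mu`, which is exactly why value-based dynamic programming does not apply here.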
Problem

Research questions and friction points this paper is trying to address.

Extends RL to convex losses in episodic MDPs
Solves CURL without transition function knowledge
Addresses bandit CURL with sub-linear regret
Innovation

Methods, ideas, or system contributions that make the work stand out.

Online mirror descent with varying constraints
Exploration bonus for near-optimal regret
Bandit convex optimization in MDPs
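A minimal sketch of the online-mirror-descent template the paper builds on, here with an entropic mirror map on the probability simplex and a KL objective (both are illustrative assumptions; the paper's varying constraint sets, exploration bonus, and feedback model are substantially more involved):

```python
import numpy as np

def omd_step(mu, grad, lr):
    """One entropic mirror-descent (exponentiated-gradient) step on the simplex.

    mu   : current distribution (nonnegative, sums to 1)
    grad : gradient of the convex loss F at mu
    lr   : learning rate
    """
    w = mu * np.exp(-lr * grad)
    return w / w.sum()

# Example convex loss: F(mu) = KL(mu || target), whose gradient is
# log(mu / target) + 1. Repeated OMD steps drive mu toward target.
target = np.array([0.5, 0.3, 0.2])
mu = np.full(3, 1.0 / 3.0)
for _ in range(200):
    grad = np.log(mu / target) + 1.0
    mu = omd_step(mu, grad, lr=0.5)
```

With the entropic mirror map, each step multiplies the iterate by an exponentiated negative gradient and renormalizes, which keeps `mu` a valid distribution without an explicit projection.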
Bianca Marin Moreno
Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France
Khaled Eldowa
PhD Student, University of Milan
Online Learning · Reinforcement Learning
Pierre Gaillard
INRIA
Margaux Brégère
EDF Lab, 7 bd Gaspard Monge, 91120 Palaiseau, France; Sorbonne Université LPSM, Paris, France
Nadia Oudjane
EDF Lab, 7 bd Gaspard Monge, 91120 Palaiseau, France