Optimistic Transfer under Task Shift via Bellman Alignment

📅 2026-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses distributional shift between source and target tasks in online reinforcement learning, which introduces systematic bias when source data are reused naively and undermines regret guarantees. The authors propose re-weighted targeting (RWT), which for the first time formulates task transfer as a one-step Bellman alignment problem. Leveraging a change of measure and a reweighting operator, RWT retargets continuation values and compensates for transition discrepancies. The resulting two-stage RWT Q-learning framework decouples variance reduction from bias correction and, under RKHS function approximation, achieves a regret bound that scales with the complexity of the task shift rather than with the target MDP. Experiments in both tabular and neural-network settings demonstrate that RWT significantly outperforms single-task learning and naive data pooling, validating Bellman alignment as a model-agnostic transfer principle for online RL.
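At the heart of the measure-transformation step is a change-of-measure identity: an expectation of continuation values under the target transition kernel can be rewritten as a reweighted expectation under the source kernel. The minimal numerical sketch below checks that identity on a toy next-state distribution; the names (p_src, p_tgt, v) and the Monte Carlo verification are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6  # toy number of next states

# Next-state distributions under the source and target tasks (assumed known here)
p_src = rng.random(n); p_src /= p_src.sum()
p_tgt = rng.random(n); p_tgt /= p_tgt.sum()
v = rng.random(n)  # continuation values V(s')

# Direct target expectation: E_{s' ~ p_tgt}[V(s')]
direct = p_tgt @ v

# Reweighted source samples: E_{s' ~ p_src}[w(s') V(s')] with w = p_tgt / p_src
samples = rng.choice(n, size=200_000, p=p_src)
w = p_tgt[samples] / p_src[samples]
reweighted = (w * v[samples]).mean()

print(f"direct: {direct:.4f}  reweighted: {reweighted:.4f}")  # close match
```

The two estimates agree up to Monte Carlo error, which is what licenses building target-task Bellman targets out of source-task transitions.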

📝 Abstract
We study online transfer reinforcement learning (RL) in episodic Markov decision processes, where experience from related source tasks is available during learning on a target task. A fundamental difficulty is that task similarity is typically defined in terms of rewards or transitions, whereas online RL algorithms operate on Bellman regression targets. As a result, naively reusing source Bellman updates introduces systematic bias and invalidates regret guarantees. We identify one-step Bellman alignment as the correct abstraction for transfer in online RL and propose re-weighted targeting (RWT), an operator-level correction that retargets continuation values and compensates for transition mismatch via a change of measure. RWT reduces task mismatch to a fixed one-step correction and enables statistically sound reuse of source data. This alignment yields a two-stage RWT Q-learning framework that separates variance reduction from bias correction. Under RKHS function approximation, we establish regret bounds that scale with the complexity of the task shift rather than the target MDP. Empirical results in both tabular and neural network settings demonstrate consistent improvements over single-task learning and naïve pooling, highlighting Bellman alignment as a model-agnostic transfer principle for online RL.
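As a heavily simplified illustration of the two-stage decoupling described in the abstract, the self-contained tabular sketch below first runs Q-learning on reweighted source transitions (variance reduction), then refines on a smaller batch of genuine target transitions (bias correction). The random kernels, the weight clipping, and the 10:1 data ratio are assumptions for illustration only; the paper's actual algorithm uses RKHS function approximation and carries regret guarantees that this toy does not.

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma, alpha = 5, 2, 0.9, 0.1

def kernel():
    """Random transition kernel P[s, a, s'] with rows summing to 1."""
    p = rng.random((nS, nA, nS))
    return p / p.sum(axis=2, keepdims=True)

P_src, P_tgt = kernel(), kernel()   # source- and target-task transitions
R_tgt = rng.random((nS, nA))        # target-task mean rewards
Q = np.zeros((nS, nA))              # Q-table shared across both stages

# Stage 1: variance reduction -- reuse plentiful source transitions,
# retargeted for the target task via a clipped change-of-measure weight.
for _ in range(5000):
    s, a = rng.integers(nS), rng.integers(nA)
    s2 = rng.choice(nS, p=P_src[s, a])
    w = np.clip(P_tgt[s, a, s2] / P_src[s, a, s2], 0.0, 10.0)
    target = R_tgt[s, a] + gamma * w * Q[s2].max()
    Q[s, a] += alpha * (target - Q[s, a])

# Stage 2: bias correction -- a smaller batch of genuine target-task
# transitions refines the estimate with unweighted Bellman targets.
for _ in range(500):
    s, a = rng.integers(nS), rng.integers(nA)
    s2 = rng.choice(nS, p=P_tgt[s, a])
    target = R_tgt[s, a] + gamma * Q[s2].max()
    Q[s, a] += alpha * (target - Q[s, a])
```

The split makes the statistical roles explicit: the abundant but mismatched source data drive down variance, while the scarce target data remove whatever bias the one-step correction leaves behind.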
Problem

Research questions and friction points this paper is trying to address.

transfer reinforcement learning
task shift
Bellman alignment
online RL
regret guarantee
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bellman alignment
transfer reinforcement learning
re-weighted targeting
online RL
task shift
🔎 Similar Papers
No similar papers found.
Jinhang Chai
Department of Operations Research & Financial Engineering, Princeton University

Enpei Zhang
Department of Computer Science, Dartmouth College

Elynn Chen
New York University, Stern School of Business
Factor Models · Matrix/Tensor Time Series · Reinforcement Learning · Information Fusion

Yujun Yan
Assistant Professor @ Dartmouth; PhD from U Michigan
Graph mining · Deep learning