Optimistic Transfer under Task Shift via Bellman Alignment

📅 2026-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses distributional shift between source and target tasks in online reinforcement learning, which introduces systematic bias when source data are reused naively and undermines regret guarantees. The authors propose re-weighted targeting (RWT), which for the first time formulates task transfer as a one-step Bellman alignment problem. Leveraging a change of measure and a reweighting operator, RWT retargets continuation values and compensates for transition discrepancies. The resulting two-stage RWT Q-learning framework decouples variance reduction from bias correction and, under RKHS function approximation, achieves a regret bound that scales with the complexity of the task shift rather than with the target MDP. Experiments in both tabular and neural-network settings demonstrate that RWT significantly outperforms single-task learning and naive data pooling, validating Bellman alignment as a model-agnostic transfer principle for online RL.
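At the heart of the measure-transformation step is a change-of-measure identity: an expectation of continuation values under the target transition kernel can be rewritten as a reweighted expectation under the source kernel. The minimal numerical sketch below checks that identity on a toy next-state distribution; the names (p_src, p_tgt, v) and the Monte Carlo verification are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6  # toy number of next states

# Next-state distributions under the source and target tasks (assumed known here)
p_src = rng.random(n); p_src /= p_src.sum()
p_tgt = rng.random(n); p_tgt /= p_tgt.sum()
v = rng.random(n)  # continuation values V(s')

# Direct target expectation: E_{s' ~ p_tgt}[V(s')]
direct = p_tgt @ v

# Reweighted source samples: E_{s' ~ p_src}[w(s') V(s')] with w = p_tgt / p_src
samples = rng.choice(n, size=200_000, p=p_src)
w = p_tgt[samples] / p_src[samples]
reweighted = (w * v[samples]).mean()

print(f"direct: {direct:.4f}  reweighted: {reweighted:.4f}")  # close match
```

The two estimates agree up to Monte Carlo error, which is what licenses building target-task Bellman targets out of source-task transitions.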

📝 Abstract
We study online transfer reinforcement learning (RL) in episodic Markov decision processes, where experience from related source tasks is available during learning on a target task. A fundamental difficulty is that task similarity is typically defined in terms of rewards or transitions, whereas online RL algorithms operate on Bellman regression targets. As a result, naively reusing source Bellman updates introduces systematic bias and invalidates regret guarantees. We identify one-step Bellman alignment as the correct abstraction for transfer in online RL and propose re-weighted targeting (RWT), an operator-level correction that retargets continuation values and compensates for transition mismatch via a change of measure. RWT reduces task mismatch to a fixed one-step correction and enables statistically sound reuse of source data. This alignment yields a two-stage RWT Q-learning framework that separates variance reduction from bias correction. Under RKHS function approximation, we establish regret bounds that scale with the complexity of the task shift rather than the target MDP. Empirical results in both tabular and neural network settings demonstrate consistent improvements over single-task learning and naïve pooling, highlighting Bellman alignment as a model-agnostic transfer principle for online RL.
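As a heavily simplified illustration of the two-stage decoupling described in the abstract, the self-contained tabular sketch below first runs Q-learning on reweighted source transitions (variance reduction), then refines on a smaller batch of genuine target transitions (bias correction). The random kernels, the weight clipping, and the 10:1 data ratio are assumptions for illustration only; the paper's actual algorithm uses RKHS function approximation and carries regret guarantees that this toy does not.

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma, alpha = 5, 2, 0.9, 0.1

def kernel():
    """Random transition kernel P[s, a, s'] with rows summing to 1."""
    p = rng.random((nS, nA, nS))
    return p / p.sum(axis=2, keepdims=True)

P_src, P_tgt = kernel(), kernel()   # source- and target-task transitions
R_tgt = rng.random((nS, nA))        # target-task mean rewards
Q = np.zeros((nS, nA))              # Q-table shared across both stages

# Stage 1: variance reduction -- reuse plentiful source transitions,
# retargeted for the target task via a clipped change-of-measure weight.
for _ in range(5000):
    s, a = rng.integers(nS), rng.integers(nA)
    s2 = rng.choice(nS, p=P_src[s, a])
    w = np.clip(P_tgt[s, a, s2] / P_src[s, a, s2], 0.0, 10.0)
    target = R_tgt[s, a] + gamma * w * Q[s2].max()
    Q[s, a] += alpha * (target - Q[s, a])

# Stage 2: bias correction -- a smaller batch of genuine target-task
# transitions refines the estimate with unweighted Bellman targets.
for _ in range(500):
    s, a = rng.integers(nS), rng.integers(nA)
    s2 = rng.choice(nS, p=P_tgt[s, a])
    target = R_tgt[s, a] + gamma * Q[s2].max()
    Q[s, a] += alpha * (target - Q[s, a])
```

The split makes the statistical roles explicit: the abundant but mismatched source data drive down variance, while the scarce target data remove whatever bias the one-step correction leaves behind.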
Problem

Research questions and friction points this paper is trying to address.

transfer reinforcement learning
task shift
Bellman alignment
online RL
regret guarantee
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bellman alignment
transfer reinforcement learning
re-weighted targeting
online RL
task shift
🔎 Similar Papers
No similar papers found.
Jinhang Chai
Department of Operations Research & Financial Engineering, Princeton University

Enpei Zhang
Department of Computer Science, Dartmouth College

Elynn Chen
New York University, Stern School of Business
Factor Models · Matrix/Tensor Time Series · Reinforcement Learning · Information Fusion

Yujun Yan
Assistant Professor @ Dartmouth; PhD from U Michigan
Graph mining · Deep learning