🤖 AI Summary
This work studies meta-reinforcement learning (meta-RL) for finite-horizon Markov decision processes (MDPs), assuming that the optimal action-value functions across tasks share a linear structure $Q^*_h(s,a)=\Phi_h(s,a)\,\theta^{(k)}_h$, with task-specific parameters $\theta^{(k)}_h$ drawn from a Gaussian meta-prior $\mathcal{N}(\theta^*_h,\Sigma^*_h)$. To enable efficient cross-task knowledge transfer, we propose two Thompson sampling–based algorithms: MTSRL, which learns the prior mean, and its enhanced variant $\text{MTSRL}^{+}$, which additionally estimates the prior covariance and applies covariance widening to control finite-sample estimation error. A prior-alignment technique couples the posterior under the learned prior with a meta-oracle that knows the true prior, yielding the first $\tilde{O}(H^4 S^{3/2}\sqrt{ANK})$ meta-regret bound for meta-RL with learned priors. Both algorithms integrate randomized value functions, warm-starting via RLSVI, and ordinary least-squares (OLS) aggregation across tasks. Empirical evaluation on recommendation systems demonstrates substantial improvements over prior-independent RL and bandit-only meta-baselines.
📝 Abstract
We study meta-reinforcement learning in finite-horizon MDPs where related tasks share similar structure in their optimal action-value functions. Specifically, we posit a linear representation $Q^*_h(s,a)=\Phi_h(s,a)\,\theta^{(k)}_h$ and place a Gaussian meta-prior $\mathcal{N}(\theta^*_h,\Sigma^*_h)$ over the task-specific parameters $\theta^{(k)}_h$. Building on randomized value functions, we propose two Thompson-style algorithms: (i) MTSRL, which learns only the prior mean and performs posterior sampling with the learned mean and known covariance; and (ii) $\text{MTSRL}^{+}$, which additionally estimates the covariance and employs prior widening to control finite-sample estimation error. Further, we develop a prior-alignment technique that couples the posterior under the learned prior with a meta-oracle that knows the true prior, yielding meta-regret guarantees: we match prior-independent Thompson sampling in the small-task regime and strictly improve with more tasks once the prior is learned. Concretely, for known covariance we obtain $\tilde{O}(H^{4}S^{3/2}\sqrt{ANK})$ meta-regret, and with learned covariance $\tilde{O}(H^{4}S^{3/2}\sqrt{AN^{3}K})$; these improve on prior-independent behavior once $K \gtrsim \tilde{O}(H^{2})$ and $K \gtrsim \tilde{O}(N^{2}H^{2})$, respectively. Simulations on a stateful recommendation environment (with feature and prior misspecification) show that, after brief exploration, MTSRL and $\text{MTSRL}^{+}$ track the meta-oracle and substantially outperform prior-independent RL and bandit-only meta-baselines. Our results give the first meta-regret guarantees for Thompson-style RL with learned Q-priors, and provide practical recipes (warm-starting via RLSVI, OLS aggregation, covariance widening) for experiment-rich settings.
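The prior-learning recipe sketched in the abstract (OLS aggregation of per-task parameter estimates into a learned prior mean, covariance estimation with widening, then Thompson-style sampling for a new task) can be illustrated in a few lines. This is a minimal sketch under simplifying assumptions, not the paper's algorithm: the per-task estimates $\hat\theta^{(k)}$ are simulated as noisy draws rather than produced by RLSVI within each task, a single stage $h$ is shown, and the widening schedule `log(K)/sqrt(K)` is an illustrative choice standing in for the paper's prior-widening term.

```python
import numpy as np

rng = np.random.default_rng(0)

d, K = 4, 50                       # feature dimension, number of past tasks
theta_star = rng.normal(size=d)    # true meta-prior mean (unknown to the learner)
Sigma_star = 0.25 * np.eye(d)      # true meta-prior covariance

# Per-task point estimates of theta^{(k)} (in the paper these would come from
# within-task estimation, e.g. RLSVI warm-starts); simulated here as noisy
# draws around the true prior mean.
theta_hats = rng.multivariate_normal(
    theta_star, Sigma_star + 0.1 * np.eye(d), size=K
)

# OLS aggregation across tasks: the learned prior mean is the sample average.
theta_bar = theta_hats.mean(axis=0)

# Learned covariance plus "widening": inflate by a term that shrinks as K
# grows, so the posterior is never overconfident due to finite-sample error.
Sigma_hat = np.cov(theta_hats, rowvar=False)
Sigma_widened = Sigma_hat + (np.log(K) / np.sqrt(K)) * np.eye(d)

# Thompson-style step for a new task: sample parameters from the learned,
# widened prior and act greedily with respect to the sampled Q-function.
theta_sample = rng.multivariate_normal(theta_bar, Sigma_widened)
```

As $K$ grows, `theta_bar` concentrates around the true prior mean and the widening term vanishes, which is the mechanism behind the improvement over prior-independent Thompson sampling once $K$ is large enough.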