🤖 AI Summary
This work addresses two challenges in offline meta-reinforcement learning: broad task distributions and ambiguous Markov Decision Processes (MDPs) often lead to out-of-distribution (OOD) action extrapolation errors and feature overgeneralization. To tackle these issues, the authors propose FLORA and give the first formal definition of the “feature overgeneralization” problem. FLORA decomposes Q-values to disentangle features from weights, employs invertible flow models to accurately capture complex task distributions, and introduces a return-feedback-driven adaptive feature-correction mechanism that mitigates OOD bias in the offline setting. Experimental results demonstrate that FLORA significantly outperforms existing baselines across multiple environments, achieving faster meta-policy adaptation, higher returns, and improved policy stability.
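To make the two core ideas concrete, the following minimal sketch illustrates (a) a decomposed Q-value $Q(s,a) = \phi(s,a) \cdot w$ with separate feature and weight components, and (b) one simple way to "model the feature distribution and estimate uncertainty": fitting a Gaussian to in-dataset features and scoring new samples by Mahalanobis distance. Everything here (the linear feature map, the Gaussian model, all dimensions) is an illustrative assumption, not FLORA's actual architecture, in which the encoder is learned.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical decomposition Q(s, a) = phi(s, a) . w: a feature component
# phi and a weight component w. In the paper phi is a learned encoder; here
# it is a fixed random linear projection purely for illustration.
D_IN, D_FEAT = 8, 4
W_proj = rng.normal(size=(D_IN, D_FEAT))
w = rng.normal(size=D_FEAT)                      # weight component

def phi(sa: np.ndarray) -> np.ndarray:
    """Feature component of the decomposed Q-value (illustrative)."""
    return sa @ W_proj

def q_value(sa: np.ndarray) -> float:
    return float(phi(sa) @ w)

# Stand-in for "modeling the feature distribution": fit a Gaussian to the
# features of in-dataset samples and score new samples by Mahalanobis
# distance, flagging far-away features as OOD.
feats = phi(rng.normal(size=(2000, D_IN)))       # dataset-sample features
mu = feats.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(feats, rowvar=False))

def ood_score(sa: np.ndarray) -> float:
    d = phi(sa) - mu
    return float(np.sqrt(d @ cov_inv @ d))

in_dist = rng.normal(size=D_IN)    # drawn from the data distribution
far_ood = 10.0 * np.ones(D_IN)     # far outside the data support
print(ood_score(in_dist) < ood_score(far_ood))   # OOD sample scores higher
```

A real corrective mechanism would then down-weight or adjust the feature component for high-scoring samples; the return-feedback rule FLORA uses for that adjustment is not reproduced here.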
📝 Abstract
Offline meta-reinforcement learning (OMRL) combines offline RL's ability to learn from diverse datasets with meta-RL's rapid adaptation to new tasks, promising safe and efficient knowledge acquisition by RL agents. However, OMRL still suffers from extrapolation errors caused by out-of-distribution (OOD) actions, a problem compounded by the broad task distributions and Markov Decision Process (MDP) ambiguity of meta-RL setups. Existing research indicates that the generalization of the $Q$ network affects the extrapolation error in offline RL. This paper investigates this relationship by decomposing the $Q$ value into feature and weight components, observing that while decomposition enhances adaptability and convergence with high-quality data, it often leads to policy degeneration or collapse in complex tasks. We observe that decomposed $Q$ values introduce a large estimation bias when the feature component encounters OOD samples, a phenomenon we term ``feature overgeneralization''. To address this issue, we propose FLORA, which identifies OOD samples by modeling feature distributions and estimating their uncertainties. FLORA integrates a return-feedback mechanism to adaptively adjust the feature component. Furthermore, to learn precise task representations, FLORA explicitly models the complex task distribution using a chain of invertible transformations. We theoretically and empirically demonstrate that FLORA achieves faster adaptation and stronger meta-policy improvement than baselines across various environments.
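The "chain of invertible transformations" mentioned above is the normalizing-flow idea: compose layers whose inverses are available in closed form, so a simple base distribution can be mapped to a complex task distribution and back exactly. The sketch below shows a generic affine-coupling flow with fixed random parameters; it is a standard construction, not the paper's specific architecture, and the conditioner weights stand in for what would normally be a trained network.

```python
import numpy as np

rng = np.random.default_rng(1)
DIM, HALF = 4, 2

class AffineCoupling:
    """One invertible layer: x2 -> x2 * exp(s(x1)) + t(x1), x1 unchanged."""
    def __init__(self):
        # Fixed random "conditioner" weights standing in for a learned net.
        self.Ws = 0.5 * rng.normal(size=(HALF, HALF))
        self.Wt = 0.5 * rng.normal(size=(HALF, HALF))

    def _scale_shift(self, x1):
        return np.tanh(x1 @ self.Ws), x1 @ self.Wt

    def forward(self, x):
        x1, x2 = x[:HALF], x[HALF:]
        s, t = self._scale_shift(x1)
        return np.concatenate([x1, x2 * np.exp(s) + t])

    def inverse(self, y):
        y1, y2 = y[:HALF], y[HALF:]
        s, t = self._scale_shift(y1)     # y1 == x1, so s and t are recoverable
        return np.concatenate([y1, (y2 - t) * np.exp(-s)])

class Permute:
    """Swap halves so both halves get transformed across layers."""
    def forward(self, x):
        return np.concatenate([x[HALF:], x[:HALF]])
    inverse = forward                    # the swap is its own inverse

flow = [AffineCoupling(), Permute(), AffineCoupling(), Permute(), AffineCoupling()]

z = rng.normal(size=DIM)                 # sample from a simple base density
x = z
for layer in flow:                       # forward: base -> task representation
    x = layer.forward(x)
z_rec = x
for layer in reversed(flow):             # exact inverse: representation -> base
    z_rec = layer.inverse(z_rec)
print(np.allclose(z, z_rec))             # round trip recovers the input
```

Because each layer is invertible with a tractable Jacobian, the density of the transformed sample can be computed exactly via the change-of-variables formula, which is what makes flows suitable for explicitly modeling a complex task distribution.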