Offline Meta-Reinforcement Learning with Flow-Based Task Inference and Adaptive Correction of Feature Overgeneralization

📅 2026-01-12
🏛️ arXiv.org
📈 Citations: 0 (Influential: 0)
🤖 AI Summary
This work addresses challenges in offline meta-reinforcement learning, where broad task distributions and ambiguous Markov decision processes (MDPs) often lead to out-of-distribution (OOD) action extrapolation errors and feature overgeneralization. To tackle these issues, the authors propose FLORA, a method that formally defines the "feature overgeneralization" problem for the first time. FLORA decomposes Q-values to disentangle features from weights and employs invertible flow models to accurately capture complex task distributions. It also introduces a return-feedback-driven adaptive feature correction mechanism that mitigates OOD bias in offline settings. Experimental results show that FLORA significantly outperforms existing baselines across multiple environments, achieving faster meta-policy adaptation, higher returns, and improved policy stability.
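The Q-value decomposition described above can be sketched minimally as a dot product between a learned feature map and a task-specific weight vector, i.e. Q(s, a) = φ(s, a)·w. The one-layer feature network below is an illustrative assumption, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(state, action, W1, b1):
    """Feature component: a small nonlinear map over the (state, action) pair."""
    x = np.concatenate([state, action])
    return np.tanh(W1 @ x + b1)

state_dim, action_dim, feat_dim = 4, 2, 8
W1 = rng.normal(size=(feat_dim, state_dim + action_dim)) * 0.1
b1 = np.zeros(feat_dim)
w = rng.normal(size=feat_dim)      # weight component, adapted per task

s = rng.normal(size=state_dim)
a = rng.normal(size=action_dim)
q_value = phi(s, a, W1, b1) @ w    # decomposed Q-value estimate
```

The point of the split is that the feature φ(s, a) can generalize (or, per the paper, overgeneralize) independently of the task weights w, which is what makes the feature component the natural target for correction.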

📝 Abstract
Offline meta-reinforcement learning (OMRL) combines the strengths of learning from diverse datasets in offline RL with the adaptability to new tasks of meta-RL, promising safe and efficient knowledge acquisition by RL agents. However, OMRL still suffers from extrapolation errors due to out-of-distribution (OOD) actions, compounded by broad task distributions and Markov Decision Process (MDP) ambiguity in meta-RL setups. Existing research indicates that the generalization of the $Q$ network affects the extrapolation error in offline RL. This paper investigates this relationship by decomposing the $Q$ value into feature and weight components, observing that while decomposition enhances adaptability and convergence given high-quality data, it often leads to policy degeneration or collapse in complex tasks. We observe that decomposed $Q$ values introduce a large estimation bias when the feature encounters OOD samples, a phenomenon we term "feature overgeneralization". To address this issue, we propose FLORA, which identifies OOD samples by modeling feature distributions and estimating their uncertainties. FLORA integrates a return feedback mechanism to adaptively adjust feature components. Furthermore, to learn precise task representations, FLORA explicitly models the complex task distribution using a chain of invertible transformations. We theoretically and empirically demonstrate that FLORA achieves rapid adaptation and meta-policy improvement compared to baselines across various environments.
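The "chain of invertible transformations" in the abstract is the defining structure of a normalizing flow: a base sample is pushed through invertible layers whose log-determinants are tracked, and exact inversion maps data back to the base space. The elementwise affine layers below are a deliberately simple stand-in, not the paper's flow architecture:

```python
import numpy as np

# Each layer z -> z * exp(s) + t is invertible, with log|det J| = sum(s).
def forward(z, layers):
    log_det = 0.0
    for s, t in layers:
        z = z * np.exp(s) + t
        log_det += np.sum(s)
    return z, log_det

def inverse(x, layers):
    for s, t in reversed(layers):
        x = (x - t) * np.exp(-s)
    return x

rng = np.random.default_rng(1)
dim = 3
layers = [(rng.normal(size=dim) * 0.1, rng.normal(size=dim) * 0.1)
          for _ in range(4)]

z0 = rng.normal(size=dim)          # base sample (e.g., standard normal)
x, log_det = forward(z0, layers)   # push forward to task-representation space
z_rec = inverse(x, layers)         # exact inversion recovers the base sample
```

Because every layer is invertible with a tractable Jacobian, the flow yields exact densities over task representations, which is what lets FLORA model a complex, possibly multimodal task distribution rather than assuming a single Gaussian posterior.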
Problem

Research questions and friction points this paper is trying to address.

offline meta-reinforcement learning
out-of-distribution actions
feature overgeneralization
extrapolation error
task representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Offline Meta-Reinforcement Learning
Feature Overgeneralization
Flow-Based Task Inference
Adaptive Feature Correction
Out-of-Distribution Detection
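The adaptive feature correction listed above can be pictured as a loop that tracks feature statistics, scores how far a feature lies from the in-distribution mean, shrinks OOD features toward that mean, and tunes the shrink strength from return feedback. The class names, update rules, and gating function below are all illustrative assumptions, not the paper's algorithm:

```python
import numpy as np

class FeatureCorrector:
    """Hypothetical sketch: uncertainty-gated shrinkage with return feedback."""

    def __init__(self, feat_dim, alpha=1.0):
        self.mean = np.zeros(feat_dim)
        self.var = np.ones(feat_dim)
        self.count = 0
        self.alpha = alpha                  # correction strength

    def update_stats(self, feat):
        """Running mean/variance of in-distribution features (Welford-style)."""
        self.count += 1
        delta = feat - self.mean
        self.mean += delta / self.count
        self.var += (delta * (feat - self.mean) - self.var) / self.count

    def correct(self, feat):
        # Normalized squared distance as a crude OOD / uncertainty score.
        score = np.mean((feat - self.mean) ** 2 / (self.var + 1e-8))
        gate = np.exp(-self.alpha * score)  # shrink OOD features toward the mean
        return self.mean + gate * (feat - self.mean)

    def feedback(self, return_delta, lr=0.1):
        # If returns dropped, correct more aggressively; if they rose, relax.
        self.alpha = max(0.0, self.alpha - lr * return_delta)
```

The design intuition matches the summary: features that look in-distribution pass through nearly unchanged (gate near 1), while OOD features are pulled toward the data mean, and observed returns decide how aggressive that pull should be.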
👥 Authors
Min Wang
Beijing Institute of Technology
Xin Li
Beijing Institute of Technology; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University
Mingzhong Wang
University of the Sunshine Coast
Machine learning, Mobile computing
Hasnaa Bennis
Beijing Institute of Technology