🤖 AI Summary
This work addresses cross-task generalization in reinforcement learning, where agents must rapidly adapt to unseen tasks that share similar dynamics but differ in reward functions. To this end, the authors propose Task Aware Dreamer (TAD), a framework introducing (i) a reward-aware world model and (ii) a task-discriminative variational objective, alongside the Task Distribution Relevance (TDR) metric to quantify inter-task divergence. TAD integrates a variational-inference-based world model, reward-conditioned latent representation learning, Dreamer-style model-based prediction and policy optimization, and a TDR-driven analysis of when reward-informed policies are necessary. Experiments across image-based and state-based multi-task benchmarks demonstrate substantial improvements in simultaneous multi-task training and zero-shot generalization. Notably, TAD significantly outperforms conventional Markovian policies, especially under high-TDR conditions, highlighting its efficacy on reward-divergent task distributions.
📝 Abstract
A long-standing goal of reinforcement learning is to train agents that learn on training tasks and generalize well to unseen tasks that may share similar dynamics but have different reward functions. The ability to generalize across tasks is important because it determines an agent's adaptability to real-world scenarios where reward mechanisms vary. In this work, we first show that training a general world model can exploit the shared structure across these tasks and help train more generalizable agents. Extending world models to the task-generalization setting, we introduce a novel method named Task Aware Dreamer (TAD), which integrates reward-informed features to identify latent characteristics consistent across tasks. Within TAD, we compute the variational lower bound of the sample-data log-likelihood, which introduces a new term designed to differentiate tasks via their states, as the optimization objective of our reward-informed world models. To demonstrate the advantage of the reward-informed policy in TAD, we introduce a new metric, Task Distribution Relevance (TDR), which quantitatively measures the relevance of different tasks. For tasks with high TDR, i.e., tasks that differ significantly, we show that Markovian policies struggle to distinguish them, so reward-informed policies are necessary in TAD. Extensive experiments on both image-based and state-based tasks show that TAD significantly improves performance when handling different tasks simultaneously, especially those with high TDR, and exhibits strong generalization to unseen tasks.
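The abstract's claim that Markovian policies cannot distinguish high-TDR tasks can be illustrated with a minimal sketch. The toy two-armed-bandit tasks, policy functions, and `run_episode` helper below are hypothetical illustrations (not the paper's benchmarks or algorithm): two tasks share identical dynamics (a single state) but have opposite rewards, so a policy that conditions only on the current state must act identically in both, while a policy that conditions on past rewards can infer which task it is in.

```python
import random

def run_episode(task, policy, steps=20):
    """Two tasks with identical dynamics (one state) but opposite rewards:
    task 0 rewards arm 0, task 1 rewards arm 1."""
    history, total = [], 0.0
    for _ in range(steps):
        a = policy(history)
        r = 1.0 if a == task else 0.0
        history.append((a, r))
        total += r
    return total

def markov_policy(history):
    # Sees only the (single, identical) current state, so it cannot tell
    # the tasks apart -- at best 50% average reward across both tasks.
    return random.choice([0, 1])

def reward_informed_policy(history):
    # Infers the task identity from past (action, reward) pairs.
    if not history:
        return 0            # no evidence yet: probe arm 0
    a, r = history[0]       # the first pull reveals the task
    return a if r > 0 else 1 - a
```

Here the reward-informed policy loses at most one step of reward before identifying the task, whereas the Markovian policy averages half the reward on both tasks. TAD operates analogously at scale: its reward-informed world model encodes reward history into the latent state so the policy can disambiguate tasks that are indistinguishable from observations alone.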