Mixture-of-Experts Meets In-Context Reinforcement Learning

📅 2025-06-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenges of modeling multimodal state-action-reward data and poor generalization across heterogeneous decision-making tasks in In-Context Reinforcement Learning (ICRL), this paper proposes T2MIR, a framework that brings Mixture-of-Experts (MoE) architectures into transformer-based decision models. T2MIR introduces a dual-path MoE design: a token-level path captures fine-grained intra-sequence patterns across modalities, while a task-level path models cross-task semantic differences. It further incorporates a contrastive task-routing mechanism based on mutual information maximization to mitigate multi-task gradient interference and enhance task awareness. Empirically, T2MIR achieves state-of-the-art performance on benchmark multi-task ICRL environments, with significant improvements in cross-task generalization over existing methods. The implementation is publicly available.

📝 Abstract
In-context reinforcement learning (ICRL) has emerged as a promising paradigm for adapting RL agents to downstream tasks through prompt conditioning. However, two notable challenges remain in fully harnessing in-context learning within RL domains: the intrinsic multi-modality of the state-action-reward data and the diverse, heterogeneous nature of decision tasks. To tackle these challenges, we propose T2MIR (Token- and Task-wise MoE for In-context RL), an innovative framework that introduces architectural advances of mixture-of-experts (MoE) into transformer-based decision models. T2MIR substitutes the feedforward layer with two parallel layers: a token-wise MoE that captures distinct semantics of input tokens across multiple modalities, and a task-wise MoE that routes diverse tasks to specialized experts for managing a broad task distribution with alleviated gradient conflicts. To enhance task-wise routing, we introduce a contrastive learning method that maximizes the mutual information between the task and its router representation, enabling more precise capture of task-relevant information. The outputs of the two MoE components are concatenated and fed into the next layer. Comprehensive experiments show that T2MIR significantly facilitates in-context learning capacity and outperforms various types of baselines. We bring the potential and promise of MoE to ICRL, offering a simple and scalable architectural enhancement to advance ICRL one step closer toward achievements in language and vision communities. Our code is available at https://github.com/NJU-RL/T2MIR.
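The abstract's dual-path layer can be sketched in plain Python. This is an illustrative simplification, not the authors' implementation: the function and parameter names (`moe_layer`, `token_gate_w`, `task_gate_w`) are hypothetical, experts are arbitrary callables, and the task-wise path uses dense routing for brevity. The token-wise path selects a sparse top-k of experts per token, the task-wise path weights experts by a task embedding, and the two outputs are concatenated as the abstract describes.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def moe_layer(token, task_embedding, token_experts, task_experts,
              token_gate_w, task_gate_w, top_k=2):
    """Dual-path MoE sketch: a token-wise path with sparse top-k gating
    and a task-wise path routed on a task embedding; outputs concatenated."""
    # Token-wise path: gate scores come from the token itself; keep top-k experts.
    scores = [dot(w, token) for w in token_gate_w]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    gate = softmax([scores[i] for i in top])
    token_out = [0.0] * len(token)
    for g, i in zip(gate, top):
        expert_out = token_experts[i](token)
        token_out = [o + g * e for o, e in zip(token_out, expert_out)]
    # Task-wise path: gate scores come from the task embedding (dense here).
    t_gate = softmax([dot(w, task_embedding) for w in task_gate_w])
    task_out = [0.0] * len(token)
    for g, expert in zip(t_gate, task_experts):
        expert_out = expert(token)
        task_out = [o + g * e for o, e in zip(task_out, expert_out)]
    # Concatenate the two paths before the next layer.
    return token_out + task_out

# Tiny usage example: three experts that just scale their input.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0)]
gates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = moe_layer([1.0, 0.0], [0.5, 0.5], experts, experts, gates, gates, top_k=2)
```

In a real transformer layer the experts would be feedforward sub-networks and the gates learned linear projections; here they are stand-ins to show the routing logic.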
Problem

Research questions and friction points this paper is trying to address.

Addresses multi-modality in state-action-reward data for ICRL
Handles diverse heterogeneous decision tasks in reinforcement learning
Enhances in-context learning with token- and task-wise MoE architecture
Innovation

Methods, ideas, or system contributions that make the work stand out.

Token-wise MoE captures multi-modal token semantics
Task-wise MoE routes tasks to specialized experts
Contrastive learning enhances task-wise routing precision
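The contrastive routing idea above (maximizing mutual information between a task and its router representation) is commonly realized with an InfoNCE objective: the matched router/task pair is the positive, other tasks in the batch are negatives. The sketch below is a generic InfoNCE loss in plain Python, an assumption about the form of the objective rather than the paper's exact loss; the names `info_nce`, `router_reprs`, and `task_reprs` are hypothetical.

```python
import math

def info_nce(router_reprs, task_reprs, temperature=0.1):
    """InfoNCE lower bound on mutual information: for each router
    representation, the same-index task representation is the positive
    and all other tasks in the batch act as negatives."""
    def cosine(a, b):
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return sum(x * y for x, y in zip(a, b)) / (na * nb)

    loss = 0.0
    for i, r in enumerate(router_reprs):
        logits = [cosine(r, t) / temperature for t in task_reprs]
        m = max(logits)  # stabilize the log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += -(logits[i] - log_denom)  # negative log-softmax of the positive
    return loss / len(router_reprs)

# Aligned representations give a near-zero loss; misaligned ones a large loss.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
misaligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing this loss pushes each router representation toward its own task's representation and away from the others, which is one standard way to sharpen task-wise expert routing.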