🤖 AI Summary
Existing reinforcement learning generalization relies on high-quality samples or environment pre-exploration, entailing prohibitive supervision costs and poor scalability to unseen tasks. This paper proposes a zero-shot policy learning framework that directly generates actions from natural language instructions, requiring neither task annotations nor environmental pre-exploration. The approach introduces three key innovations: (1) the first language-to-decision contrastive pre-training paradigm; (2) a dynamics-aware generalized world model enabling cross-modal alignment between textual semantics and environment dynamics; and (3) an integrated architecture combining multi-task world-model encoding, a text-conditioned policy network, and a CLIP-style alignment mechanism. Evaluated on the MuJoCo and Meta-World benchmarks, the method substantially outperforms supervised fine-tuning, instruction-tuning, and imitation-learning baselines, achieving, for the first time, genuine text-driven zero-shot generalization across diverse robotic control tasks.
📝 Abstract
RL systems usually tackle generalization by inferring task beliefs from high-quality samples or warmup explorations. This restricted form limits their generality and usability, since such supervision signals are expensive, or even infeasible, to acquire in advance for unseen tasks. Learning directly from raw text describing decision tasks is a promising alternative that leverages a much broader source of supervision. In this paper, we propose Text-to-Decision Agent (T2DA), a simple and scalable framework that supervises generalist policy learning with natural language. We first introduce a generalized world model to encode multi-task decision data into a dynamics-aware embedding space. Then, inspired by CLIP, we predict which textual description goes with which decision embedding, effectively bridging their semantic gap via contrastive language-decision pre-training and aligning the text embeddings with the environment dynamics. After training the text-conditioned generalist policy, the agent can directly perform zero-shot text-to-decision generation in response to language instructions. Comprehensive experiments on the MuJoCo and Meta-World benchmarks show that T2DA achieves high-capacity zero-shot generalization and outperforms various types of baselines.
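To make the CLIP-style alignment step concrete, the following is a minimal NumPy sketch of a symmetric contrastive objective over a batch of paired text and decision embeddings. It is an illustration of the general technique the abstract describes, not T2DA's actual implementation: the encoder architectures, the temperature value, and the function name `contrastive_alignment_loss` are all assumptions for this example.

```python
import numpy as np

def contrastive_alignment_loss(text_emb, dec_emb, temperature=0.07):
    """CLIP-style symmetric contrastive loss.

    text_emb, dec_emb: (B, D) arrays; row i of each is a matched
    (task description, decision trajectory) embedding pair.
    """
    # L2-normalize so similarities are cosine similarities.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    d = dec_emb / np.linalg.norm(dec_emb, axis=1, keepdims=True)
    logits = t @ d.T / temperature          # (B, B) similarity matrix
    labels = np.arange(len(t))              # matched pairs on the diagonal

    def cross_entropy(lg):
        # Numerically stable log-softmax over each row.
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(lg)), labels].mean()

    # Average the text->decision and decision->text directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Minimizing this loss pulls each text embedding toward the decision embedding of the same task and pushes it away from the other tasks in the batch, which is how the text embeddings come to reflect environment dynamics.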