🤖 AI Summary
Existing reinforcement learning generalization relies on high-quality samples or environment pre-exploration, entailing prohibitive supervision costs and poor scalability to unseen tasks. This paper proposes a zero-shot policy learning framework that directly generates actions from natural language instructions, requiring neither task annotations nor environmental pre-exploration. The approach introduces three key innovations: (1) the first language-to-decision contrastive pre-training paradigm; (2) a dynamics-aware generalized world model enabling cross-modal alignment between textual semantics and environment dynamics; and (3) an integrated architecture combining multi-task world-model encoding, a text-conditioned policy network, and a CLIP-style alignment mechanism. Evaluated on the MuJoCo and Meta-World benchmarks, the method substantially outperforms supervised fine-tuning, instruction-tuning, and imitation-learning baselines, achieving, for the first time, genuine text-driven zero-shot generalization across diverse robotic control tasks.
📝 Abstract
RL systems usually tackle generalization by inferring task beliefs from high-quality samples or warmup explorations. This restricted form limits their generality and usability, since such supervision signals are expensive, or even infeasible, to acquire in advance for unseen tasks. Learning directly from raw text describing decision tasks is a promising alternative that leverages a much broader source of supervision. In this paper, we propose Text-to-Decision Agent (T2DA), a simple and scalable framework that supervises generalist policy learning with natural language. We first introduce a generalized world model to encode multi-task decision data into a dynamics-aware embedding space. Then, inspired by CLIP, we predict which textual description goes with which decision embedding, effectively bridging their semantic gap via contrastive language-decision pre-training and aligning the text embeddings with the environment dynamics. After training the text-conditioned generalist policy, the agent can directly perform zero-shot text-to-decision generation in response to language instructions. Comprehensive experiments on the MuJoCo and Meta-World benchmarks show that T2DA achieves high-capacity zero-shot generalization and outperforms various types of baselines.
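To make the CLIP-style alignment step concrete, the following is a minimal NumPy sketch of a symmetric contrastive objective over a batch of paired text and decision embeddings. It is an illustration of the general technique the abstract describes, not T2DA's actual implementation: the encoder architectures, the temperature value, and the function name `contrastive_alignment_loss` are all assumptions for this example.

```python
import numpy as np

def contrastive_alignment_loss(text_emb, dec_emb, temperature=0.07):
    """CLIP-style symmetric contrastive loss.

    text_emb, dec_emb: (B, D) arrays; row i of each is a matched
    (task description, decision trajectory) embedding pair.
    """
    # L2-normalize so similarities are cosine similarities.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    d = dec_emb / np.linalg.norm(dec_emb, axis=1, keepdims=True)
    logits = t @ d.T / temperature          # (B, B) similarity matrix
    labels = np.arange(len(t))              # matched pairs on the diagonal

    def cross_entropy(lg):
        # Numerically stable log-softmax over each row.
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(lg)), labels].mean()

    # Average the text->decision and decision->text directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Minimizing this loss pulls each text embedding toward the decision embedding of the same task and pushes it away from the other tasks in the batch, which is how the text embeddings come to reflect environment dynamics.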