PRTS: A Primitive Reasoning and Tasking System via Contrastive Representations

📅 2026-04-30

🤖 AI Summary
Existing vision-language-action models reduce pretraining to supervised behavior cloning and neglect the goal-directed nature and temporal dynamics of robot learning. The authors reframe pretraining as goal-conditioned reinforcement learning in which language instructions serve as goals. Within a unified embedding space, contrastive learning aligns state-action embeddings with goal embeddings so that their inner product approximates the log-discounted goal occupancy, that is, how likely the language-specified goal is to be reached from the current state-action. The score therefore reflects physical feasibility rather than static semantic similarity. This goal-reachability signal is dense and self-supervised: it is extracted from offline trajectories without reward annotations, and a role-aware causal mask integrates it into the vision-language backbone with negligible overhead. The method achieves state-of-the-art performance on LIBERO, SimplerEnv, and 14 real-world complex tasks, with the largest gains in long-horizon, contact-rich, and zero-shot novel-instruction scenarios.
📝 Abstract
Vision-Language-Action (VLA) models advance robotic control via strong visual-linguistic priors. However, existing VLAs predominantly frame pretraining as supervised behavior cloning, overlooking the fundamental nature of robot learning as a goal-reaching process that requires understanding temporal task progress. We present PRTS (Primitive Reasoning and Tasking System), a VLA foundation model that reformulates pretraining through Goal-Conditioned Reinforcement Learning. By treating language instructions as goals and employing contrastive reinforcement learning, PRTS learns a unified embedding space where the inner product of state-action and goal embeddings approximates the log-discounted goal occupancy, the probability of reaching the language-specified goal from the current state-action, quantitatively assessing physical feasibility beyond static semantic matching. PRTS draws this dense goal-reachability supervision directly from offline trajectories without reward annotations, and folds it into the VLM backbone via a role-aware causal mask, incurring negligible overhead over vanilla behavior cloning. This paradigm endows the high-level reasoning system with intrinsic goal reachability awareness, bridging semantic reasoning and temporal task progress, and further benefits goal-conditioned action prediction. Pretrained on 167B tokens of diverse manipulation and embodied-reasoning data, PRTS reaches state-of-the-art performance on LIBERO, LIBERO-Pro, LIBERO-Plus, SimplerEnv, and a real-world suite of 14 complex tasks, with particularly substantial gains on long-horizon, contact-rich, and zero-shot novel-instruction settings, confirming that injecting goal-reachability awareness significantly improves both execution success and long-horizon planning of general-purpose robotic foundation policies.
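The contrastive objective described in the abstract can be illustrated with a small sketch. The abstract does not specify the exact loss, so this assumes an InfoNCE-style objective of the kind commonly used in contrastive reinforcement learning: row i of a batch of state-action embeddings is paired with row i of a batch of goal embeddings, and the other goals in the batch act as negatives. The function name `info_nce_loss`, the batch size, and the toy embeddings are hypothetical and are not taken from the paper.

```python
import numpy as np


def info_nce_loss(sa_emb: np.ndarray, goal_emb: np.ndarray) -> float:
    """InfoNCE loss where sa_emb[i] and goal_emb[i] form the positive pair.

    Assumption: an inner-product critic, matching the abstract's statement
    that the inner product of the two embeddings scores goal reachability.
    """
    logits = sa_emb @ goal_emb.T  # (B, B) matrix of inner products
    # Numerically stable log-softmax over the goals in each row.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Positive pairs sit on the diagonal.
    return float(-np.mean(np.diag(log_probs)))


# Toy data (hypothetical): 8 mutually orthogonal goal embeddings.
batch = 8
goal = 5.0 * np.eye(batch)

aligned_sa = goal.copy()                 # each state-action matches its own goal
shifted_sa = np.roll(goal, 1, axis=0)    # each state-action matches a wrong goal
uninformative_sa = np.zeros_like(goal)   # every goal scores the same

loss_aligned = info_nce_loss(aligned_sa, goal)
loss_shifted = info_nce_loss(shifted_sa, goal)
loss_uniform = info_nce_loss(uninformative_sa, goal)

print(loss_aligned, loss_shifted, loss_uniform)
```

In this toy setting the loss is near zero when every state-action embedding points at its own goal, large when it points at a different goal, and equal to log(batch) when the embeddings carry no information. How PRTS selects positive pairs from offline trajectories and how the loss is combined with action prediction are not described in this summary, so the sketch leaves them out.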
Problem

Research questions and friction points this paper is trying to address.

Vision-Language-Action models
goal-reaching
temporal task progress
robotic control
goal reachability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Goal-Conditioned Reinforcement Learning
Contrastive Representations
Vision-Language-Action Models
Goal Reachability Awareness
Offline Robot Learning
Yang Zhang
Ph.D. Student, Tsinghua University
Reinforcement Learning, Multi-Agent Systems, Embodied AI, World Models
Jiangyuan Zhao
Institute of Artificial Intelligence (TeleAI), China Telecom; Shanghai Jiao Tong University
Chenyou Fan
Associate Professor, South China Normal University
Computer Vision, Machine Learning
Fangzheng Yan
Institute of Artificial Intelligence (TeleAI), China Telecom
Tian Li
Institute of Artificial Intelligence (TeleAI), China Telecom
Haitong Tang
Institute of Artificial Intelligence (TeleAI), China Telecom
Sen Fu
Institute of Artificial Intelligence (TeleAI), China Telecom
Xuan'er Wu
Institute of Artificial Intelligence (TeleAI), China Telecom
Qizhen Weng
Hong Kong University of Science and Technology
Machine Learning Systems, AI Infrastructure, Cloud Computing
Weinan Zhang
Professor, Shanghai Jiao Tong University
Reinforcement Learning, Agents, Data Science
Xiu Li
Bytedance Seed
Computer Vision, Computer Graphics, 3D Vision
Chi Zhang
Institute of Artificial Intelligence (TeleAI), China Telecom
Chenjia Bai
Institute of Artificial Intelligence (TeleAI), China Telecom
Reinforcement Learning, Robotics, Embodied AI
Xuelong Li
Institute of Artificial Intelligence (TeleAI), China Telecom