RationalVLA: A Rational Vision-Language-Action Model with Dual System

📅 2025-06-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing Vision-Language-Action (VLA) models cannot identify and reject defective natural-language instructions, which are common in real-world settings because of ambiguity, physical infeasibility, or mismatch with the environment. Method: We introduce RAMA, a benchmark of over 14,000 samples that systematically covers six categories of instruction defects: visual, physical, semantic, motion, safety, and out-of-context. We further propose RationalVLA, a dual-system VLA architecture that bridges a high-level vision-language model and a low-level manipulation policy through learnable latent-space embeddings, enabling the model to reason over instructions, reject infeasible commands, and execute feasible ones. Contribution/Results: RationalVLA outperforms state-of-the-art baselines on RAMA by a 14.5% higher success rate with a 0.94 average task length, while maintaining competitive performance on standard manipulation tasks. Real-robot experiments confirm its robustness and generalization. To our knowledge, this is the first work to systematically define, evaluate, and enhance the instruction-rejection capability of VLA models.
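
A minimal sketch of how such a dual-system design could be wired together is given below, assuming a PyTorch setup with a pretrained VLM backbone; the module names, dimensions, bridging attention, and feasibility head are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch of a dual-system VLA: learnable latent-space embeddings
# bridge a high-level VLM (System 2) and a low-level policy (System 1).
# All module names, shapes, and heads are assumptions for illustration only.
import torch
import torch.nn as nn


class DualSystemVLA(nn.Module):
    def __init__(self, vlm_backbone: nn.Module, hidden_dim: int = 1024,
                 num_latents: int = 32, action_dim: int = 7):
        super().__init__()
        self.vlm = vlm_backbone  # hypothetical pretrained vision-language model
        # Learnable latent queries that the VLM output fills with task semantics.
        self.latent_queries = nn.Parameter(torch.randn(num_latents, hidden_dim))
        self.bridge = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
        # Low-level policy decodes the latents into an action sequence.
        self.policy = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.action_head = nn.Linear(hidden_dim, action_dim)
        # Binary head deciding whether the instruction should be rejected.
        self.feasibility_head = nn.Linear(hidden_dim, 2)

    def forward(self, images, instruction_tokens):
        # High-level system: encode the instruction in the context of the scene.
        vlm_tokens = self.vlm(images, instruction_tokens)          # (B, T, hidden_dim)
        queries = self.latent_queries.expand(vlm_tokens.size(0), -1, -1)
        latents, _ = self.bridge(queries, vlm_tokens, vlm_tokens)  # (B, N, hidden_dim)
        feasibility_logits = self.feasibility_head(latents.mean(dim=1))
        # Low-level system: decode latents into actions.
        policy_out, _ = self.policy(latents)
        actions = self.action_head(policy_out)                     # (B, N, action_dim)
        return actions, feasibility_logits
```

At inference, the feasibility logits would gate execution: an instruction predicted to be defective is answered with a rejection instead of rolling out the decoded action sequence.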

📝 Abstract
A fundamental requirement for real-world robotic deployment is the ability to understand and respond to natural language instructions. Existing language-conditioned manipulation tasks typically assume that instructions are perfectly aligned with the environment. This assumption limits robustness and generalization in realistic scenarios where instructions may be ambiguous, irrelevant, or infeasible. To address this problem, we introduce RAtional MAnipulation (RAMA), a new benchmark that challenges models with both unseen executable instructions and defective ones that should be rejected. In RAMA, we construct a dataset with over 14,000 samples, including diverse defective instructions spanning six dimensions: visual, physical, semantic, motion, safety, and out-of-context. We further propose the Rational Vision-Language-Action model (RationalVLA). It is a dual system for robotic arms that integrates the high-level vision-language model with the low-level manipulation policy by introducing learnable latent space embeddings. This design enables RationalVLA to reason over instructions, reject infeasible commands, and execute manipulation effectively. Experiments demonstrate that RationalVLA outperforms state-of-the-art baselines on RAMA by a 14.5% higher success rate and 0.94 average task length, while maintaining competitive performance on standard manipulation tasks. Real-world trials further validate its effectiveness and robustness in practical applications. Our project page is https://irpn-eai.github.io/rationalvla.
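
To make the benchmark setup concrete, here is a hedged sketch of how RAMA-style samples and the two reported metrics (success rate and average task length) could be represented; the field names, defect labels, and scoring convention are assumptions for illustration, not the released benchmark format.

```python
# Hypothetical representation of RAMA-style samples and the reported metrics.
# Field names and the scoring convention are illustrative assumptions.
from dataclasses import dataclass
from typing import List, Optional

# Six defect dimensions named in the abstract.
DEFECT_TYPES = ("visual", "physical", "semantic", "motion", "safety", "out-of-context")


@dataclass
class RamaSample:
    instruction: str
    defect_type: Optional[str]  # None means the instruction is executable
    should_reject: bool         # defective instructions must be rejected


@dataclass
class EpisodeResult:
    sample: RamaSample
    rejected: bool              # did the model refuse the instruction?
    subtasks_completed: int     # consecutive subtasks finished before failure


def is_success(r: EpisodeResult, subtasks_total: int = 1) -> bool:
    # Correct behaviour: reject defective instructions, complete executable ones.
    if r.sample.should_reject:
        return r.rejected
    return (not r.rejected) and r.subtasks_completed >= subtasks_total


def evaluate(results: List[EpisodeResult]) -> dict:
    n = len(results)
    return {
        "success_rate": sum(is_success(r) for r in results) / n,
        "avg_task_length": sum(r.subtasks_completed for r in results) / n,
    }
```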
Problem

Research questions and friction points this paper is trying to address.

Handling ambiguous or infeasible robotic instructions
Improving robustness in language-conditioned manipulation tasks
Integrating vision-language models with manipulation policies
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual system integrating vision-language and manipulation
Learnable latent space embeddings for instruction reasoning
Rejects infeasible commands while executing feasible ones
Wenxuan Song
The Hong Kong University of Science and Technology (Guangzhou)
Vision-language-action Model, Robotics
Jiayi Chen
The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China
Wenxue Li
Harbin Institute of Technology Weihai
Xu He
Nanyang Technological University
Han Zhao
Westlake University, Hangzhou, China
Pengxiang Ding
Zhejiang University
Human Motion Prediction, Large Language Model, Embodied AI
Donglin Wang
Westlake University, Hangzhou, China
Shiyan Su
Monash University, Melbourne, Australia
Feilong Tang
Monash University, Melbourne, Australia
Xuelian Cheng
Monash University
3D Vision, Medical Imaging, Machine Learning
Zongyuan Ge
Monash University, Melbourne, Australia
Haoang Li
Assistant Professor, Hong Kong University of Science and Technology (Guangzhou)
Robotics, 3D Computer Vision
Hesheng Wang
Shanghai Jiao Tong University
Zhe Liu
Shanghai Jiao Tong University
Yunhui Liu
Nanjing University
Graph Machine Learning