OASIS: Observation-Action Space Alignment via SE(3) Trajectory Prediction for Robotic Manipulation

📅 2026-05-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of existing vision-language-action (VLA) models: their intermediate representations are confined to the observation space and do not explicitly capture the geometry of rigid-body motion. The authors propose aligning 3D perceptual representations, which fuse visual, linguistic, and depth information, with the action space through SE(3) end-effector trajectory prediction. The method uses SE(3) geometric structure as an explicit bridge between observations and actions in visuomotor policy learning, combining pose-supervised trajectory prediction, 3D feature encoding, and chunked action generation. In simulation and real-world experiments, the proposed model outperforms VLA and WAM baselines in task success rate and out-of-distribution generalization.
📝 Abstract
Recent vision-language-action (VLA) models and world action models (WAMs) advance robotic manipulation by enriching intermediate representations with auxiliary spatial features or future visual-state prediction. However, these representations largely remain within the observation space and do not share the rigid-body geometry of the action space, forcing the action decoder to implicitly recover this geometry. We propose OASIS, a visuomotor policy that aligns the intermediate representation with the action space via $SE(3)$ end-effector trajectory prediction. OASIS couples a 3D-aware feature encoder that fuses vision-language and metric-depth features with an $SE(3)$ trajectory predictor that produces a camera-frame end-effector trajectory. Conditioned on the predictor's pose-supervised hidden states, the action decoder generates action chunks consistent with rigid-body motion. Across simulation and real-world experiments, OASIS outperforms VLA and WAM baselines in success rate and out-of-distribution generalization. Our project page is available at https://npuhandsome.github.io/OASIS_web.
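To make the idea of an intermediate representation that shares the rigid-body geometry of the action space more concrete, the following is a minimal sketch of how an $SE(3)$ end-effector trajectory relates to relative motions. It is not the authors' implementation: the helper names (`make_pose`, `invert_pose`, `relative_motions`, `integrate`), the yaw-only rotations, and the toy trajectory values are all assumptions chosen for illustration. Poses are written as 4x4 homogeneous matrices, and the relative motion between consecutive poses is taken as $T_k^{-1} T_{k+1}$.

```python
import numpy as np

def make_pose(yaw_rad, translation):
    """Build a 4x4 homogeneous SE(3) matrix from a yaw angle and a translation.
    (Hypothetical helper; a real pose would carry a full 3D rotation.)"""
    c, s = np.cos(yaw_rad), np.sin(yaw_rad)
    pose = np.eye(4)
    pose[:3, :3] = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    pose[:3, 3] = translation
    return pose

def invert_pose(pose):
    """Invert an SE(3) matrix via R^T and -R^T t instead of a general inverse."""
    rot, trans = pose[:3, :3], pose[:3, 3]
    inv = np.eye(4)
    inv[:3, :3] = rot.T
    inv[:3, 3] = -rot.T @ trans
    return inv

def relative_motions(trajectory):
    """Relative transform between consecutive poses: inv(T_k) @ T_{k+1}."""
    return [invert_pose(a) @ b for a, b in zip(trajectory[:-1], trajectory[1:])]

def integrate(start_pose, deltas):
    """Compose relative motions from a start pose to rebuild the trajectory."""
    poses = [start_pose]
    for delta in deltas:
        poses.append(poses[-1] @ delta)
    return poses

# Toy camera-frame end-effector trajectory (illustrative values only).
trajectory = [make_pose(0.1 * k, [0.05 * k, 0.0, 0.3]) for k in range(4)]
deltas = relative_motions(trajectory)
recovered = integrate(trajectory[0], deltas)

# Composing the relative motions reproduces the original trajectory.
round_trip_ok = all(np.allclose(a, b) for a, b in zip(trajectory, recovered))
print(round_trip_ok)
```

Because a sequence of relative motions and a pose trajectory determine each other through group composition, a predicted $SE(3)$ trajectory already lives in the same geometric structure as end-effector actions, which is the kind of alignment the abstract describes. How OASIS parameterizes rotations or defines its action chunks is not stated in this listing, so those details are left out here.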
Problem

Research questions and friction points this paper is trying to address.

visuomotor policy
action space
observation space
SE(3) trajectory
robotic manipulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

SE(3) trajectory prediction
observation-action alignment
visuomotor policy
3D-aware representation
rigid-body motion consistency
Xinzhe Chen
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Sihua Ren
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Liqi Huang
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Haowen Sun
Department of Automation, Tsinghua University
Computer Vision
Mingyang Li
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Xingyu Chen
PhD Candidate, University of Technology Sydney, Australian National University
Spatial Audio, HRTF
Zeyang Liu
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University
Xuguang Lan
National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University