OASIS: Observation-Action Space Alignment via SE(3) Trajectory Prediction for Robotic Manipulation

📅 2026-05-25

🤖 AI Summary

This work addresses a limitation of existing vision-language-action (VLA) models and world action models (WAMs): their intermediate representations stay in the observation space and do not explicitly capture the rigid-body geometry of the action space, so the action decoder has to recover that geometry implicitly. To close this gap, the authors propose OASIS, which aligns a 3D-aware perceptual representation (fusing vision-language and metric-depth features) with the action space through SE(3) end-effector trajectory prediction. The method uses SE(3) geometric structure as an explicit bridge between observations and actions in visuomotor policy learning. It combines 3D feature encoding, pose-supervised prediction of a camera-frame end-effector trajectory, and chunked action generation conditioned on the trajectory predictor's hidden states. In simulation and real-world experiments, the model outperforms VLA and WAM baselines in task success rate and out-of-distribution generalization.
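The 3D feature encoding step described above can be made concrete with a minimal sketch. It assumes a pinhole camera with known intrinsics (`fx`, `fy`, `cx`, `cy`) and a metric depth map in meters. Each pixel is back-projected into a camera-frame 3D point, and that point is attached to the pixel's vision-language feature. The names `unproject_depth` and `fuse_features` are hypothetical, and plain concatenation is only one illustrative fusion choice. This is not the OASIS encoder, whose fusion mechanism this summary does not describe.

```python
from typing import List, Sequence, Tuple

Point3 = Tuple[float, float, float]


def unproject_depth(depth: Sequence[Sequence[float]],
                    fx: float, fy: float, cx: float, cy: float) -> List[List[Point3]]:
    """Back-project a metric depth map (meters) into camera-frame 3D points.

    Pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy, Z = depth[v][u].
    """
    points: List[List[Point3]] = []
    for v, row in enumerate(depth):          # v indexes image rows
        points_row: List[Point3] = []
        for u, z in enumerate(row):          # u indexes image columns
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points_row.append((x, y, z))
        points.append(points_row)
    return points


def fuse_features(vl_features: Sequence[Sequence[Sequence[float]]],
                  points: Sequence[Sequence[Point3]]) -> List[List[List[float]]]:
    """Hypothetical fusion: append each pixel's 3D point to its feature vector."""
    return [[list(f) + list(p) for f, p in zip(f_row, p_row)]
            for f_row, p_row in zip(vl_features, points)]


# Usage: a 2x2 depth map at 2 m with the principal point at the image center.
pts = unproject_depth([[2.0, 2.0], [2.0, 2.0]], fx=1.0, fy=1.0, cx=0.5, cy=0.5)
feats = [[[0.1, 0.2], [0.3, 0.4]], [[0.5, 0.6], [0.7, 0.8]]]
fused = fuse_features(feats, pts)
```

Because the depth is metric, the attached coordinates are in meters. A downstream trajectory predictor could then reason about distances in the same units as end-effector translations.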

📝 Abstract

Recent vision-language-action (VLA) models and world action models (WAMs) advance robotic manipulation by enriching intermediate representations with auxiliary spatial features or future visual-state prediction. However, these representations largely remain within the observation space and do not share the rigid-body geometry of the action space, forcing the action decoder to implicitly recover this geometry. We propose OASIS, a visuomotor policy that aligns the intermediate representation with the action space via $SE(3)$ end-effector trajectory prediction. OASIS couples a 3D-aware feature encoder that fuses vision-language and metric-depth features with an $SE(3)$ trajectory predictor that produces a camera-frame end-effector trajectory. Conditioned on the predictor's pose-supervised hidden states, the action decoder generates action chunks consistent with rigid-body motion. Across simulation and real-world experiments, OASIS outperforms VLA and WAM baselines in success rate and out-of-distribution generalization. Our project page is available at https://npuhandsome.github.io/OASIS_web.
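The abstract's camera-frame end-effector trajectory, pose supervision, and action chunks can be illustrated with a small sketch that uses 4x4 homogeneous transforms. It assumes the trajectory is recorded in the robot base frame and that a base-to-camera extrinsic `base_T_cam` is known. Pose error is measured as a Euclidean translation error plus the geodesic rotation angle, and the trajectory is split into fixed-size chunks. All function names are hypothetical. The abstract does not give OASIS's actual pose parameterization, loss, or chunk length.

```python
import math
from typing import List, Sequence, Tuple

Mat4 = List[List[float]]  # 4x4 homogeneous rigid transform [R p; 0 1]


def make_pose(yaw: float, x: float, y: float, z: float) -> Mat4:
    """Pose with a rotation of `yaw` radians about z and translation (x, y, z)."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, 0.0, x], [s, c, 0.0, y], [0.0, 0.0, 1.0, z], [0.0, 0.0, 0.0, 1.0]]


def matmul4(a: Mat4, b: Mat4) -> Mat4:
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)] for i in range(4)]


def se3_inverse(t: Mat4) -> Mat4:
    """Inverse of [R p; 0 1] is [R^T  -R^T p; 0 1]."""
    r_t = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [t[i][3] for i in range(3)]
    p_inv = [-sum(r_t[i][k] * p[k] for k in range(3)) for i in range(3)]
    return [r_t[i] + [p_inv[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def to_camera_frame(base_T_cam: Mat4, base_T_ee_traj: Sequence[Mat4]) -> List[Mat4]:
    """Express each waypoint in the camera frame: cam_T_ee = inv(base_T_cam) @ base_T_ee."""
    cam_T_base = se3_inverse(base_T_cam)
    return [matmul4(cam_T_base, t) for t in base_T_ee_traj]


def pose_error(pred: Mat4, target: Mat4) -> Tuple[float, float]:
    """Translation error (meters) and geodesic rotation angle (radians)."""
    trans = math.sqrt(sum((pred[i][3] - target[i][3]) ** 2 for i in range(3)))
    # trace(R_pred^T R_target) equals the element-wise (Frobenius) inner product.
    tr = sum(pred[i][j] * target[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (tr - 1.0) / 2.0))  # clamp for numerical safety
    return trans, math.acos(cos_angle)


def chunk(traj: Sequence, size: int) -> List[list]:
    """Split a trajectory into consecutive chunks of `size` waypoints (last may be shorter)."""
    return [list(traj[i:i + size]) for i in range(0, len(traj), size)]


# Usage: camera mounted 1 m along the base x-axis, two base-frame waypoints.
base_T_cam = make_pose(0.0, 1.0, 0.0, 0.0)
traj = [make_pose(0.0, 1.5, 0.0, 0.2), make_pose(0.0, 1.0, 0.5, 0.2)]
cam_traj = to_camera_frame(base_T_cam, traj)
err_t, err_r = pose_error(make_pose(0.1, 0.0, 0.0, 0.0), make_pose(0.0, 0.3, 0.4, 0.0))
```

Expressing waypoints in the camera frame keeps the supervision target in the same frame as the observation. This is one plausible reading of the abstract's choice of a camera-frame trajectory. The geodesic angle reports rotation error directly in radians, which is easier to interpret than an element-wise matrix difference.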

Problem

Research questions and friction points this paper is trying to address.

visuomotor policy

action space

observation space

SE(3) trajectory

robotic manipulation

Innovation

Methods, ideas, or system contributions that make the work stand out.

SE(3) trajectory prediction

observation-action alignment

visuomotor policy