4D-VLA: Spatiotemporal Vision-Language-Action Pretraining with Cross-Scene Calibration

📅 2025-06-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing robot pretraining methods condition action prediction on simple, incomplete observation inputs, which disperses the conditional action distribution through coordinate-system misalignment and state inconsistency (the "coordinate system chaos" and "state chaos" of the abstract) and severely impairs generalization. To address this, the authors propose 4D-VLA, a framework that unifies sequential RGB-D inputs into a 4D spatiotemporal representation by jointly encoding depth and temporal dimensions. It enforces cross-scene coordinate alignment to bring robot and environment reference frames into a common coordinate system (see the sketch below) and introduces a memory-bank-driven keyframe sampling strategy to strengthen temporal modeling and spatial perception. Evaluated on both simulated and real-world tasks, 4D-VLA significantly outperforms OpenVLA in success rate; on the newly introduced multi-view benchmark MV-Bench, it further demonstrates superior spatial understanding and viewpoint generalization.
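
The page describes cross-scene coordinate alignment only at a high level. As a rough illustration of the idea, the sketch below back-projects a depth frame through assumed pinhole intrinsics and expresses the points in a shared robot-base frame, so observations from different scenes and cameras land in one coordinate system. The function name and the `K` / `T_base_cam` parameters are illustrative assumptions, not the paper's actual interface.

```python
# Minimal sketch (not the paper's code): align an RGB-D observation to the
# robot-base frame. Assumes a pinhole camera with intrinsics K and a known
# camera pose T_base_cam expressed in the robot-base frame.
import numpy as np

def depth_to_base_frame(depth: np.ndarray, K: np.ndarray,
                        T_base_cam: np.ndarray) -> np.ndarray:
    """depth: (H, W) in metres; K: (3, 3); T_base_cam: (4, 4) homogeneous pose.
    Returns an (H*W, 3) point cloud in the robot-base frame."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))  # pixel grids, each (H, W)
    z = depth.ravel()
    x = (u.ravel() - K[0, 2]) * z / K[0, 0]         # pixel -> camera frame
    y = (v.ravel() - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)])  # (4, N) homogeneous points
    pts_base = T_base_cam @ pts_cam                 # camera -> robot base
    return pts_base[:3].T
```

Once every scene's points live in the same base frame, features from different cameras and episodes can be compared and fused without per-camera bias, which is the misalignment the paper targets.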

📝 Abstract
Leveraging diverse robotic data for pretraining remains a critical challenge. Existing methods typically model the dataset's action distribution using simple observations as inputs. However, these inputs are often incomplete, resulting in a dispersed conditional action distribution, an issue we refer to as coordinate system chaos and state chaos. This inconsistency significantly hampers pretraining efficiency. To address this, we propose 4D-VLA, a novel approach that effectively integrates 4D information into the input to mitigate these sources of chaos. Our model introduces depth and temporal information into visual features with sequential RGB-D inputs, aligning the coordinate systems of the robot and the scene. This alignment endows the model with strong spatiotemporal reasoning capabilities while minimizing training overhead. Additionally, we introduce memory bank sampling, a frame sampling strategy designed to extract informative frames from historical images, further improving effectiveness and efficiency. Experimental results demonstrate that our pretraining method and architectural components substantially enhance model performance. In both simulated and real-world experiments, our model achieves a significant increase in success rate over OpenVLA. To further assess spatial perception and generalization to novel views, we introduce MV-Bench, a multi-view simulation benchmark. Our model consistently outperforms existing methods, demonstrating stronger spatial understanding and adaptability.
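
The abstract characterizes memory bank sampling only as a strategy for extracting informative frames from historical images, so the snippet below is one plausible reading rather than the paper's algorithm: walking back through the history, a frame joins the fixed-size bank only if its feature is dissimilar (by cosine similarity) from every frame already kept, so the retained history stays informative rather than redundant. All names and the similarity criterion are assumptions.

```python
# Hypothetical sketch of a memory-bank keyframe sampler; the paper's exact
# selection rule is not given on this page. Greedily keeps frames whose
# features differ from everything already in the bank.
import numpy as np

def memory_bank_sample(frame_feats: np.ndarray, bank_size: int = 8,
                       sim_thresh: float = 0.9) -> list:
    """frame_feats: (T, D) per-frame features, oldest first.
    Returns sorted indices of kept keyframes; the newest frame is always kept."""
    feats = frame_feats / (np.linalg.norm(frame_feats, axis=1, keepdims=True) + 1e-8)
    kept = [len(feats) - 1]                  # current frame is always kept
    for i in range(len(feats) - 2, -1, -1):  # walk backwards through history
        if len(kept) == bank_size:
            break
        # Keep frame i only if it is not a near-duplicate of any kept frame.
        if max(float(feats[i] @ feats[j]) for j in kept) < sim_thresh:
            kept.append(i)
    return sorted(kept)
```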
Problem

Research questions and friction points this paper is trying to address.

Addresses coordinate system and state chaos in robotic pretraining
Integrates 4D spatiotemporal information to align robot-scene coordinates
Assesses spatial perception and generalization to novel views via a multi-view benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates 4D spatiotemporal information to mitigate coordinate system and state chaos
Uses sequential RGB-D inputs to align robot and scene coordinate systems
Employs memory bank sampling to extract informative historical frames efficiently
Authors
Jiahui Zhang
School of Data Science, Fudan University
Yurui Chen
School of Data Science, Fudan University
Yueming Xu
School of Data Science, Fudan University
Ze Huang
School of Data Science, Fudan University
Yanpeng Zhou
Huawei Noah’s Ark Lab
Yu-Jie Yuan
Institute of Computing Technology, Chinese Academy of Sciences
Computer Graphics · 3D Vision · MLLM
Xinyue Cai
Huawei Noah’s Ark Lab
Guowei Huang
Huawei Noah’s Ark Lab
Xingyue Quan
Huawei Noah’s Ark Lab
Hang Xu
Huawei Noah’s Ark Lab
Li Zhang
School of Data Science, Fudan University