Towards Generalizable Robotic Manipulation in Dynamic Environments

📅 2026-03-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language-action (VLA) models suffer significant performance degradation in dynamic environments, owing to insufficient temporal awareness and a lack of relevant training data. To address this, this work proposes PUMA, a dynamics-aware VLA architecture, along with DOMINO, a large-scale dynamic manipulation dataset and the first hierarchical task benchmark for general-purpose dynamic manipulation. PUMA implicitly predicts short-term future states of target objects by combining scene-centric historical optical flow encoding, object-centric world queries, and expert trajectory imitation learning, thereby improving spatiotemporal generalization. Experiments show that PUMA achieves state-of-the-art performance on dynamic tasks, improving success rates by 6.3% (absolute) over baseline methods, and that its dynamics-trained representations transfer effectively to static manipulation tasks.

📝 Abstract
Vision-Language-Action (VLA) models excel in static manipulation but struggle in dynamic environments with moving targets. This performance gap primarily stems from a scarcity of dynamic manipulation datasets and the reliance of mainstream VLAs on single-frame observations, restricting their spatiotemporal reasoning capabilities. To address this, we introduce DOMINO, a large-scale dataset and benchmark for generalizable dynamic manipulation, featuring 35 tasks with hierarchical complexities, over 110K expert trajectories, and a multi-dimensional evaluation suite. Through comprehensive experiments, we systematically evaluate existing VLAs on dynamic tasks, explore effective training strategies for dynamic awareness, and validate the generalizability of dynamic data. Furthermore, we propose PUMA, a dynamics-aware VLA architecture. By integrating scene-centric historical optical flow and specialized world queries to implicitly forecast object-centric future states, PUMA couples history-aware perception with short-horizon prediction. Results demonstrate that PUMA achieves state-of-the-art performance, yielding a 6.3% absolute improvement in success rate over baselines. Moreover, we show that training on dynamic data fosters robust spatiotemporal representations that transfer to static tasks. All code and data are available at https://github.com/H-EmbodVis/DOMINO.
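The core idea the abstract describes — using a short observation history to forecast where a moving target will be, then acting on the forecast rather than the latest frame — can be illustrated with a toy sketch. This is NOT the paper's implementation: real optical flow features and learned world queries are replaced here by a tracked object centroid and a constant-velocity extrapolation, purely to make the history-to-forecast step concrete.

```python
import numpy as np

def forecast_target(history, horizon=3):
    """Toy stand-in for dynamics-aware forecasting: predict the object's
    position `horizon` steps ahead by averaging recent per-step
    displacements (a constant-velocity proxy for querying flow history).
    `history` is a (T, 2) track of object centroids in image coordinates."""
    history = np.asarray(history, dtype=float)
    velocity = np.diff(history, axis=0).mean(axis=0)  # mean displacement per step
    return history[-1] + horizon * velocity

# Object moving 1 unit/step right and 0.5 units/step up; a policy
# conditioned on this forecast would reach toward (5, 2.5), not (3, 1.5).
track = [(0.0, 0.0), (1.0, 0.5), (2.0, 1.0), (3.0, 1.5)]
print(forecast_target(track, horizon=2))  # → [5.  2.5]
```

In PUMA the analogous prediction is implicit: world queries attend over encoded flow history inside the network rather than producing an explicit coordinate, but the motivation is the same closed-loop gap this toy exposes.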
Problem

Research questions and friction points this paper is trying to address.

dynamic manipulation
Vision-Language-Action models
spatiotemporal reasoning
moving targets
generalizable robotic manipulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

dynamic manipulation
Vision-Language-Action (VLA)
spatiotemporal reasoning
optical flow
generalizable robotics
Heng Fang
Huazhong University of Science and Technology

Shangru Li
Huazhong University of Science and Technology

Shuhan Wang
Huazhong University of Science and Technology

Xuanyang Xi
Huawei Technologies Co., Ltd.

Dingkang Liang
Huazhong University of Science and Technology
Embodied AI, World Model, Autonomous Driving, Crowd Counting

Xiang Bai
Huazhong University of Science and Technology (HUST)
Computer Vision, OCR