USIM and U0: A Vision-Language-Action Dataset and Model for General Underwater Robots

📅 2025-10-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Underwater robots face significant challenges, including hydrodynamic disturbances, low visibility, and constrained communication, all exacerbated by the scarcity of large-scale, high-quality multimodal data, which hinders the development of general embodied intelligence. To address this, the authors introduce USIM, a simulation-based vision-language-action (VLA) dataset for underwater robotics comprising 15.6 hours of multi-task interaction data from a BlueROV2 operating across nine underwater scenarios. Building on USIM, they propose U0, a general-purpose underwater robot model that fuses binocular vision with other sensor modalities and incorporates a convolution-attention-based perception focus enhancement (CAP) module. Experiments show that U0 achieves an 80% success rate across tasks such as inspection, obstacle avoidance, scanning, and dynamic tracking, and reduces the distance to the target in mobile manipulation by 21.2% compared with baselines, advancing underwater spatial understanding and autonomous manipulation.

📝 Abstract
Underwater environments present unique challenges for robotic operation, including complex hydrodynamics, limited visibility, and constrained communication. Although data-driven approaches have advanced embodied intelligence in terrestrial robots and enabled task-specific autonomous underwater robots, developing underwater intelligence capable of autonomously performing multiple tasks remains highly challenging, as large-scale, high-quality underwater datasets are still scarce. To address these limitations, we introduce USIM, a simulation-based multi-task Vision-Language-Action (VLA) dataset for underwater robots. USIM comprises over 561K frames from 1,852 trajectories, totaling approximately 15.6 hours of BlueROV2 interactions across 20 tasks in 9 diverse scenarios, ranging from visual navigation to mobile manipulation. Building upon this dataset, we propose U0, a VLA model for general underwater robots, which integrates binocular vision and other sensor modalities through multimodal fusion, and further incorporates a convolution-attention-based perception focus enhancement module (CAP) to improve spatial understanding and mobile manipulation. Across tasks such as inspection, obstacle avoidance, scanning, and dynamic tracking, the framework achieves a success rate of 80%, while in challenging mobile manipulation tasks, it reduces the distance to the target by 21.2% compared with baseline methods, demonstrating its effectiveness. USIM and U0 show that VLA models can be effectively applied to underwater robotic applications, providing a foundation for scalable dataset construction, improved task autonomy, and the practical realization of intelligent general underwater robots.
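The abstract describes CAP only at a high level: a convolution-attention-based module that sharpens perception for spatial understanding and mobile manipulation. The paper's actual architecture is not given here, so purely as a rough, hypothetical illustration of the general pattern (local convolution followed by attention-style channel re-weighting, in the spirit of squeeze-and-excitation gating), a minimal NumPy sketch might look like:

```python
import numpy as np

def conv3x3(x, w):
    """Naive 3x3 convolution with zero padding, stride 1.

    x: (C_in, H, W) feature map; w: (C_out, C_in, 3, 3) kernels.
    """
    c_in, h, wd = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], h, wd))
    for o in range(w.shape[0]):
        for i in range(c_in):
            for dy in range(3):
                for dx in range(3):
                    out[o] += w[o, i, dy, dx] * xp[i, dy:dy + h, dx:dx + wd]
    return out

def channel_attention(x):
    """Squeeze-and-excitation-style gating: global average pool per channel,
    sigmoid, then rescale each channel. A stand-in for attention; the paper's
    CAP module may differ substantially."""
    pooled = x.mean(axis=(1, 2))            # (C,) channel descriptors
    gate = 1.0 / (1.0 + np.exp(-pooled))    # sigmoid gate in (0, 1)
    return x * gate[:, None, None]

def cap_block(x, w):
    """Hypothetical CAP-like block: conv for local features, then
    attention-style channel re-weighting to focus perception."""
    return channel_attention(conv3x3(x, w))

rng = np.random.default_rng(0)
feat = rng.standard_normal((4, 8, 8))          # toy (C, H, W) feature map
kern = rng.standard_normal((4, 4, 3, 3)) * 0.1
out = cap_block(feat, kern)
print(out.shape)  # (4, 8, 8): spatial size preserved, channels re-weighted
```

The function names and the squeeze-and-excitation gating are assumptions for illustration only; the block simply shows how convolutional feature extraction and attention-based re-weighting compose, which is the design idea the abstract attributes to CAP.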
Problem

Research questions and friction points this paper is trying to address.

Addressing underwater robotics challenges through a simulation-based dataset
Developing a vision-language-action model for multi-task autonomous operation
Improving spatial understanding and manipulation in constrained environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Simulation-based multi-task Vision-Language-Action underwater dataset
VLA model with multimodal fusion and perception enhancement module
Convolution-attention module improves spatial understanding and manipulation
Junwen Gu
The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China; The School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China
Zhiheng Wu
Baidu Inc., Beijing 100085, China
Pengxuan Si
The School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China
Shuang Qiu
City University of Hong Kong
Reinforcement Learning · Agentic AI · Large Language Models · Embodied AI
Yukai Feng
The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China; The School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China
Luoyang Sun
Institute of Automation, Chinese Academy of Sciences
Machine Learning
Laien Luo
The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China; The School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China
Lianyi Yu
The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China; The School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China
Jian Wang
The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China; The School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China
Zhengxing Wu
The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China; The School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China