🤖 AI Summary
Underwater robots face significant challenges, including complex hydrodynamics, limited visibility, and constrained communication, and progress toward general underwater embodied intelligence is further hindered by the scarcity of large-scale, high-quality multimodal data. To address this, the authors introduce USIM, a simulation-based multi-task vision-language-action (VLA) dataset for underwater robots comprising over 561K frames from 1,852 trajectories, roughly 15.6 hours of BlueROV2 interactions across 20 tasks in 9 diverse scenarios. Building on USIM, they propose U0, a VLA model for general underwater robots that fuses binocular vision with other sensor modalities and incorporates a convolution-attention-based perception focus enhancement (CAP) module to strengthen spatial understanding and mobile manipulation. Across tasks such as inspection, obstacle avoidance, scanning, and dynamic tracking, U0 achieves an 80% success rate, and in challenging mobile manipulation tasks it reduces the distance to the target by 21.2% compared with baseline methods.
📝 Abstract
Underwater environments present unique challenges for robotic operation, including complex hydrodynamics, limited visibility, and constrained communication. Although data-driven approaches have advanced embodied intelligence in terrestrial robots and enabled task-specific autonomous underwater robots, developing underwater intelligence capable of autonomously performing multiple tasks remains highly challenging, as large-scale, high-quality underwater datasets are still scarce. To address these limitations, we introduce USIM, a simulation-based multi-task Vision-Language-Action (VLA) dataset for underwater robots. USIM comprises over 561K frames from 1,852 trajectories, totaling approximately 15.6 hours of BlueROV2 interactions across 20 tasks in 9 diverse scenarios, ranging from visual navigation to mobile manipulation. Building upon this dataset, we propose U0, a VLA model for general underwater robots, which integrates binocular vision and other sensor modalities through multimodal fusion, and further incorporates a convolution-attention-based perception focus enhancement module (CAP) to improve spatial understanding and mobile manipulation. Across tasks such as inspection, obstacle avoidance, scanning, and dynamic tracking, the framework achieves a success rate of 80%, while in challenging mobile manipulation tasks, it reduces the distance to the target by 21.2% compared with baseline methods, demonstrating its effectiveness. USIM and U0 show that VLA models can be effectively applied to underwater robotic applications, providing a foundation for scalable dataset construction, improved task autonomy, and the practical realization of intelligent general underwater robots.
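The abstract describes the CAP module only at a high level: convolutional features are combined with attention so the model can focus on task-relevant regions of the scene. The internals are not specified here, so the following NumPy sketch is purely illustrative, not the authors' implementation. It assumes a per-frame feature map of shape (H, W, C), uses a 3×3 convolution with random stand-in weights for local structure, applies single-head self-attention over the H×W spatial positions, and adds a residual connection; all names and shapes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cap_block(feat, rng):
    """Hypothetical CAP-style block (illustrative only): a 3x3 convolution
    extracts local structure, then spatial self-attention re-weights
    positions to emphasize task-relevant regions. feat has shape (H, W, C)."""
    H, W, C = feat.shape
    # 3x3 "same" convolution; random weights stand in for learned ones
    k = rng.standard_normal((3, 3, C, C)) * 0.1
    padded = np.pad(feat, ((1, 1), (1, 1), (0, 0)))
    conv = np.zeros_like(feat)
    for i in range(H):
        for j in range(W):
            patch = padded[i:i+3, j:j+3, :]          # (3, 3, C) neighborhood
            conv[i, j] = np.einsum('hwc,hwcd->d', patch, k)
    # single-head self-attention over the H*W spatial tokens
    tokens = conv.reshape(H * W, C)
    Wq, Wk, Wv = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))
    q, kk, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ kk.T / np.sqrt(C), axis=-1)   # (HW, HW) attention map
    out = (attn @ v).reshape(H, W, C)
    return feat + out                                # residual connection

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 4))                   # toy 8x8 feature map
y = cap_block(x, rng)
print(y.shape)  # (8, 8, 4)
```

In a real model the convolution and projection weights would be learned end-to-end with the rest of the VLA backbone; the sketch only shows how local convolutional features and global spatial attention can be composed in one block.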