🤖 AI Summary
Underwater robots face significant challenges, including complex hydrodynamics, limited visibility, and constrained communication, and progress toward general underwater embodied intelligence is further hindered by the scarcity of large-scale, high-quality multimodal data. To address this, the authors introduce USIM, a simulation-based multi-task vision-language-action (VLA) dataset for underwater robots comprising over 561K frames from 1,852 trajectories, roughly 15.6 hours of BlueROV2 interactions across 20 tasks in 9 diverse scenarios. Building on USIM, they propose U0, a VLA model for general underwater robots that fuses binocular vision with other sensor modalities and incorporates a convolution-attention-based perception focus enhancement (CAP) module to strengthen spatial understanding and mobile manipulation. Across tasks such as inspection, obstacle avoidance, scanning, and dynamic tracking, U0 achieves an 80% success rate, and in challenging mobile manipulation tasks it reduces the distance to the target by 21.2% compared with baseline methods.
📝 Abstract
Underwater environments present unique challenges for robotic operation, including complex hydrodynamics, limited visibility, and constrained communication. Although data-driven approaches have advanced embodied intelligence in terrestrial robots and enabled task-specific autonomous underwater robots, developing underwater intelligence capable of autonomously performing multiple tasks remains highly challenging, as large-scale, high-quality underwater datasets are still scarce. To address these limitations, we introduce USIM, a simulation-based multi-task Vision-Language-Action (VLA) dataset for underwater robots. USIM comprises over 561K frames from 1,852 trajectories, totaling approximately 15.6 hours of BlueROV2 interactions across 20 tasks in 9 diverse scenarios, ranging from visual navigation to mobile manipulation. Building upon this dataset, we propose U0, a VLA model for general underwater robots, which integrates binocular vision and other sensor modalities through multimodal fusion, and further incorporates a convolution-attention-based perception focus enhancement module (CAP) to improve spatial understanding and mobile manipulation. Across tasks such as inspection, obstacle avoidance, scanning, and dynamic tracking, the framework achieves a success rate of 80%, while in challenging mobile manipulation tasks, it reduces the distance to the target by 21.2% compared with baseline methods, demonstrating its effectiveness. USIM and U0 show that VLA models can be effectively applied to underwater robotic applications, providing a foundation for scalable dataset construction, improved task autonomy, and the practical realization of intelligent general underwater robots.
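The abstract describes the CAP module only at a high level: convolutional features are combined with attention so the model can focus on task-relevant regions of the scene. The internals are not specified here, so the following NumPy sketch is purely illustrative, not the authors' implementation. It assumes a per-frame feature map of shape (H, W, C), uses a 3×3 convolution with random stand-in weights for local structure, applies single-head self-attention over the H×W spatial positions, and adds a residual connection; all names and shapes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cap_block(feat, rng):
    """Hypothetical CAP-style block (illustrative only): a 3x3 convolution
    extracts local structure, then spatial self-attention re-weights
    positions to emphasize task-relevant regions. feat has shape (H, W, C)."""
    H, W, C = feat.shape
    # 3x3 "same" convolution; random weights stand in for learned ones
    k = rng.standard_normal((3, 3, C, C)) * 0.1
    padded = np.pad(feat, ((1, 1), (1, 1), (0, 0)))
    conv = np.zeros_like(feat)
    for i in range(H):
        for j in range(W):
            patch = padded[i:i+3, j:j+3, :]          # (3, 3, C) neighborhood
            conv[i, j] = np.einsum('hwc,hwcd->d', patch, k)
    # single-head self-attention over the H*W spatial tokens
    tokens = conv.reshape(H * W, C)
    Wq, Wk, Wv = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))
    q, kk, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ kk.T / np.sqrt(C), axis=-1)   # (HW, HW) attention map
    out = (attn @ v).reshape(H, W, C)
    return feat + out                                # residual connection

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 4))                   # toy 8x8 feature map
y = cap_block(x, rng)
print(y.shape)  # (8, 8, 4)
```

In a real model the convolution and projection weights would be learned end-to-end with the rest of the VLA backbone; the sketch only shows how local convolutional features and global spatial attention can be composed in one block.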