🤖 AI Summary
This work addresses the challenge of system identification and simulation calibration for soft underwater robots, which is hindered by strong nonlinear fluid–structure interactions and the sim-to-real gap. The authors propose the first use of a vision-language model (VLM) for underwater robot system identification, enabling end-to-end calibration of a 16-parameter fish-like robot simulator by directly comparing real swimming videos with simulation outputs—without requiring hand-designed search strategies. By integrating backtracking line search to improve parameter update acceptance rates, the method facilitates zero-shot transfer of reinforcement learning policies from simulation to the physical robot. After calibration, the mean absolute error (MAE) in swimming speed drops to 7.4 mm/s, a 43% improvement over the next-best method, with consistent convergence across five trials. Downstream RL policies achieve 12% and 90% greater swimming distances on the real robot compared to BayesOpt and CMA-ES baselines, respectively.
📝 Abstract
We present Swim2Real, a pipeline that calibrates a 16-parameter robotic fish simulator from swimming videos using vision-language model (VLM) feedback, requiring no hand-designed search stages. Calibrating soft aquatic robots is particularly challenging: nonlinear fluid–structure coupling makes the parameter landscape chaotic, simplified fluid models introduce a persistent sim-to-real gap, and controlled aquatic experiments are difficult to reproduce. Prior work on this platform required three manually tailored stages to handle this complexity. In Swim2Real, the VLM compares simulated and real videos and proposes parameter updates; a backtracking line search then validates each step size, tripling the acceptance rate from 14% to 42% by recovering proposals whose direction is correct but whose magnitude is too large. Swim2Real calibrates all 16 parameters simultaneously and most closely matches real fish velocities across all motor frequencies (MAE = 7.4 mm/s, 43% lower than the next-best method), with zero outlier seeds across five runs. Downstream RL policies trained in the calibrated simulator swim 12% farther than those from BayesOpt-calibrated simulators and 90% farther than CMA-ES. Motor commands from the trained policy transfer to the physical fish at 50 Hz, completing the pipeline from swimming video to real-world deployment. These results demonstrate that VLM-guided calibration can close the sim-to-real gap for aquatic robots directly from video, enabling zero-shot RL transfer to physical swimmers without manual system identification, a step toward automated, general-purpose simulator tuning for underwater robotics.
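The accept/shrink logic behind the backtracking line search can be sketched as follows. This is a minimal illustration, not the paper's implementation: `loss_fn` is a hypothetical stand-in for the simulator-vs-video discrepancy measure, and the shrink factor and retry count are assumptions.

```python
import numpy as np

def backtracking_accept(loss_fn, params, proposal, shrink=0.5, max_tries=4):
    """Validate a proposed parameter update with a backtracking line search.

    If the full step does not reduce the loss, repeatedly shrink the step
    size along the proposed direction; this recovers proposals whose
    direction is right but whose magnitude is too large.
    """
    base_loss = loss_fn(params)
    direction = proposal - params
    step = 1.0
    for _ in range(max_tries):
        candidate = params + step * direction
        if loss_fn(candidate) < base_loss:
            return candidate, True   # accept the (possibly shortened) step
        step *= shrink               # shrink the step and retry
    return params, False             # reject: keep current parameters

# Toy usage: quadratic loss with a deliberately overshooting proposal.
loss = lambda p: float(np.sum((p - 1.0) ** 2))
params = np.zeros(3)
proposal = np.full(3, 4.0)  # correct direction (toward 1), magnitude too large
new_params, accepted = backtracking_accept(loss, params, proposal)
```

Here the full step overshoots the optimum and increases the loss, but two rounds of halving land on a step that improves it, so the proposal is accepted rather than discarded, which is exactly the mechanism the abstract credits for tripling the acceptance rate.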