Vid2Sid: Videos Can Help Close the Sim2Real Gap

📅 2026-02-22
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the challenge of accurately calibrating physics parameters in robotic simulators to real hardware when only external camera videos are available and direct force or state measurements are absent. To this end, the authors propose Vid2Sid, the first closed-loop optimization framework that integrates vision-language models (VLMs) into system identification. By analyzing paired simulation-to-reality videos, Vid2Sid leverages foundation vision models and VLMs to perform interpretable semantic reasoning, diagnosing physical mismatches and iteratively refining parameters such as friction, damping, and stiffness. Evaluated on both rigid-body (MuJoCo) and soft-body (PyElastica) systems, Vid2Sid achieves simulation-to-simulation parameter recovery errors below 13%, substantially outperforming baselines (28–98%), and demonstrates superior average performance on unseen sim-to-real control tasks compared to conventional black-box optimization approaches.

๐Ÿ“ Abstract
Calibrating a robot simulator's physics parameters (friction, damping, material stiffness) to match real hardware is often done by hand or with black-box optimizers that reduce error but cannot explain which physical discrepancies drive the error. When sensing is limited to external cameras, the problem is further compounded by perception noise and the absence of direct force or state measurements. We present Vid2Sid, a video-driven system identification pipeline that couples foundation-model perception with a VLM-in-the-loop optimizer that analyzes paired sim-real videos, diagnoses concrete mismatches, and proposes physics parameter updates with natural language rationales. We evaluate our approach on a tendon-actuated finger (rigid-body dynamics in MuJoCo) and a deformable continuum tentacle (soft-body dynamics in PyElastica). On sim2real holdout controls unseen during training, Vid2Sid achieves the best average rank across all settings, matching or exceeding black-box optimizers while uniquely providing interpretable reasoning at each iteration. Sim2sim validation confirms that Vid2Sid recovers ground-truth parameters most accurately (mean relative error under 13% vs. 28–98%), and ablation analysis reveals three calibration regimes: VLM-guided optimization excels when perception is clean and the simulator is expressive, while model-class limitations bound performance in more challenging settings.
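The abstract describes a closed loop: simulate with current parameters, compare the simulated rollout against the real observation, diagnose the mismatch, and propose a signed parameter update with a rationale. The sketch below illustrates that loop shape only; it is not the authors' pipeline. A toy 1-D damped oscillator stands in for the physics simulator, and a rule-based `diagnose` function stands in for the VLM (all names here are hypothetical).

```python
import numpy as np

def simulate(damping, steps=200, dt=0.05):
    """Toy 1-D damped oscillator standing in for the physics simulator."""
    x, v = 1.0, 0.0
    traj = []
    for _ in range(steps):
        a = -x - damping * v          # unit stiffness, unknown damping
        v += a * dt                   # semi-implicit Euler step
        x += v * dt
        traj.append(x)
    return np.array(traj)

def diagnose(sim_traj, real_traj):
    """Rule-based stand-in for the VLM: compare late-time decay envelopes
    and return a signed update direction plus a natural-language rationale."""
    sim_amp = np.abs(sim_traj[-50:]).mean()
    real_amp = np.abs(real_traj[-50:]).mean()
    if sim_amp > real_amp:            # simulation rings too long
        return +1, "simulated motion decays too slowly; increase damping"
    return -1, "simulated motion decays too quickly; decrease damping"

def calibrate(real_traj, damping=0.05, step=0.05, iters=40):
    """Closed-loop calibration: simulate, diagnose, update, repeat."""
    for _ in range(iters):
        sim_traj = simulate(damping)
        direction, rationale = diagnose(sim_traj, real_traj)
        damping = max(0.0, damping + direction * step)
        step *= 0.9                   # anneal the update size
    return damping

real = simulate(0.30)                 # "real hardware" with hidden damping 0.30
estimate = calibrate(real)
```

In the paper the diagnoser is a VLM reading paired sim-real videos rather than a hand-coded envelope comparison, and the parameter vector spans friction, damping, and stiffness rather than a single scalar, but the iterate-diagnose-update structure is the same.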
Problem

Research questions and friction points this paper is trying to address.

sim2real
system identification
physics parameter calibration
video-based perception
robot simulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

video-driven system identification
sim2real gap
vision-language model (VLM)
interpretable optimization
physics parameter calibration
Kevin Qiu (University of Warsaw, IDEAS NCBR)
Yu Zhang (EPFL)
Marek Cygan (University of Warsaw)
Josie Hughes (EPFL)