Vid2Sid: Videos Can Help Close the Sim2Real Gap

📅 2026-02-22
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the challenge of accurately calibrating physics parameters in robotic simulators to real hardware when only external camera videos are available and direct force or state measurements are absent. To this end, the authors propose Vid2Sid, the first closed-loop optimization framework that integrates vision-language models (VLMs) into system identification. By analyzing paired simulation-to-reality videos, Vid2Sid leverages foundation vision models and VLMs to perform interpretable semantic reasoning, diagnosing physical mismatches and iteratively refining parameters such as friction, damping, and stiffness. Evaluated on both rigid-body (MuJoCo) and soft-body (PyElastica) systems, Vid2Sid achieves simulation-to-simulation parameter recovery errors below 13%, substantially outperforming baselines (28–98%), and demonstrates superior average performance on unseen sim-to-real control tasks compared to conventional black-box optimization approaches.

๐Ÿ“ Abstract
Calibrating a robot simulator's physics parameters (friction, damping, material stiffness) to match real hardware is often done by hand or with black-box optimizers that reduce error but cannot explain which physical discrepancies drive the error. When sensing is limited to external cameras, the problem is further compounded by perception noise and the absence of direct force or state measurements. We present Vid2Sid, a video-driven system identification pipeline that couples foundation-model perception with a VLM-in-the-loop optimizer that analyzes paired sim-real videos, diagnoses concrete mismatches, and proposes physics parameter updates with natural language rationales. We evaluate our approach on a tendon-actuated finger (rigid-body dynamics in MuJoCo) and a deformable continuum tentacle (soft-body dynamics in PyElastica). On sim2real holdout controls unseen during training, Vid2Sid achieves the best average rank across all settings, matching or exceeding black-box optimizers while uniquely providing interpretable reasoning at each iteration. Sim2sim validation confirms that Vid2Sid recovers ground-truth parameters most accurately (mean relative error under 13% vs. 28–98%), and ablation analysis reveals three calibration regimes: VLM-guided optimization excels when perception is clean and the simulator is expressive, while model-class limitations bound performance in more challenging settings.
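The abstract describes a closed loop: simulate with current parameters, compare the simulated rollout against the real observation, diagnose the mismatch, and propose a signed parameter update with a rationale. The sketch below illustrates that loop shape only; it is not the authors' pipeline. A toy 1-D damped oscillator stands in for the physics simulator, and a rule-based `diagnose` function stands in for the VLM (all names here are hypothetical).

```python
import numpy as np

def simulate(damping, steps=200, dt=0.05):
    """Toy 1-D damped oscillator standing in for the physics simulator."""
    x, v = 1.0, 0.0
    traj = []
    for _ in range(steps):
        a = -x - damping * v          # unit stiffness, unknown damping
        v += a * dt                   # semi-implicit Euler step
        x += v * dt
        traj.append(x)
    return np.array(traj)

def diagnose(sim_traj, real_traj):
    """Rule-based stand-in for the VLM: compare late-time decay envelopes
    and return a signed update direction plus a natural-language rationale."""
    sim_amp = np.abs(sim_traj[-50:]).mean()
    real_amp = np.abs(real_traj[-50:]).mean()
    if sim_amp > real_amp:            # simulation rings too long
        return +1, "simulated motion decays too slowly; increase damping"
    return -1, "simulated motion decays too quickly; decrease damping"

def calibrate(real_traj, damping=0.05, step=0.05, iters=40):
    """Closed-loop calibration: simulate, diagnose, update, repeat."""
    for _ in range(iters):
        sim_traj = simulate(damping)
        direction, rationale = diagnose(sim_traj, real_traj)
        damping = max(0.0, damping + direction * step)
        step *= 0.9                   # anneal the update size
    return damping

real = simulate(0.30)                 # "real hardware" with hidden damping 0.30
estimate = calibrate(real)
```

In the paper the diagnoser is a VLM reading paired sim-real videos rather than a hand-coded envelope comparison, and the parameter vector spans friction, damping, and stiffness rather than a single scalar, but the iterate-diagnose-update structure is the same.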
Problem

Research questions and friction points this paper is trying to address.

sim2real
system identification
physics parameter calibration
video-based perception
robot simulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

video-driven system identification
sim2real gap
vision-language model (VLM)
interpretable optimization
physics parameter calibration
Kevin Qiu (University of Warsaw, IDEAS NCBR)
Yu Zhang (EPFL)
Marek Cygan (University of Warsaw)
Josie Hughes (EPFL)