🤖 AI Summary
In robotic imitation learning, inaccurate camera extrinsic calibration remains a critical challenge: existing estimators suffer from local minima and poor generalization, or rely on multiple markers or online interaction. To address this, we propose an offline hand-eye calibration method that requires only a single fiducial marker. The approach combines vision foundation models (VFMs) with geometric constraints: VFMs first localize the marker, and, together with point tracking and the 3D end-effector trajectory, a temporal PnP solve provides a coarse extrinsic initialization, which is then refined via differentiable-rendering optimization. The method is training-free, hardware-agnostic, and exhibits strong robustness and cross-platform generalizability. Evaluated on three heterogeneous robotic platforms, it significantly outperforms state-of-the-art approaches while also generating high-quality auxiliary annotations, including dense depth maps and part-level segmentation masks, without additional supervision.
📝 Abstract
Imitation learning has achieved remarkable success in a variety of robotic tasks by learning a mapping from camera-space observations to robot-space actions. Recent work indicates that using robot-to-camera transformation information (i.e., camera extrinsics) benefits the learning process and produces better results. However, camera extrinsics are often unavailable, and estimation methods typically suffer from local minima and poor generalization. In this paper, we present CalibAll, a simple yet effective method that **requires only a single mark** and performs training-free, stable, and accurate camera extrinsic estimation across diverse robots and datasets through a coarse-to-fine calibration pipeline. In particular, we annotate a single mark on an end-effector (EEF) and leverage the correspondence ability that emerges in vision foundation models (VFMs) to automatically localize the corresponding mark across robots in diverse datasets. Using this mark, together with point tracking and the 3D EEF trajectory, we obtain coarse camera extrinsics via temporal Perspective-n-Point (PnP). This estimate is further refined through a rendering-based optimization that aligns rendered and ground-truth masks, yielding accurate and stable camera extrinsics. Experimental results demonstrate that our method outperforms state-of-the-art approaches, showing strong robustness and general effectiveness across three robot platforms. It also produces useful auxiliary annotations, such as depth maps, link-wise masks, and end-effector 2D trajectories, which can further support downstream tasks.
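The coarse stage above pools 2D marker observations tracked over many frames with the corresponding 3D EEF positions (available from forward kinematics in the robot base frame) and solves a single PnP problem for the base-to-camera transform. The sketch below is illustrative only, not the paper's implementation: it uses synthetic data in place of VFM-localized marker pixels and a plain calibrated DLT solve in place of whatever PnP solver (e.g., OpenCV's `solvePnP`) the authors use; all names and values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Synthetic stand-ins (assumed for illustration; the real pipeline uses
# VFM-localized marker pixels and forward-kinematics EEF positions) ---
K = np.array([[600.0,   0.0, 320.0],    # pinhole intrinsics, assumed known
              [  0.0, 600.0, 240.0],
              [  0.0,   0.0,   1.0]])

a = 0.3                                  # ground-truth extrinsics (base -> camera)
R_true = np.array([[np.cos(a), -np.sin(a), 0.0],
                   [np.sin(a),  np.cos(a), 0.0],
                   [0.0,        0.0,       1.0]])
t_true = np.array([0.1, -0.05, 1.5])

# 3D EEF positions over time (non-coplanar, robot base frame)
pts3d = rng.uniform(-0.3, 0.3, size=(20, 3))

# Project to pixels: the "marker track" a point tracker would supply
cam = (R_true @ pts3d.T).T + t_true
pix = (K @ cam.T).T
pix = pix[:, :2] / pix[:, 2:3]

def temporal_pnp(pts3d, pix, K):
    """Coarse [R|t] from accumulated 2D-3D correspondences via calibrated DLT."""
    # Normalize pixels with the known intrinsics
    uv1 = np.hstack([pix, np.ones((len(pix), 1))])
    norm = (np.linalg.inv(K) @ uv1.T).T          # rows: (u, v, 1), normalized
    # Each correspondence gives two linear equations in the 12 entries of [R|t]
    A = []
    for X, (u, v, _) in zip(pts3d, norm):
        A.append([*X, 1, 0, 0, 0, 0, *(-u * X), -u])
        A.append([0, 0, 0, 0, *X, 1, *(-v * X), -v])
    _, _, Vt = np.linalg.svd(np.asarray(A))
    P = Vt[-1].reshape(3, 4)                     # [R|t] up to scale and sign
    M, p4 = P[:, :3], P[:, 3]
    # Project M onto the closest rotation; the singular values give the scale
    U, S, Vt2 = np.linalg.svd(M)
    R = U @ Vt2
    t = p4 / S.mean()
    if np.linalg.det(R) < 0:                     # resolve the DLT sign ambiguity
        R, t = -R, -t
    return R, t

R, t = temporal_pnp(pts3d, pix, K)
```

With noiseless observations the recovered `R, t` match the ground truth to numerical precision; with real tracks this estimate would only be coarse, which is why the paper follows it with the rendering-based mask-alignment refinement.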