Stable Offline Hand-Eye Calibration for Any Robot with Just One Mark

📅 2025-11-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
In robotic imitation learning, inaccurate camera extrinsic calibration—particularly due to local minima, poor generalization, and reliance on multiple markers or online interaction—remains a critical challenge. To address this, we propose an offline hand-eye calibration method requiring only a single fiducial marker. Our approach innovatively integrates vision foundation models (VFMs) with geometric constraints: first, leveraging VFMs to localize the marker, combined with point tracking, end-effector 3D trajectory estimation, and temporal PnP for coarse extrinsic initialization; then refining the solution via differentiable rendering optimization. The method is training-free, hardware-agnostic, and exhibits strong robustness and cross-platform generalizability. Evaluated on three heterogeneous robotic platforms, it significantly outperforms state-of-the-art approaches. Moreover, it simultaneously generates high-quality auxiliary annotations—including dense depth maps and part-level segmentation masks—without additional supervision.
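The coarse initialization step — accumulating 2D mark detections and 3D end-effector positions over time, then solving a single Perspective-n-Point problem — can be sketched as follows. The paper does not specify its PnP solver; the DLT-based solver below is an illustrative stand-in, and all function and variable names are hypothetical:

```python
import numpy as np

def temporal_pnp(points_3d, points_2d, K):
    """Coarse extrinsic from 2D-3D pairs accumulated across frames (DLT sketch).

    points_3d: (N, 3) mark positions in the robot base frame (e.g., from FK).
    points_2d: (N, 2) tracked pixel locations of the mark.
    K: (3, 3) camera intrinsics.
    Returns R (3, 3), t (3,) mapping the robot frame into the camera frame.
    Requires N >= 6 non-coplanar points.
    """
    # Normalize pixels with the intrinsics so DLT solves for [R|t] directly.
    pts = np.hstack([points_2d, np.ones((len(points_2d), 1))])
    norm = (np.linalg.inv(K) @ pts.T).T  # rows ~ (x, y, 1)

    # Build the homogeneous system A vec(P) = 0 for the 3x4 pose matrix P.
    rows = []
    for (X, Y, Z), (x, y, _) in zip(points_3d, norm):
        Xh = np.array([X, Y, Z, 1.0])
        rows.append(np.concatenate([Xh, np.zeros(4), -x * Xh]))
        rows.append(np.concatenate([np.zeros(4), Xh, -y * Xh]))
    A = np.array(rows)

    # Least-squares solution: right singular vector of the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    P = Vt[-1].reshape(3, 4)

    # Fix the unknown scale and sign, then project onto SO(3).
    scale = np.mean(np.linalg.svd(P[:, :3], compute_uv=False))
    P /= scale
    if np.linalg.det(P[:, :3]) < 0:
        P = -P
    U, _, Vt2 = np.linalg.svd(P[:, :3])
    return U @ Vt2, P[:, 3]
```

With noise-free synthetic correspondences this recovers the ground-truth pose to numerical precision; with real tracked marks it only provides the coarse estimate that the rendering-based refinement then polishes.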

📝 Abstract
Imitation learning has achieved remarkable success in a variety of robotic tasks by learning a mapping function from camera-space observations to robot-space actions. Recent work indicates that the use of robot-to-camera transformation information (i.e., camera extrinsics) benefits the learning process and produces better results. However, camera extrinsics are oftentimes unavailable, and estimation methods usually suffer from local minima and poor generalization. In this paper, we present CalibAll, a simple yet effective method that requires only a single mark and performs training-free, stable, and accurate camera extrinsic estimation across diverse robots and datasets through a coarse-to-fine calibration pipeline. In particular, we annotate a single mark on an end-effector (EEF) and leverage the correspondence ability that emerges from vision foundation models (VFMs) to automatically localize the corresponding mark across robots in diverse datasets. Using this mark, together with point tracking and the 3D EEF trajectory, we obtain a coarse camera extrinsic via temporal Perspective-n-Point (PnP). This estimate is further refined through a rendering-based optimization that aligns rendered and ground-truth masks, yielding an accurate and stable camera extrinsic. Experimental results demonstrate that our method outperforms state-of-the-art approaches, showing strong robustness and general effectiveness across three robot platforms. It also produces useful auxiliary annotations such as depth maps, link-wise masks, and end-effector 2D trajectories, which can further support downstream tasks.
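The refinement stage described above aligns a rendered robot mask with a ground-truth mask. The paper uses differentiable rendering of the robot model; the sketch below substitutes a crude point-splat "renderer" and derivative-free coordinate descent over the translation only, purely to illustrate the mask-alignment objective. All names are hypothetical and this is a simplification, not the authors' implementation:

```python
import numpy as np

def render_point_mask(points, K, R, t, hw=(240, 320)):
    """Crude stand-in renderer: splat a 3D point cloud into a binary mask."""
    cam = (R @ points.T).T + t                       # robot frame -> camera frame
    uv = (K @ cam.T).T
    uv = np.round(uv[:, :2] / uv[:, 2:3]).astype(int)
    mask = np.zeros(hw, dtype=bool)
    ok = (cam[:, 2] > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < hw[1]) \
        & (uv[:, 1] >= 0) & (uv[:, 1] < hw[0])
    mask[uv[ok, 1], uv[ok, 0]] = True
    return mask

def mask_loss(pred, gt):
    """1 - IoU between the rendered mask and the ground-truth mask."""
    union = np.logical_or(pred, gt).sum()
    return 1.0 - np.logical_and(pred, gt).sum() / max(union, 1)

def refine_translation(points, K, R, t0, gt_mask, step=0.01, iters=40):
    """Derivative-free coordinate descent on the camera translation."""
    t = t0.copy()
    best = mask_loss(render_point_mask(points, K, R, t), gt_mask)
    for _ in range(iters):
        improved = False
        for axis in range(3):
            for delta in (step, -step):
                cand = t.copy()
                cand[axis] += delta
                loss = mask_loss(render_point_mask(points, K, R, cand), gt_mask)
                if loss < best:
                    best, t, improved = loss, cand, True
        if not improved:
            step *= 0.5   # shrink the search step once no axis move helps
    return t, best
```

A differentiable renderer would replace the discrete splatting and coordinate descent with gradient steps through the rasterizer, and would optimize the full 6-DoF pose rather than translation alone.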
Problem

Research questions and friction points this paper is trying to address.

Camera extrinsic estimation suffers from local minima and poor generalization
Existing methods require complex setups and lack training-free solutions
Accurate robot-to-camera transformations are often unavailable across diverse robotic platforms
Innovation

Methods, ideas, or system contributions that make the work stand out.

Single mark annotation for camera extrinsic estimation
Coarse-to-fine calibration using vision foundation models
Rendering-based optimization aligning rendered and ground-truth masks
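The first innovation — localizing the single annotated mark in new frames via VFM correspondence — reduces, at its core, to nearest-neighbor matching in a dense feature map. The sketch below assumes dense per-pixel features (e.g., DINO-style) are already extracted; the function name and shapes are hypothetical:

```python
import numpy as np

def localize_mark(ref_feat, target_feats):
    """Return (row, col) of the target pixel whose feature best matches the mark.

    ref_feat: (C,) feature vector at the annotated mark in the reference image.
    target_feats: (H, W, C) dense feature map of a new frame (e.g., from a VFM).
    """
    H, W, C = target_feats.shape
    flat = target_feats.reshape(-1, C)
    # Cosine similarity between the mark feature and every target pixel.
    sims = flat @ ref_feat / (
        np.linalg.norm(flat, axis=1) * np.linalg.norm(ref_feat) + 1e-8
    )
    return divmod(int(np.argmax(sims)), W)
```

Because VFM features are robust across viewpoints and embodiments, one annotation can transfer to unseen robots and datasets, which is what makes the single-mark setup practical.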
Sicheng Xie
Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University.
Lingchen Meng
Qwen Team, Alibaba Group; Fudan University
Large Multimodal Models
Zhiying Du
Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University.
Shuyuan Tu
Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University.
Haidong Cao
Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University.
Jiaqi Leng
University of California, Berkeley
Quantum Computation, Optimization, Scientific Computing
Zuxuan Wu
Fudan University
Yu-Gang Jiang
Professor, Fudan University. IEEE & IAPR Fellow
Video Analysis, Embodied AI, Trustworthy AI