🤖 AI Summary
We tackle zero-shot 6D pose estimation for unseen objects, requiring no object-specific annotations or model fine-tuning. Our method is a training-free, cross-modal inference framework that couples a vision-language foundation model (CLIP) with a differentiable Signed Distance Function (SDF) geometric representation, yielding an end-to-end differentiable pose optimization pipeline that integrates semantic and geometric cues. Given only a single RGB image and a CAD model of the object, it directly estimates an accurate 6D pose with no supervised training. On standard benchmarks including NOCS and OmniObject3D, the method achieves state-of-the-art zero-shot performance at 20 FPS, and generalizes markedly better to unseen categories, arbitrary viewpoints, and heavily occluded scenes, all without task-specific adaptation or retraining.
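To give a feel for the geometric half of the pipeline, here is a minimal, hypothetical sketch of SDF-based differentiable pose refinement. It is not the paper's implementation: it uses an analytic sphere SDF instead of a CAD model, optimizes translation only (the actual method recovers full 6D pose), uses synthetic surface points in place of image-derived observations, and omits the CLIP semantic term entirely. All function names are illustrative.

```python
import numpy as np

def sdf_loss_and_grad(t, points, r=1.0):
    # Loss: mean squared signed distance of observed surface points,
    # expressed in the object frame (translation-only pose, for brevity).
    # The object model is a sphere of radius r centered at the origin,
    # so sdf(x) = ||x|| - r (a stand-in for a CAD-derived SDF).
    local = points - t                      # (N, 3) points in object frame
    d = np.linalg.norm(local, axis=-1)      # distance of each point to center
    sdf = d - r
    loss = np.mean(sdf ** 2)
    # Analytic gradient of the loss w.r.t. the translation t; the paper's
    # pipeline would obtain this via automatic differentiation instead.
    grad = np.mean(2.0 * sdf[:, None] * (-local / d[:, None]), axis=0)
    return loss, grad

def estimate_translation(points, t0, r=1.0, lr=0.5, steps=200):
    # Plain gradient descent standing in for the differentiable pose
    # optimization loop described in the summary.
    t = np.asarray(t0, dtype=float)
    for _ in range(steps):
        _, grad = sdf_loss_and_grad(t, points, r)
        t = t - lr * grad
    return t

# Synthetic observation: points on a unit sphere centered at t_true.
rng = np.random.default_rng(0)
dirs = rng.normal(size=(500, 3))
dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
t_true = np.array([0.3, -0.2, 2.0])
points = t_true + dirs
t_est = estimate_translation(points, t0=[0.0, 0.0, 1.0])
print(np.round(t_est, 3))
```

Driving the pose by the gradient of an SDF loss is what makes the pipeline end-to-end differentiable; in the full method the same loop would also carry a rotation parameterization and a CLIP-based semantic alignment term.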