🤖 AI Summary
This work proposes a zero-shot method for aligning two 3D meshes using only a text prompt that describes their spatial relation. At test time, the approach directly optimizes the relative pose (translation, rotation, and isotropic scale) using CLIP-driven gradients obtained through a differentiable renderer. These are combined with a soft-ICP variant that encourages surface attachment, a penetration loss that discourages interpenetration, a phased schedule that strengthens contact constraints over time, and camera control that concentrates optimization on the interaction region. By pairing vision-language supervision with geometry-aware objectives, the framework produces semantically faithful and physically plausible alignments without training a new model. On a newly curated benchmark covering diverse object categories and spatial relations, the method outperforms all compared baselines.
📝 Abstract
We study zero-shot 3D alignment of two given meshes, using a text prompt describing their spatial relation, an essential capability for content creation and scene assembly. Earlier approaches primarily rely on geometric alignment procedures, while recent work leverages pretrained 2D diffusion models to model language-conditioned object-object spatial relationships. In contrast, we directly optimize the relative pose at test time, updating translation, rotation, and isotropic scale with CLIP-driven gradients via a differentiable renderer, without training a new model. Our framework augments language supervision with geometry-aware objectives: a variant of the soft Iterative Closest Point (ICP) term to encourage surface attachment and a penetration loss to discourage interpenetration. A phased schedule strengthens contact constraints over time, and camera control concentrates the optimization on the interaction region. To enable evaluation, we curate a benchmark containing diverse categories and relations, and compare against baselines. Our method outperforms all alternatives, yielding semantically faithful and physically plausible alignments.
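The geometry-aware part of such an objective can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names, the softmin temperature, the axis-angle and log-scale pose parameterization, the use of a signed distance function for penetration, and the linear ramp are all assumptions made for illustration, and the CLIP-driven rendering term is omitted entirely.

```python
import numpy as np

def rodrigues(axis_angle):
    """Convert an axis-angle vector of shape (3,) to a 3x3 rotation matrix."""
    theta = np.linalg.norm(axis_angle)
    if theta < 1e-12:
        return np.eye(3)
    k = axis_angle / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def apply_pose(points, translation, axis_angle, log_scale):
    """Apply isotropic scale, then rotation, then translation to (N, 3) points.

    The log-scale parameterization (an assumption) keeps the scale positive.
    """
    R = rodrigues(np.asarray(axis_angle, dtype=float))
    return np.exp(log_scale) * (points @ R.T) + np.asarray(translation, dtype=float)

def soft_attachment_loss(src, tgt, temperature=0.05):
    """Soft nearest-neighbour term (hypothetical stand-in for a soft-ICP variant).

    Each source point is softly matched to target points with softmin weights
    over squared distances; the weighted squared distance is averaged.
    """
    d2 = ((src[:, None, :] - tgt[None, :, :]) ** 2).sum(axis=-1)  # (N, M)
    logits = -d2 / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    w = w / w.sum(axis=1, keepdims=True)
    return float((w * d2).sum(axis=1).mean())

def penetration_loss(src, sdf):
    """Penalize source points inside the target, i.e. with negative signed distance.

    `sdf` is an assumed callable mapping (N, 3) points to (N,) signed distances.
    """
    depth = np.maximum(-sdf(src), 0.0)
    return float(np.mean(depth ** 2))

def contact_weight(step, total_steps, w_max=1.0):
    """Phased schedule (assumed linear): contact weight ramps from 0 to w_max."""
    return w_max * min(1.0, step / max(1, total_steps - 1))

def unit_sphere_sdf(p):
    """Toy target: signed distance to a unit sphere centred at the origin."""
    return np.linalg.norm(p, axis=-1) - 1.0
```

In a full pipeline, a weighted sum of these terms, scaled by `contact_weight(step, total_steps)`, would be added to the language-driven loss and minimized over `translation`, `axis_angle`, and `log_scale` with an automatic-differentiation framework; NumPy is used here only to keep the sketch dependency-free.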