Copy-Transform-Paste: Zero-Shot Object-Object Alignment Guided by Vision-Language and Geometric Constraints

📅 2026-01-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work proposes a zero-shot method for aligning two 3D meshes using only a text prompt that describes their spatial relationship. At test time, the approach directly optimizes the relative pose (translation, rotation, and isotropic scale) using semantic gradients from CLIP passed through a differentiable renderer. These are combined with a soft ICP variant that encourages surface attachment, a penalty on interpenetration, a phased schedule that strengthens the contact constraints over time, and camera control that focuses on the interaction region. The framework pairs vision-language supervision with geometry-aware objectives to produce semantically faithful and physically plausible alignments without training a new model. On a newly curated benchmark covering diverse object categories and spatial relations, the method outperforms the compared baselines.
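The staged scheduling of geometric constraints can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function name `phased_weights`, the `warmup_frac` parameter, and the linear ramp are hypothetical, and the summary does not state the actual schedule.

```python
def phased_weights(step, total_steps, warmup_frac=0.3, max_icp=1.0, max_pen=1.0):
    """Return hypothetical loss weights (w_clip, w_icp, w_pen) for one step.

    Assumed schedule, not taken from the paper: language guidance only during
    a warm-up phase, then a linear ramp of the two contact-related terms.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if not 0.0 <= warmup_frac < 1.0:
        raise ValueError("warmup_frac must be in [0, 1)")

    # Fraction of the optimization completed, clamped to [0, 1].
    progress = min(max(step / total_steps, 0.0), 1.0)

    if progress < warmup_frac:
        # Phase 1: geometric terms are switched off.
        ramp = 0.0
    else:
        # Phase 2: contact constraints grow linearly until the final step.
        ramp = (progress - warmup_frac) / (1.0 - warmup_frac)

    return 1.0, max_icp * ramp, max_pen * ramp


if __name__ == "__main__":
    for step in (0, 30, 65, 100):
        print(step, phased_weights(step, 100))
```

Keeping the language term active throughout while delaying the geometric terms lets the coarse semantic arrangement settle before contact is enforced; whether the method uses this exact ordering is not stated in the summary.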

📝 Abstract

We study zero-shot 3D alignment of two given meshes, using a text prompt describing their spatial relation, an essential capability for content creation and scene assembly. Earlier approaches primarily rely on geometric alignment procedures, while recent work leverages pretrained 2D diffusion models to model language-conditioned object-object spatial relationships. In contrast, we directly optimize the relative pose at test time, updating translation, rotation, and isotropic scale with CLIP-driven gradients via a differentiable renderer, without training a new model. Our framework augments language supervision with geometry-aware objectives: a variant of the soft Iterative Closest Point (ICP) term to encourage surface attachment and a penetration loss to discourage interpenetration. A phased schedule strengthens contact constraints over time, and camera control concentrates the optimization on the interaction region. To enable evaluation, we curate a benchmark containing diverse categories and relations, and compare against baselines. Our method outperforms all alternatives, yielding semantically faithful and physically plausible alignments.
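The two geometry-aware objectives can be illustrated on point samples with a minimal, hedged sketch. This is not the authors' implementation: the function names, the temperature `tau`, the yaw-only rotation, and the use of a sphere as a stand-in for the second object's signed distance are all assumptions chosen to keep the example self-contained. The CLIP term is omitted because it requires a differentiable renderer.

```python
import math


def transform(points, scale, yaw, translation):
    """Apply isotropic scale, a rotation about z, then a translation.

    The yaw-only rotation is a toy stand-in for a full 3D rotation.
    """
    c, s = math.cos(yaw), math.sin(yaw)
    tx, ty, tz = translation
    return [(scale * (c * x - s * y) + tx,
             scale * (s * x + c * y) + ty,
             scale * z + tz) for x, y, z in points]


def soft_icp_loss(src, dst, tau=0.05):
    """Mean soft-min squared distance from each src point to the dst set.

    Each src point is softly matched to all dst points with softmax weights
    over -d^2 / tau, instead of a hard nearest-neighbour assignment.
    """
    total = 0.0
    for p in src:
        d2 = [sum((a - b) ** 2 for a, b in zip(p, q)) for q in dst]
        m = min(d2)
        # Shift by the minimum before exponentiating for numerical stability.
        w = [math.exp(-(d - m) / tau) for d in d2]
        total += sum(wi * di for wi, di in zip(w, d2)) / sum(w)
    return total / len(src)


def sphere_penetration_loss(points, center, radius):
    """Mean squared penetration depth of points that lie inside a sphere."""
    total = 0.0
    for p in points:
        depth = radius - math.dist(p, center)  # positive only when inside
        total += max(depth, 0.0) ** 2
    return total / len(points)


if __name__ == "__main__":
    moving = transform([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], 1.0, 0.0, (0.0, 0.0, 0.5))
    target = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    print(soft_icp_loss(moving, target))
    print(sphere_penetration_loss(moving, (0.0, 0.0, 0.0), 1.0))
```

In a full pipeline these terms would be written in an autodiff framework and added, with weights, to the language-driven loss so that gradients flow back to the translation, rotation, and scale parameters.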

Problem

Research questions and friction points this paper is trying to address.

zero-shot alignment

3D object alignment

vision-language guidance

geometric constraints

object-object spatial relationship

Innovation

Methods, ideas, or system contributions that make the work stand out.

zero-shot 3D alignment

vision-language guidance

geometric constraints