🤖 AI Summary
This work addresses the challenge of generating 3D compositional objects from a single image while preserving geometric detail in occluded regions and maintaining spatial consistency among the constituent parts. The authors propose a two-stage framework: first, high-quality individual 3D assets are synthesized under a unified 3D guidance scene; then, physically plausible composition is achieved through global-to-local geometric alignment and a differentiable Signed Distance Field (SDF) optimization that penalizes geometric interpenetration. A closed-loop, agentic refinement mechanism driven by a vision-language model (VLM) analyzes multi-view renderings of the composed scene, formulates corrective prompts, and iteratively fixes residual inconsistencies. This approach improves the geometric fidelity, spatial coherence, and physical plausibility of the generated composites, outperforming existing methods across multiple quantitative metrics.
📝 Abstract
Recent breakthroughs in 3D generation have enabled the synthesis of high-fidelity individual assets. However, generating 3D compositional objects from single images--particularly under occlusions--remains challenging. Existing methods often degrade geometric details in hidden regions and fail to preserve the underlying object-object spatial relationships (OOR). We present Interact3D, a novel framework designed to generate physically plausible interacting 3D compositional objects. Our approach first leverages advanced generative priors to curate high-quality individual assets within a unified 3D guidance scene. To physically compose these assets, we then introduce a robust two-stage composition pipeline. Based on the 3D guidance scene, the primary object is anchored through precise global-to-local geometric alignment (registration), while subsequent geometries are integrated using a differentiable Signed Distance Field (SDF)-based optimization that explicitly penalizes geometric intersections. To resolve challenging collisions, we further deploy a closed-loop, agentic refinement strategy: a Vision-Language Model (VLM) autonomously analyzes multi-view renderings of the composed scene, formulates targeted corrective prompts, and guides an image editing module to iteratively self-correct the generation pipeline. Extensive experiments demonstrate that Interact3D produces collision-aware compositions with improved geometric fidelity and consistent spatial relationships.
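To make the SDF-based composition step concrete, the following is a minimal illustrative sketch, not the authors' implementation: one object is anchored and a second is translated to minimize a differentiable intersection penalty that is positive only where sampled points lie inside both objects. The spheres, the sampling scheme, the finite-difference gradient, and all names here are hypothetical simplifications (the paper operates on SDFs of generated assets and optimizes their placement).

```python
# Illustrative sketch (assumption-laden): SDF-based collision penalty for
# composing two objects. Spheres stand in for real asset SDFs.
import numpy as np

def sphere_sdf(points, center, radius):
    """Analytic signed distance to a sphere: negative inside."""
    return np.linalg.norm(points - center, axis=-1) - radius

def collision_penalty(offset, samples, c_a, r_a, c_b, r_b):
    """Penalize sample points inside BOTH objects (interpenetration)."""
    d_a = sphere_sdf(samples, c_a, r_a)
    d_b = sphere_sdf(samples, c_b + offset, r_b)
    return (np.maximum(-d_a, 0.0) * np.maximum(-d_b, 0.0)).sum()

rng = np.random.default_rng(0)
samples = rng.uniform(-2.0, 2.0, size=(4096, 3))  # volume sample points
c_a, r_a = np.zeros(3), 1.0                       # anchored primary object
c_b, r_b = np.array([0.5, 0.0, 0.0]), 1.0         # initially intersecting

p0 = collision_penalty(np.zeros(3), samples, c_a, r_a, c_b, r_b)
offset, lr, eps = np.zeros(3), 0.02, 1e-3
for _ in range(300):  # finite-difference gradient descent on the offset
    grad = np.zeros(3)
    for i in range(3):
        e = np.zeros(3); e[i] = eps
        grad[i] = (collision_penalty(offset + e, samples, c_a, r_a, c_b, r_b)
                   - collision_penalty(offset - e, samples, c_a, r_a, c_b, r_b)) / (2 * eps)
    offset -= lr * grad
    if collision_penalty(offset, samples, c_a, r_a, c_b, r_b) < 1e-6:
        break

p_final = collision_penalty(offset, samples, c_a, r_a, c_b, r_b)
print(p0, p_final)  # penalty drops sharply as the objects separate
```

In the full pipeline this penalty would be one term alongside alignment objectives, so objects stay in contact rather than simply drifting apart; here only the collision term is shown for clarity.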