Drag4D: Align Your Motion with Text-Driven 3D Scene Generation

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two challenges in text-driven 3D scene generation: user-defined object motion control and spatiotemporal misalignment. It proposes a "3D copy-paste" framework that combines physics-aware object position learning with a multi-view-consistent motion diffusion model, enabling, for the first time, precise spatiotemporal alignment of user-specified 3D drag trajectories with high-fidelity 3D scenes. Technically, the approach unifies 2D Gaussian Splatting rendering, panoramic scene completion, plug-and-play image-to-3D priors, multi-view video diffusion, and projection-based trajectory conditioning. While preserving fidelity to the text prompt, the method improves motion plausibility and cross-view consistency, effectively suppressing motion hallucination, and supports interactive, high-quality integration of dynamic objects into 3D scenes.
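
The first-stage components named above (panoramic scene completion feeding 2D Gaussian Splatting) depend on turning a panorama into perspective training views. The paper's exact procedure is not reproduced here; the sketch below shows one common way to resample an equirectangular panorama into a pinhole view, with all names and conventions assumed rather than taken from the paper:

```python
import numpy as np

def pano_to_perspective(pano, yaw, pitch, fov_deg, out_hw):
    """Resample an equirectangular panorama into a pinhole perspective view.

    pano: (H, W, 3) equirectangular image; yaw/pitch in radians.
    """
    H, W = pano.shape[:2]
    h, w = out_hw
    f = 0.5 * w / np.tan(0.5 * np.radians(fov_deg))  # focal length in pixels
    # Pixel grid -> unit camera rays (pinhole model).
    xs, ys = np.meshgrid(np.arange(w) - w / 2 + 0.5,
                         np.arange(h) - h / 2 + 0.5)
    dirs = np.stack([xs, ys, np.full_like(xs, f)], axis=-1)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    # Rotate rays by pitch (around x), then yaw (around y).
    cy, sy, cp, sp = np.cos(yaw), np.sin(yaw), np.cos(pitch), np.sin(pitch)
    Rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    dirs = dirs @ (Ry @ Rx).T
    # Rays -> equirectangular lookup coordinates.
    lon = np.arctan2(dirs[..., 0], dirs[..., 2])        # [-pi, pi]
    lat = np.arcsin(np.clip(dirs[..., 1], -1.0, 1.0))   # [-pi/2, pi/2]
    u = ((lon / (2 * np.pi) + 0.5) * W).astype(int) % W
    v = np.clip(((lat / np.pi + 0.5) * H).astype(int), 0, H - 1)
    return pano[v, u]
```

Views sampled this way (plus inpainted novel views, per the abstract) could then serve as the training images for the 2D Gaussian Splatting background.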

📝 Abstract
We introduce Drag4D, an interactive framework that integrates object motion control within text-driven 3D scene generation. This framework enables users to define 3D trajectories for 3D objects generated from a single image, seamlessly integrating them into a high-quality 3D background. Our Drag4D pipeline consists of three stages. First, we enhance text-to-3D background generation by applying 2D Gaussian Splatting with panoramic images and inpainted novel views, resulting in dense and visually complete 3D reconstructions. In the second stage, given a reference image of the target object, we introduce a 3D copy-and-paste approach: the target instance is extracted as a full 3D mesh using an off-the-shelf image-to-3D model and seamlessly composited into the generated 3D scene. The object mesh is then positioned within the 3D scene via our physics-aware object position learning, ensuring precise spatial alignment. Lastly, the spatially aligned object is temporally animated along a user-defined 3D trajectory. To mitigate motion hallucination and ensure view-consistent temporal alignment, we develop a part-augmented, motion-conditioned video diffusion model that processes multiview image pairs together with their projected 2D trajectories. We demonstrate the effectiveness of our unified architecture through evaluations at each stage and in the final results, showcasing the harmonized alignment of user-controlled object motion within a high-quality 3D background.
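
The "projected 2D trajectories" in the final stage come from projecting the user's 3D trajectory into each camera's image plane. A minimal sketch of that projection step, assuming a standard pinhole model; the camera values below are illustrative placeholders, not the paper's:

```python
import numpy as np

def project_trajectory(points_3d, K, w2c):
    """Project world-space trajectory waypoints into one camera's image.

    points_3d: (n, 3) waypoints; K: (3, 3) intrinsics;
    w2c: (4, 4) world-to-camera extrinsics. Returns (n, 2) pixel coords.
    """
    n = points_3d.shape[0]
    homog = np.concatenate([points_3d, np.ones((n, 1))], axis=1)  # (n, 4)
    cam = (w2c @ homog.T).T[:, :3]    # transform into camera space
    uv = (K @ cam.T).T                # apply intrinsics
    return uv[:, :2] / uv[:, 2:3]     # perspective divide -> pixels

# Example: a straight drag path seen by two hypothetical identity cameras.
traj_3d = np.linspace([0.0, 0.0, 2.0], [1.0, 0.5, 4.0], num=16)
K = np.array([[500.0, 0, 256], [0, 500.0, 256], [0, 0, 1]])
cameras = [(K, np.eye(4)) for _ in range(2)]
trajs_2d = [project_trajectory(traj_3d, Ki, w2c) for Ki, w2c in cameras]
```

Each per-view 2D trajectory, paired with that view's rendered image, is the kind of multiview conditioning pair the abstract describes feeding into the motion-conditioned video diffusion model.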
Problem

Research questions and friction points this paper is trying to address.

Enabling interactive 3D trajectory control for generated objects
Integrating user-defined object motion into text-driven 3D scene generation
Ensuring view-consistent temporal alignment and preventing motion hallucination
Innovation

Methods, ideas, or system contributions that make the work stand out.

Interactive framework for text-driven 3D scene generation
3D copy-paste method with physics-aware object positioning (sketched below)
Part-augmented motion-conditioned video diffusion model
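
The paper's physics-aware object position learning is not spelled out in this summary, so the sketch below shows one plausible formulation under simplifying assumptions: gradient descent on a learnable translation, with hypothetical penalties that keep the object mesh resting on, rather than floating above or sinking into, a flat scene surface at height ground_y:

```python
import torch

def placement_loss(verts, ground_y):
    # Hypothetical objective: the lowest vertex should touch the surface,
    # penalizing both penetration (below) and floating (above).
    lowest = verts[:, 1].min()
    return torch.relu(ground_y - lowest) + torch.relu(lowest - ground_y)

verts = torch.randn(1000, 3)               # stand-in for the object mesh
t = torch.zeros(3, requires_grad=True)     # learnable 3D translation
opt = torch.optim.Adam([t], lr=1e-2)
ground_y = torch.tensor(0.0)               # assumed flat ground plane
for _ in range(200):
    opt.zero_grad()
    loss = placement_loss(verts + t, ground_y)
    loss.backward()
    opt.step()
```

In the full method, the contact signal would presumably come from the reconstructed 3D scene geometry rather than a single flat plane.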