Droplet3D: Commonsense Priors from Videos Facilitate 3D Generation

📅 2025-08-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
3D generation is hindered by the scarcity of high-quality 3D annotations. To address this, we leverage commonsense priors—such as spatial consistency and dynamic structural evolution—implicitly encoded in multi-view videos. We introduce Droplet3D-4M, the first large-scale multi-view video–3D paired dataset containing 4 million samples, and propose a generative model supporting joint image and dense textual inputs. Our method is the first to systematically integrate geometric and semantic priors from video into 3D generation, explicitly modeling 3D structural evolution under viewpoint transformations via multi-view geometric constraints and vision–language alignment. Experiments demonstrate significant improvements over state-of-the-art methods in text–3D alignment, geometric consistency, and detail fidelity, achieving new SOTA performance across multiple benchmarks. All data, code, and models are publicly released.

Technology Category

Application Category

📝 Abstract
Scaling laws have validated the success and promise of large-data-trained models in creative generation across text, image, and video domains. However, this paradigm faces data scarcity in the 3D domain, as there is far less of it available on the internet compared to the aforementioned modalities. Fortunately, there exist adequate videos that inherently contain commonsense priors, offering an alternative supervisory signal to mitigate the generalization bottleneck caused by limited native 3D data. On the one hand, videos capturing multiple views of an object or scene provide a spatial consistency prior for 3D generation. On the other hand, the rich semantic information contained within the videos enables the generated content to be more faithful to the text prompts and semantically plausible. This paper explores how to apply the video modality in 3D asset generation, spanning datasets to models. We introduce Droplet3D-4M, the first large-scale video dataset with multi-view level annotations, and train Droplet3D, a generative model supporting both image and dense text input. Extensive experiments validate the effectiveness of our approach, demonstrating its ability to produce spatially consistent and semantically plausible content. Moreover, in contrast to the prevailing 3D solutions, our approach exhibits the potential for extension to scene-level applications. This indicates that the commonsense priors from the videos significantly facilitate 3D creation. We have open-sourced all resources including the dataset, code, technical framework, and model weights: https://dropletx.github.io/.
Problem

Research questions and friction points this paper is trying to address.

Addressing 3D data scarcity using video commonsense priors
Generating spatially consistent 3D content from video supervision
Creating semantically plausible 3D assets from text prompts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses video commonsense priors for 3D generation
Introduces large-scale multi-view annotated video dataset
Trains generative model supporting image and text input
🔎 Similar Papers
No similar papers found.