🤖 AI Summary
To address the challenges of cross-modal alignment in open-vocabulary settings and weak temporal consistency in 3D retrieval and 4D generation, this paper proposes Uni4D, a unified framework. Methodologically, it introduces a novel three-level structured alignment mechanism across text, 3D, and image modalities, combining 3D–text multi-head attention, multi-view geometric representations, and cross-modal contrastive learning for precise semantic matching; in addition, a temporally constrained generation module is designed to enforce dynamic consistency. Evaluated on the Align3D-130 dataset, Uni4D achieves significant improvements in Recall@K for 3D retrieval and, uniquely among existing methods, enables high-fidelity, fine-grained text-controllable, and temporally coherent 4D asset generation. This work addresses key bottlenecks in joint multimodal understanding and generation, establishing a new paradigm for industrial-scale multimodal content production.
📝 Abstract
We introduce Uni4D, a unified framework for large-scale open-vocabulary 3D retrieval and controllable 4D generation based on structured three-level alignment across text, 3D model, and image modalities. Built upon the Align3D-130 dataset, Uni4D employs a 3D–text multi-head attention and search model to optimize text-to-3D retrieval through improved semantic alignment. The framework further strengthens cross-modal alignment through three components: precise text-to-3D retrieval, multi-view 3D-to-image alignment, and image-to-text alignment for generating temporally consistent 4D assets. Experimental results demonstrate that Uni4D achieves high-quality 3D retrieval and controllable 4D generation, advancing dynamic multimodal understanding and practical applications.
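The paper does not publish its training objective, but the cross-modal contrastive learning it describes for text-to-3D retrieval is commonly realized as a symmetric InfoNCE loss over paired embeddings. The sketch below is a minimal NumPy illustration under that assumption; the function name `info_nce`, the temperature value, and the embedding shapes are illustrative, not from the paper.

```python
import numpy as np

def info_nce(text_emb, shape_emb, temperature=0.07):
    """Symmetric InfoNCE contrastive loss over paired text / 3D-shape embeddings.

    text_emb, shape_emb: (N, D) arrays; row i of each is a matched pair.
    Matched pairs are pulled together, all other in-batch pairs pushed apart.
    """
    # L2-normalize so the dot product is cosine similarity
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    s = shape_emb / np.linalg.norm(shape_emb, axis=1, keepdims=True)
    logits = t @ s.T / temperature  # (N, N) similarity matrix

    def xent(lg):
        # cross-entropy with the diagonal (matched pair) as the target class
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.diag(logp).mean()

    # average over both retrieval directions: text->3D and 3D->text
    return 0.5 * (xent(logits) + xent(logits.T))
```

With perfectly aligned pairs the loss approaches zero, while mismatched embeddings incur a penalty near log N, which is what drives the Recall@K improvements the summary reports.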