🤖 AI Summary
This work addresses the fragmentation between 3D scene understanding and generation from multi-view images. We propose a unified "generation-empowered understanding" framework. Methodologically, we design a dual-module architecture in which a texture module and a geometry module operate synergistically: the texture module models spatiotemporal consistency to enable high-fidelity novel-view synthesis, while the geometry module incorporates explicit structural constraints to improve the accuracy of depth and surface-normal estimation; a two-stage training strategy enables bidirectional optimization. Our key innovation is the first use of generative tasks (e.g., rendering) as supervisory signals to enhance 3D understanding, reversing the conventional unidirectional paradigm. On the VSI-Bench benchmark, our method achieves a state-of-the-art score of 55.4 while significantly outperforming prior approaches on both novel-view synthesis and geometric estimation tasks.
📝 Abstract
This paper presents Omni-View, which extends unified multimodal understanding and generation to 3D scenes based on multi-view images, exploring the principle that "generation facilitates understanding". Consisting of an understanding model, a texture module, and a geometry module, Omni-View jointly models scene understanding, novel view synthesis, and geometry estimation, enabling synergistic interaction between 3D scene understanding and generation tasks. By design, it leverages the spatiotemporal modeling capabilities of its texture module, which is responsible for appearance synthesis, alongside the explicit geometric constraints provided by its dedicated geometry module, thereby enriching the model's holistic understanding of 3D scenes. Trained with a two-stage strategy, Omni-View achieves a state-of-the-art score of 55.4 on the VSI-Bench benchmark, outperforming existing specialized 3D understanding models while simultaneously delivering strong performance in both novel view synthesis and 3D scene generation.
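
To make the "generation facilitates understanding" idea concrete, below is a minimal PyTorch sketch of a shared scene encoder supervised jointly by a texture (novel-view rendering) head and a geometry (depth/normal) head, so that gradients from the generative loss flow back into the understanding backbone. All module names, tensor shapes, and loss weights here are illustrative assumptions, not the Omni-View implementation.

```python
# Minimal sketch: generation loss as an auxiliary supervisory signal for
# a shared understanding backbone. Shapes, heads, and weights are
# illustrative assumptions, not the paper's architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SceneEncoder(nn.Module):
    """Understanding backbone: fuses multi-view images into scene features."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, padding=1),
        )
    def forward(self, views):                      # (B, V, 3, H, W)
        b, v, c, h, w = views.shape
        feats = self.net(views.flatten(0, 1))      # encode each view
        return feats.view(b, v, -1, h, w).mean(1)  # fuse views by averaging

class TextureModule(nn.Module):
    """Appearance head: synthesizes a novel RGB view from scene features."""
    def __init__(self, dim=64):
        super().__init__()
        self.head = nn.Conv2d(dim, 3, 3, padding=1)
    def forward(self, feats):
        return torch.sigmoid(self.head(feats))     # RGB in [0, 1]

class GeometryModule(nn.Module):
    """Geometry head: predicts per-pixel depth (1 ch) and normals (3 ch)."""
    def __init__(self, dim=64):
        super().__init__()
        self.head = nn.Conv2d(dim, 4, 3, padding=1)
    def forward(self, feats):
        out = self.head(feats)
        depth = F.softplus(out[:, :1])             # strictly positive depth
        normals = F.normalize(out[:, 1:], dim=1)   # unit-length normals
        return depth, normals

encoder, texture, geometry = SceneEncoder(), TextureModule(), GeometryModule()
opt = torch.optim.Adam(
    [*encoder.parameters(), *texture.parameters(), *geometry.parameters()],
    lr=1e-4,
)

# One training step on dummy data: the rendering (generation) loss
# back-propagates through the shared encoder, supervising understanding.
views = torch.rand(2, 4, 3, 32, 32)       # B=2 scenes, V=4 input views
target_rgb = torch.rand(2, 3, 32, 32)     # held-out novel view (ground truth)
target_depth = torch.rand(2, 1, 32, 32)   # ground-truth depth

feats = encoder(views)
pred_rgb = texture(feats)
pred_depth, _ = geometry(feats)
loss = F.mse_loss(pred_rgb, target_rgb) + 0.5 * F.l1_loss(pred_depth, target_depth)
opt.zero_grad(); loss.backward(); opt.step()
```

The design point this sketch illustrates: because both heads share one encoder, minimizing the novel-view rendering loss shapes the same features used for depth and normal prediction, which is how a generative task can act as a supervisory signal for understanding. How Omni-View schedules the two objectives across its two training stages is specified in the paper, not here.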