🤖 AI Summary
Existing methods for camera-controllable video generation from monocular input suffer from geometric inconsistencies and occlusion artifacts under extreme viewpoints, degrading 4D video quality. To address this, we propose a Depth Watertight Mesh, the first explicit representation that unifies the modeling of both visible and occluded regions. We further design a simulated masking strategy to mitigate the scarcity of paired multi-view training data, and introduce a lightweight LoRA-based video diffusion adapter for efficient spatiotemporal modeling. Our method significantly outperforms state-of-the-art approaches in physical consistency, extreme-view fidelity, and temporal coherence, achieving high-quality, camera-controllable free-viewpoint 4D video generation without requiring auxiliary sensor inputs.
📝 Abstract
Generating high-quality, camera-controllable videos from monocular input is a challenging task, particularly under extreme viewpoints. Existing methods often struggle with geometric inconsistencies and occlusion artifacts at region boundaries, leading to degraded visual quality. In this paper, we introduce EX-4D, a novel framework that addresses these challenges through a Depth Watertight Mesh representation. This representation serves as a robust geometric prior by explicitly modeling both visible and occluded regions, ensuring geometric consistency under extreme camera poses. To overcome the lack of paired multi-view datasets, we propose a simulated masking strategy that generates effective training data from monocular videos alone. Additionally, a lightweight LoRA-based video diffusion adapter is employed to synthesize high-quality, physically consistent, and temporally coherent videos. Extensive experiments demonstrate that EX-4D outperforms state-of-the-art methods in physical consistency and extreme-view quality, enabling practical 4D video generation.
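To make the Depth Watertight Mesh idea concrete, here is a minimal sketch of how such a mesh could be built from a single depth map: each pixel is back-projected into a front sheet, a constant-depth back sheet encloses the occluded space behind it, and border faces (omitted for brevity) would seal the surface. The function name, intrinsics handling, and exact face layout are our assumptions for illustration, not the paper's implementation.

```python
# Sketch: build a watertight mesh from one depth map (assumptions, not
# the paper's code). The front sheet carries visible geometry; the back
# sheet explicitly encloses occluded regions.
import numpy as np

def depth_to_watertight_mesh(depth, fx, fy, cx, cy, back_z=None):
    """Back-project a depth map into a triangle mesh and close it
    with a constant-depth back sheet."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Unproject each pixel with the pinhole camera model.
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    front = np.stack([x, y, depth], axis=-1).reshape(-1, 3)

    # Duplicate the grid at a far depth to form the enclosing back sheet.
    if back_z is None:
        back_z = float(depth.max()) * 1.5
    xb = (u - cx) * back_z / fx
    yb = (v - cy) * back_z / fy
    back = np.stack([xb, yb, np.full_like(depth, back_z)], axis=-1).reshape(-1, 3)
    verts = np.concatenate([front, back], axis=0)

    def grid_faces(offset):
        # Two triangles per pixel quad of one sheet.
        faces = []
        for r in range(h - 1):
            for c in range(w - 1):
                i = offset + r * w + c
                faces.append([i, i + 1, i + w])
                faces.append([i + 1, i + w + 1, i + w])
        return faces

    faces = grid_faces(0) + grid_faces(h * w)
    # Side faces stitching front and back sheets along the image border
    # (omitted here) would complete the watertight surface.
    return verts, np.asarray(faces)
```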
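The simulated masking strategy can be approximated by forward-warping a frame into a virtual camera and marking the uncovered pixels, which become the occluded regions the diffusion model learns to fill. This splat-style approximation is illustrative only; the paper's strategy may instead operate on renders of the Depth Watertight Mesh.

```python
# Sketch: derive an occlusion mask for a virtual camera by forward-
# warping source pixels (an approximation; names are hypothetical).
import numpy as np

def occlusion_mask(depth, K, R, t):
    """Return a boolean mask that is True where the virtual view
    (rotation R, translation t) receives no source pixel."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    ones = np.ones_like(depth)
    # Back-project pixels to camera space, then move to the new view.
    pix = np.stack([u, v, ones], axis=-1).reshape(-1, 3).T      # 3 x N
    cam = np.linalg.inv(K) @ pix * depth.reshape(1, -1)
    cam2 = R @ cam + t.reshape(3, 1)
    proj = K @ cam2
    # Assumes points stay in front of the camera; clip guards division.
    uv = (proj[:2] / np.clip(proj[2:], 1e-6, None)).round().astype(int)

    covered = np.zeros((h, w), dtype=bool)
    inside = (uv[0] >= 0) & (uv[0] < w) & (uv[1] >= 0) & (uv[1] < h)
    covered[uv[1, inside], uv[0, inside]] = True
    return ~covered
```

Training pairs can then be formed by pairing the masked warp with the original frame, which is how monocular videos alone can supervise novel-view synthesis.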
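Finally, a minimal PyTorch LoRA layer of the kind the abstract describes: a frozen base projection plus a trainable low-rank update, so only a small adapter is learned on top of the pretrained video diffusion backbone. The rank, scaling, and injection points are assumptions, not EX-4D's exact configuration.

```python
# Sketch: a standard LoRA linear layer (rank and alpha are assumed
# hyperparameters, not values from the paper).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # only the adapter is trained
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)       # adapter starts as identity
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))
```

Wrapping the attention projections of a pretrained backbone with layers like this keeps the trainable parameter count small, which is what makes the adapter lightweight.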