🤖 AI Summary
Existing methods for camera-controllable video generation from monocular input suffer from geometric inconsistencies and occlusion artifacts under extreme viewpoints, degrading 4D video quality. To address this, we propose a Depth Watertight Mesh, the first explicit representation that unifies the modeling of both visible and occluded regions. We further design a simulated masking strategy to mitigate the scarcity of paired multi-view training data, and introduce a lightweight LoRA-based video diffusion adapter for efficient spatiotemporal modeling. Our method significantly outperforms state-of-the-art approaches in physical consistency, extreme-view fidelity, and temporal coherence, achieving high-quality, camera-controllable free-viewpoint 4D video generation without requiring auxiliary sensor inputs.
📝 Abstract
Generating high-quality, camera-controllable videos from monocular input is a challenging task, particularly under extreme viewpoints. Existing methods often struggle with geometric inconsistencies and occlusion artifacts at region boundaries, leading to degraded visual quality. In this paper, we introduce EX-4D, a novel framework that addresses these challenges through a Depth Watertight Mesh representation. This representation serves as a robust geometric prior by explicitly modeling both visible and occluded regions, ensuring geometric consistency under extreme camera poses. To overcome the lack of paired multi-view datasets, we propose a simulated masking strategy that generates effective training data from monocular videos alone. Additionally, a lightweight LoRA-based video diffusion adapter is employed to synthesize high-quality, physically consistent, and temporally coherent videos. Extensive experiments demonstrate that EX-4D outperforms state-of-the-art methods in physical consistency and extreme-view quality, enabling practical 4D video generation.
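To make the Depth Watertight Mesh idea concrete, here is a minimal sketch of how such a mesh could be built from a single depth map: each pixel is back-projected into a front sheet, a constant-depth back sheet encloses the occluded space behind it, and border faces (omitted for brevity) would seal the surface. The function name, intrinsics handling, and exact face layout are our assumptions for illustration, not the paper's implementation.

```python
# Sketch: build a watertight mesh from one depth map (assumptions, not
# the paper's code). The front sheet carries visible geometry; the back
# sheet explicitly encloses occluded regions.
import numpy as np

def depth_to_watertight_mesh(depth, fx, fy, cx, cy, back_z=None):
    """Back-project a depth map into a triangle mesh and close it
    with a constant-depth back sheet."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Unproject each pixel with the pinhole camera model.
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    front = np.stack([x, y, depth], axis=-1).reshape(-1, 3)

    # Duplicate the grid at a far depth to form the enclosing back sheet.
    if back_z is None:
        back_z = float(depth.max()) * 1.5
    xb = (u - cx) * back_z / fx
    yb = (v - cy) * back_z / fy
    back = np.stack([xb, yb, np.full_like(depth, back_z)], axis=-1).reshape(-1, 3)
    verts = np.concatenate([front, back], axis=0)

    def grid_faces(offset):
        # Two triangles per pixel quad of one sheet.
        faces = []
        for r in range(h - 1):
            for c in range(w - 1):
                i = offset + r * w + c
                faces.append([i, i + 1, i + w])
                faces.append([i + 1, i + w + 1, i + w])
        return faces

    faces = grid_faces(0) + grid_faces(h * w)
    # Side faces stitching front and back sheets along the image border
    # (omitted here) would complete the watertight surface.
    return verts, np.asarray(faces)
```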
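The simulated masking strategy can be approximated by forward-warping a frame into a virtual camera and marking the uncovered pixels, which become the occluded regions the diffusion model learns to fill. This splat-style approximation is illustrative only; the paper's strategy may instead operate on renders of the Depth Watertight Mesh.

```python
# Sketch: derive an occlusion mask for a virtual camera by forward-
# warping source pixels (an approximation; names are hypothetical).
import numpy as np

def occlusion_mask(depth, K, R, t):
    """Return a boolean mask that is True where the virtual view
    (rotation R, translation t) receives no source pixel."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    ones = np.ones_like(depth)
    # Back-project pixels to camera space, then move to the new view.
    pix = np.stack([u, v, ones], axis=-1).reshape(-1, 3).T      # 3 x N
    cam = np.linalg.inv(K) @ pix * depth.reshape(1, -1)
    cam2 = R @ cam + t.reshape(3, 1)
    proj = K @ cam2
    # Assumes points stay in front of the camera; clip guards division.
    uv = (proj[:2] / np.clip(proj[2:], 1e-6, None)).round().astype(int)

    covered = np.zeros((h, w), dtype=bool)
    inside = (uv[0] >= 0) & (uv[0] < w) & (uv[1] >= 0) & (uv[1] < h)
    covered[uv[1, inside], uv[0, inside]] = True
    return ~covered
```

Training pairs can then be formed by pairing the masked warp with the original frame, which is how monocular videos alone can supervise novel-view synthesis.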
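Finally, a minimal PyTorch LoRA layer of the kind the abstract describes: a frozen base projection plus a trainable low-rank update, so only a small adapter is learned on top of the pretrained video diffusion backbone. The rank, scaling, and injection points are assumptions, not EX-4D's exact configuration.

```python
# Sketch: a standard LoRA linear layer (rank and alpha are assumed
# hyperparameters, not values from the paper).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)          # only the adapter is trained
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)       # adapter starts as identity
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))
```

Wrapping the attention projections of a pretrained backbone with layers like this keeps the trainable parameter count small, which is what makes the adapter lightweight.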