🤖 AI Summary
This study addresses the challenge of constructing immersive volumetric video content from real-world footage that supports large-scale six-degree-of-freedom (6-DoF) interaction and high-fidelity audiovisual feedback. To this end, the authors introduce a new media format termed Immersive Volumetric Video (IVV), present ImViD, a multi-view, multimodal dataset, and develop a complete generation pipeline comprising a dynamic Gaussian light field representation, optical-flow-guided sparse initialization, joint camera temporal calibration, and multi-view sound field reconstruction. The work gives the first formal definition of the IVV format and proposes, to the authors' knowledge, the first sound field reconstruction method that leverages multi-view audiovisual data. The resulting pipeline produces 5K-resolution, 60 FPS content lasting 1–5 minutes and, in virtual reality evaluations, significantly outperforms existing approaches in 6-DoF interactivity and audiovisual immersion.
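The optical-flow-guided sparse initialization mentioned above is not detailed here; as an illustrative sketch only, one common pattern is to seed a sparse set of candidate points where flow magnitude is high, so dynamic regions receive reconstruction primitives first. The function name, threshold, and stride below are our own assumptions, not the paper's method; the flow field is a toy stand-in for one estimated between consecutive frames.

```python
import numpy as np

def flow_guided_seeds(flow, mag_thresh=1.0, stride=4):
    """Pick sparse seed pixels where optical-flow magnitude is high.

    flow: (H, W, 2) array of per-pixel (dx, dy) displacements.
    Returns an (N, 2) array of (x, y) seed coordinates, subsampled on a grid.
    """
    mag = np.linalg.norm(flow, axis=-1)          # per-pixel motion magnitude
    ys, xs = np.nonzero(mag > mag_thresh)        # dynamic pixels only
    keep = (ys % stride == 0) & (xs % stride == 0)  # sparsify on a regular grid
    return np.stack([xs[keep], ys[keep]], axis=-1)

# Toy flow field: motion confined to a central 32x32 patch.
flow = np.zeros((64, 64, 2), dtype=np.float32)
flow[16:48, 16:48] = (2.0, 0.0)
seeds = flow_guided_seeds(flow)
```

In a real pipeline the flow would come from an optical-flow estimator and the seeds would be lifted to 3D via multi-view geometry before initializing Gaussians.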
📝 Abstract
Fully immersive experiences that tightly integrate 6-DoF visual and auditory interaction are essential for virtual and augmented reality. While such experiences can be achieved with computer-generated content, constructing them directly from real-world captured videos remains largely unexplored. We introduce Immersive Volumetric Video (IVV), a new volumetric media format designed to provide large 6-DoF interaction spaces, audiovisual feedback, and high-resolution, high-frame-rate dynamic content. To support IVV construction, we present ImViD, a multi-view, multimodal dataset built on a space-oriented capture philosophy. Our custom capture rig enables synchronized multi-view video and audio acquisition during motion, facilitating efficient capture of complex indoor and outdoor scenes with rich foreground–background interactions and challenging dynamics. The dataset provides 5K-resolution videos at 60 FPS with durations of 1–5 minutes, offering richer spatial, temporal, and multimodal coverage than existing benchmarks. Leveraging this dataset, we develop a dynamic light field reconstruction framework built on a Gaussian-based spatio-temporal representation, incorporating flow-guided sparse initialization, joint camera temporal calibration, and multi-term spatio-temporal supervision for robust, accurate modeling of complex motion. We further propose, to our knowledge, the first method for sound field reconstruction from such multi-view audiovisual data. Together, these components form a unified pipeline for immersive volumetric video production. Extensive benchmarks and immersive VR experiments demonstrate that our pipeline generates high-quality, temporally stable audiovisual volumetric content with large 6-DoF interaction spaces. This work provides both a foundational definition and a practical construction methodology for immersive volumetric videos.
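The abstract does not specify how the sound field is reconstructed from multi-view recordings. As a hedged illustration of one standard building block (not the paper's method), the sketch below estimates the time difference of arrival (TDOA) between two microphones by cross-correlation; such per-pair delays are a typical input to spatializing or localizing sound sources across views. The signal, sample rate, and idealized delay are synthetic assumptions.

```python
import numpy as np

def estimate_delay(sig_a, sig_b, sr):
    """Estimate the time offset (seconds) of sig_b relative to sig_a
    via the peak of their full cross-correlation."""
    corr = np.correlate(sig_b, sig_a, mode="full")
    lag = int(np.argmax(corr)) - (len(sig_a) - 1)   # lag in samples
    return lag / sr

# Synthetic scenario: one second of noise, second mic delayed by 40 samples.
sr = 16000
rng = np.random.default_rng(0)
src = rng.standard_normal(sr)
mic_a = src
mic_b = np.roll(src, 40)  # idealized delayed copy (no attenuation or noise)
tdoa = estimate_delay(mic_a, mic_b, sr)
```

Real multi-view sound field reconstruction would additionally handle reverberation, gain differences, and many microphones, but pairwise delay estimation of this form is the usual starting point.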