Realizing Immersive Volumetric Video: A Multimodal Framework for 6-DoF VR Engagement

📅 2026-04-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenge of constructing immersive volumetric video content from real-world footage that supports large-scale six-degree-of-freedom (6-DoF) interaction and high-fidelity audiovisual feedback. To this end, the authors introduce a novel format termed Immersive Volumetric Video (IVV), present ImViD, a multi-view, multimodal dataset, and develop a comprehensive generation pipeline incorporating a dynamic Gaussian light field representation, optical-flow-guided sparse initialization, joint camera temporal calibration, and multi-view sound field reconstruction. This work establishes the first formal definition of the IVV format and proposes the first sound field reconstruction method leveraging multi-view audiovisual data. The resulting system generates high-quality content at 5K resolution and 60 FPS with durations of 1–5 minutes, and it significantly outperforms existing approaches in virtual reality, offering strong 6-DoF interactivity and audiovisual immersion.

📝 Abstract
Fully immersive experiences that tightly integrate 6-DoF visual and auditory interaction are essential for virtual and augmented reality. While such experiences can be achieved through computer-generated content, constructing them directly from real-world captured videos remains largely unexplored. We introduce Immersive Volumetric Videos (IVV), a new volumetric media format designed to provide large 6-DoF interaction spaces, audiovisual feedback, and high-resolution, high-frame-rate dynamic content. To support IVV construction, we present ImViD, a multi-view, multi-modal dataset built upon a space-oriented capture philosophy. Our custom capture rig enables synchronized multi-view video-audio acquisition during motion, facilitating efficient capture of complex indoor and outdoor scenes with rich foreground-background interactions and challenging dynamics. The dataset provides 5K-resolution videos at 60 FPS with durations of 1–5 minutes, offering richer spatial, temporal, and multimodal coverage than existing benchmarks. Leveraging this dataset, we develop a dynamic light field reconstruction framework built upon a Gaussian-based spatio-temporal representation, incorporating flow-guided sparse initialization, joint camera temporal calibration, and multi-term spatio-temporal supervision for robust and accurate modeling of complex motion. We further propose, to our knowledge, the first method for sound field reconstruction from such multi-view audiovisual data. Together, these components form a unified pipeline for immersive volumetric video production. Extensive benchmarks and immersive VR experiments demonstrate that our pipeline generates high-quality, temporally stable audiovisual volumetric content with large 6-DoF interaction spaces. This work provides both a foundational definition and a practical construction methodology for immersive volumetric videos.
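To give a flavor of what a "Gaussian-based spatio-temporal representation" can mean, the sketch below models a single time-varying Gaussian primitive whose center drifts with a per-primitive velocity and whose opacity is gated by a temporal window. This is a minimal illustration under assumed simplifications (linear motion, Gaussian temporal weighting); the class name, fields, and formulas are hypothetical and do not reproduce the paper's actual representation or training pipeline.

```python
import numpy as np

class DynamicGaussian:
    """A single time-varying Gaussian primitive (illustrative sketch only).

    The center translates linearly with a per-primitive velocity, and the
    base opacity is modulated by a Gaussian window over time, so each
    primitive is most visible near its temporal center.
    """

    def __init__(self, mean, velocity, scale, opacity, t_center, t_sigma):
        self.mean = np.asarray(mean, dtype=float)          # 3D center at t = 0
        self.velocity = np.asarray(velocity, dtype=float)  # linear motion term
        self.scale = np.asarray(scale, dtype=float)        # per-axis extent
        self.opacity = float(opacity)                      # base opacity in [0, 1]
        self.t_center = float(t_center)                    # temporal center
        self.t_sigma = float(t_sigma)                      # temporal spread

    def position(self, t):
        """Center of the primitive at time t (linear motion model)."""
        return self.mean + self.velocity * t

    def temporal_opacity(self, t):
        """Base opacity weighted by a Gaussian window centered at t_center."""
        w = np.exp(-0.5 * ((t - self.t_center) / self.t_sigma) ** 2)
        return self.opacity * w

# Example: a primitive moving along +x, most visible around t = 0.5
g = DynamicGaussian(mean=[0, 0, 0], velocity=[1, 0, 0],
                    scale=[0.1, 0.1, 0.1], opacity=0.9,
                    t_center=0.5, t_sigma=0.2)
p = g.position(1.0)          # center has drifted to [1, 0, 0]
a = g.temporal_opacity(0.5)  # peak of the temporal window: 0.9
```

In a full dynamic-splatting system, millions of such primitives would be optimized against multi-view video frames; this sketch only shows how temporal dependence can be attached to per-primitive attributes.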
Problem

Research questions and friction points this paper is trying to address.

Immersive Volumetric Video
6-DoF
multimodal
audiovisual interaction
real-world capture
Innovation

Methods, ideas, or system contributions that make the work stand out.

Immersive Volumetric Video
6-DoF VR
Multimodal Capture
Dynamic Light Field Reconstruction
Sound Field Reconstruction
Zhengxian Yang
Tsinghua University, Beijing, China
Shengqi Wang
Tsinghua University, Beijing, China
Shi Pan
Tsinghua University, Beijing, China
Hongshuai Li
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Haoxiang Wang
Tsinghua University
Privacy, Mechanism Design
Lin Li
Huawei
Code Generation, NLP, Psychology
Guanjun Li
Institute of Automation, Chinese Academy of Sciences
Audio Processing, Audio-visual Learning
Zhengqi Wen
Tsinghua University
LLM
Borong Lin
School of Architecture, Tsinghua University, Beijing, China
Jianhua Tao
Department of Automation, Tsinghua University, Beijing, China
Tao Yu
Tsinghua University
Computer Vision, Computer Graphics, Deep Learning