4D-MoDe: Towards Editable and Scalable Volumetric Streaming via Motion-Decoupled 4D Gaussian Compression

📅 2025-09-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenges of large data volume, complex motion modeling, and poor editability in dynamic volumetric video streaming, this paper proposes a hierarchical, motion-decoupled 4D Gaussian representation. The scene is decomposed into static background and dynamic foreground components; temporal consistency is maintained via adaptive keyframe insertion, and the foreground is encoded for independent streaming. A multi-resolution motion estimation grid, a lightweight shared MLP, entropy-aware training, and joint range coding with KD-tree compression are combined to achieve rate-distortion optimization. Evaluated on multiple standard benchmarks, the method achieves an average storage cost of only 11.4 KB per frame while matching state-of-the-art reconstruction quality. Crucially, it enables fine-grained editing operations, such as background replacement, thereby significantly improving volumetric video editability, scalability, and streaming efficiency.
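The static/dynamic decomposition described above can be illustrated with a minimal sketch: Gaussians whose centers barely move over a lookahead window are treated as static background, the rest as dynamic foreground. The threshold and window length below are illustrative assumptions, not the paper's actual parameters, and the paper's lookahead-based decomposition is certainly more involved than this toy version.

```python
import numpy as np

def decompose_static_dynamic(positions, threshold=0.01, lookahead=5):
    """Split Gaussian centers into static/dynamic sets by lookahead motion.

    positions: (T, N, 3) array of Gaussian centers over T frames.
    A Gaussian whose maximum displacement from the reference frame within
    the lookahead window stays below `threshold` is treated as static
    background; the rest form the dynamic foreground.
    """
    ref = positions[0]                     # reference-frame centers
    window = positions[1:1 + lookahead]    # lookahead frames
    # max displacement of each Gaussian within the window
    disp = np.linalg.norm(window - ref, axis=-1).max(axis=0)
    static_mask = disp < threshold
    return static_mask, ~static_mask

# toy example: 4 Gaussians; the first two stay put, the last two drift
T, N = 6, 4
pos = np.zeros((T, N, 3))
pos[:, 2:] += np.linspace(0.0, 0.5, T)[:, None, None]  # drifting foreground
static, dynamic = decompose_static_dynamic(pos)
```

Only the dynamic set then needs per-frame motion coding; the static set can be sent once per keyframe group.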

📝 Abstract
Volumetric video has emerged as a key medium for immersive telepresence and augmented/virtual reality, enabling six-degrees-of-freedom (6DoF) navigation and realistic spatial interactions. However, delivering high-quality dynamic volumetric content at scale remains challenging due to massive data volume, complex motion, and limited editability of existing representations. In this paper, we present 4D-MoDe, a motion-decoupled 4D Gaussian compression framework designed for scalable and editable volumetric video streaming. Our method introduces a layered representation that explicitly separates static backgrounds from dynamic foregrounds using a lookahead-based motion decomposition strategy, significantly reducing temporal redundancy and enabling selective background/foreground streaming. To capture continuous motion trajectories, we employ a multi-resolution motion estimation grid and a lightweight shared MLP, complemented by a dynamic Gaussian compensation mechanism to model emergent content. An adaptive grouping scheme dynamically inserts background keyframes to balance temporal consistency and compression efficiency. Furthermore, an entropy-aware training pipeline jointly optimizes the motion fields and Gaussian parameters under a rate-distortion (RD) objective, while employing range-based and KD-tree compression to minimize storage overhead. Extensive experiments on multiple datasets demonstrate that 4D-MoDe consistently achieves competitive reconstruction quality with an order of magnitude lower storage cost (e.g., as low as 11.4 KB/frame) compared to state-of-the-art methods, while supporting practical applications such as background replacement and foreground-only streaming.
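The multi-resolution motion grid plus shared MLP mentioned in the abstract can be sketched as follows: features are interpolated from grids at several resolutions, concatenated, and mapped by one small shared network to a per-Gaussian offset. Everything here is an illustrative stand-in — the grid sizes, feature dimension, 2D (space, time) lookup, and random MLP weights are assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def bilerp(grid, x, t):
    """Bilinear lookup in an (H, W, C) feature grid at continuous (x, t) in [0, 1]."""
    H, W, _ = grid.shape
    fx, ft = x * (H - 1), t * (W - 1)
    x0, t0 = int(fx), int(ft)
    x1, t1 = min(x0 + 1, H - 1), min(t0 + 1, W - 1)
    wx, wt = fx - x0, ft - t0
    return ((1 - wx) * (1 - wt) * grid[x0, t0] + wx * (1 - wt) * grid[x1, t0]
            + (1 - wx) * wt * grid[x0, t1] + wx * wt * grid[x1, t1])

# three grid resolutions (coarse -> fine), each storing 4-d motion features
levels = [rng.normal(size=(r, r, 4)) for r in (4, 8, 16)]

# a tiny shared MLP (random weights, for illustration) mapping
# the concatenated 12-d feature to a 3-d positional offset
W1 = rng.normal(size=(12, 16)); b1 = np.zeros(16)
W2 = rng.normal(size=(16, 3));  b2 = np.zeros(3)

def motion_offset(x, t):
    feat = np.concatenate([bilerp(g, x, t) for g in levels])  # 3 levels x 4 dims
    h = np.maximum(feat @ W1 + b1, 0.0)                       # ReLU hidden layer
    return h @ W2 + b2                                        # predicted xyz offset

offset = motion_offset(0.3, 0.7)
```

Sharing one MLP across all Gaussians keeps the per-frame payload dominated by the (compressible) grid features rather than per-point parameters.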
Problem

Research questions and friction points this paper is trying to address.

Delivering high-quality dynamic volumetric content at scale remains challenging
Existing representations suffer from massive data volume and limited editability
Complex motion patterns create temporal redundancy in volumetric streaming
Innovation

Methods, ideas, or system contributions that make the work stand out.

Motion-decoupled 4D Gaussian compression separates static and dynamic content
Multi-resolution motion estimation with dynamic Gaussian compensation mechanism
Entropy-aware training optimizes motion fields and Gaussian parameters jointly
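The entropy-aware, rate-distortion-driven training named above follows the standard form L = D + λ·R. The sketch below is a toy illustration of that objective only: MSE distortion plus a bit estimate for quantized parameters under a fixed zero-mean Gaussian entropy model. The paper's learned entropy model and λ value are not specified here; both are assumptions.

```python
import numpy as np

def rd_loss(rendered, target, params, lam=1e-4, sigma=1.0):
    """Toy rate-distortion objective: distortion + lambda * estimated bits.

    Distortion is MSE between rendered and ground-truth frames. The rate
    term approximates the bits needed to code quantized Gaussian/motion
    parameters under a zero-mean Gaussian density (an illustrative
    stand-in for a learned entropy model).
    """
    distortion = np.mean((rendered - target) ** 2)
    q = np.round(params)  # simulated quantization
    # (unnormalized) likelihood of each quantized symbol under N(0, sigma^2)
    p = np.exp(-q ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
    p = np.clip(p, 1e-9, 1.0)
    rate_bits = -np.sum(np.log2(p))       # estimated total bits
    return distortion + lam * rate_bits

rng = np.random.default_rng(1)
loss = rd_loss(rng.normal(size=(8, 8)), np.zeros((8, 8)), rng.normal(size=100))
```

Raising `lam` trades reconstruction quality for a smaller bitstream, which is how a single framework can target different storage budgets.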
👥 Authors
Houqiang Zhong
Shanghai Jiao Tong University
Zihan Zheng
Ph.D. Candidate, Shanghai Jiao Tong University
artificial intelligence, deep learning, computer vision
Qiang Hu
Shanghai Jiao Tong University, Shanghai, 200240, China
Yuan Tian
Shanghai Artificial Intelligence Laboratory, Shanghai, 200232, China
Ning Cao
E-surfing Vision Technology Co., Ltd, Hangzhou, 311100, China
Lan Xu
ShanghaiTech University, Shanghai, 201210, China
Xiaoyun Zhang
Shanghai Jiao Tong University, Shanghai, 200240, China
Zhengxue Cheng
Assistant Researcher, Shanghai Jiao Tong University
Video and Image Coding, Computer Vision, Image Quality Assessment
Li Song
Professor of Electronic Engineering, Shanghai Jiao Tong University
Video Coding, Image Processing, Computer Vision
Wenjun Zhang
City University of Hong Kong
Thin film technology, nanomaterials and nanodevices