M3-CVC: Controllable Video Compression with Multimodal Generative Models

📅 2024-11-24
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address weak controllability, poor generalization, and low semantic/perceptual fidelity in ultra-low-bitrate video compression, this paper proposes a controllable video compression framework integrating multimodal generative models. Methodologically: (1) we introduce the first semantic-motion co-guided keyframe selection strategy; (2) we embed a conversational large multimodal model (LMM) into the end-to-end codec pipeline to enable text-driven keyframe compression and semantically aligned reconstruction; (3) we design a text-conditioned diffusion model supporting hierarchical spatiotemporal modeling and interpretable reconstruction. Experiments demonstrate that, at ultra-low bitrates, our method improves semantic fidelity by 32.7% and reduces perceptual distortion (LPIPS) by 41.5% relative to VVC. Moreover, decoded content can be precisely controlled and reconstructed via natural-language prompts.

📝 Abstract
Traditional and neural video codecs commonly encounter limitations in controllability and generality under ultra-low-bitrate coding scenarios. To overcome these challenges, we propose M3-CVC, a controllable video compression framework incorporating multimodal generative models. The framework utilizes a semantic-motion composite strategy for keyframe selection to retain critical information. For each keyframe and its corresponding video clip, a dialogue-based large multimodal model (LMM) approach extracts hierarchical spatiotemporal details, enabling both inter-frame and intra-frame representations for improved video fidelity while enhancing encoding interpretability. M3-CVC further employs a conditional diffusion-based, text-guided keyframe compression method, achieving high fidelity in frame reconstruction. During decoding, textual descriptions derived from LMMs guide the diffusion process to restore the original video's content accurately. Experimental results demonstrate that M3-CVC significantly outperforms the state-of-the-art VVC standard in ultra-low bitrate scenarios, particularly in preserving semantic and perceptual fidelity.
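The semantic-motion composite keyframe selection described above can be illustrated with a toy sketch. This is not the paper's actual algorithm: here `motion` is the mean absolute inter-frame difference and `semantic` is a grayscale histogram distance (a crude stand-in for semantic change); the weighting scheme, threshold, and all function names are assumptions for illustration only.

```python
import numpy as np

def select_keyframes(frames, motion_weight=0.5, threshold=0.1):
    """Toy semantic-motion composite keyframe selection (illustrative only).

    Scores each frame against the last selected keyframe by combining a
    motion term (mean absolute pixel difference) with a crude semantic
    proxy (normalized histogram distance), and keeps frames whose
    composite score exceeds a threshold.
    """
    keyframes = [0]  # always retain the first frame
    for i in range(1, len(frames)):
        prev, cur = frames[keyframes[-1]], frames[i]
        motion = np.abs(cur - prev).mean()
        h_prev, _ = np.histogram(prev, bins=16, range=(0.0, 1.0), density=True)
        h_cur, _ = np.histogram(cur, bins=16, range=(0.0, 1.0), density=True)
        semantic = 0.5 * np.abs(h_cur - h_prev).sum() / 16
        score = motion_weight * motion + (1 - motion_weight) * semantic
        if score > threshold:
            keyframes.append(i)
    return keyframes

# Synthetic clip: 10 identical frames, then an abrupt scene change.
rng = np.random.default_rng(0)
static = rng.random((8, 8))
clip = [static] * 10 + [1.0 - static] * 5
print(select_keyframes(clip))  # → [0, 10]
```

In the full M3-CVC pipeline, each selected keyframe (plus its clip) would then be described by the LMM, and those text descriptions would condition the diffusion model at decode time; neither of those stages is sketched here.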
Problem

Research questions and friction points this paper is trying to address.

Video Compression
Information Preservation
Algorithm Flexibility
Innovation

Methods, ideas, or system contributions that make the work stand out.

M3-CVC
Multimodal Generative Models
Text-Guided Compression
Rui Wan
State Key Laboratory of Integrated Chips and Systems, Fudan University, Shanghai, China
Qi Zheng
State Key Laboratory of Integrated Chips and Systems, Fudan University, Shanghai, China
Yibo Fan
Professor, Fudan University
Video Coding, Image Processing, Processor, VLSI Design