M3-CVC: Controllable Video Compression with Multimodal Generative Models

📅 2024-11-24
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address weak controllability, poor generalization, and low semantic/perceptual fidelity in ultra-low-bitrate video compression, this paper proposes a controllable video compression framework integrating multimodal generative models. Methodologically: (1) we introduce the first semantic-motion co-guided keyframe selection strategy; (2) we embed a conversational large multimodal model (LMM) into the end-to-end codec pipeline to enable text-driven keyframe compression and semantically aligned reconstruction; (3) we design a text-conditioned diffusion model supporting hierarchical spatiotemporal modeling and interpretable reconstruction. Experiments demonstrate that, at ultra-low bitrates, our method improves semantic fidelity by 32.7% and reduces perceptual distortion (LPIPS) by 41.5% relative to VVC. Moreover, decoded content can be precisely controlled and reconstructed via natural-language prompts.

📝 Abstract
Traditional and neural video codecs commonly encounter limitations in controllability and generality under ultra-low-bitrate coding scenarios. To overcome these challenges, we propose M3-CVC, a controllable video compression framework incorporating multimodal generative models. The framework utilizes a semantic-motion composite strategy for keyframe selection to retain critical information. For each keyframe and its corresponding video clip, a dialogue-based large multimodal model (LMM) approach extracts hierarchical spatiotemporal details, enabling both inter-frame and intra-frame representations for improved video fidelity while enhancing encoding interpretability. M3-CVC further employs a conditional diffusion-based, text-guided keyframe compression method, achieving high fidelity in frame reconstruction. During decoding, textual descriptions derived from LMMs guide the diffusion process to restore the original video's content accurately. Experimental results demonstrate that M3-CVC significantly outperforms the state-of-the-art VVC standard in ultra-low bitrate scenarios, particularly in preserving semantic and perceptual fidelity.
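The semantic-motion composite keyframe selection described above can be illustrated with a toy sketch. This is not the paper's actual algorithm: here `motion` is the mean absolute inter-frame difference and `semantic` is a grayscale histogram distance (a crude stand-in for semantic change); the weighting scheme, threshold, and all function names are assumptions for illustration only.

```python
import numpy as np

def select_keyframes(frames, motion_weight=0.5, threshold=0.1):
    """Toy semantic-motion composite keyframe selection (illustrative only).

    Scores each frame against the last selected keyframe by combining a
    motion term (mean absolute pixel difference) with a crude semantic
    proxy (normalized histogram distance), and keeps frames whose
    composite score exceeds a threshold.
    """
    keyframes = [0]  # always retain the first frame
    for i in range(1, len(frames)):
        prev, cur = frames[keyframes[-1]], frames[i]
        motion = np.abs(cur - prev).mean()
        h_prev, _ = np.histogram(prev, bins=16, range=(0.0, 1.0), density=True)
        h_cur, _ = np.histogram(cur, bins=16, range=(0.0, 1.0), density=True)
        semantic = 0.5 * np.abs(h_cur - h_prev).sum() / 16
        score = motion_weight * motion + (1 - motion_weight) * semantic
        if score > threshold:
            keyframes.append(i)
    return keyframes

# Synthetic clip: 10 identical frames, then an abrupt scene change.
rng = np.random.default_rng(0)
static = rng.random((8, 8))
clip = [static] * 10 + [1.0 - static] * 5
print(select_keyframes(clip))  # → [0, 10]
```

In the full M3-CVC pipeline, each selected keyframe (plus its clip) would then be described by the LMM, and those text descriptions would condition the diffusion model at decode time; neither of those stages is sketched here.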
Problem

Research questions and friction points this paper is trying to address.

Video Compression
Information Preservation
Algorithm Flexibility
Innovation

Methods, ideas, or system contributions that make the work stand out.

M3-CVC
Multimodal Generative Models
Text-Guided Compression
Rui Wan
State Key Laboratory of Integrated Chips and Systems, Fudan University, Shanghai, China
Qi Zheng
State Key Laboratory of Integrated Chips and Systems, Fudan University, Shanghai, China
Yibo Fan
Professor, Fudan University
Video Coding, Image Processing, Processor, VLSI Design