Compression Beyond Pixels: Semantic Compression with Multimodal Foundation Models

📅 2025-09-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing lossy image compression methods prioritize pixel-level reconstruction, failing to meet emerging demands for semantic fidelity and robustness across diverse data distributions and downstream tasks. Method: We propose a semantics-first compression paradigm that abandons pixel reconstruction entirely. Leveraging multimodal foundation models (e.g., CLIP), we extract high-level semantic features and compress their embeddings via a lightweight encoder at ultra-low bitrates (~2–3×10⁻³ bpp), integrating quantization and entropy coding for efficient representation. The method requires no task-specific fine-tuning and enables zero-shot generalization. Results: Under extreme compression—using less than 5% of the bitrate of conventional methods—our approach achieves superior semantic consistency, cross-distribution robustness, and downstream performance (e.g., classification, retrieval) compared to state-of-the-art methods. It establishes a scalable, semantics-driven framework for visual compression.
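The pipeline described above (extract a semantic embedding, quantize it, entropy-code the result) can be sketched in miniature. This is not the paper's encoder; it is a toy illustration, assuming a 512-dimensional embedding (standard for CLIP ViT-B/32), uniform scalar quantization, and a Shannon-entropy estimate as a lower bound on the entropy coder's output:

```python
import math
from collections import Counter

def quantize(embedding, n_levels=16):
    """Uniformly quantize each embedding dimension to n_levels integer bins."""
    lo, hi = min(embedding), max(embedding)
    step = (hi - lo) / (n_levels - 1)
    return [round((x - lo) / step) for x in embedding]

def entropy_bits(symbols):
    """Empirical Shannon entropy: lower bound on bits/symbol for entropy coding."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# Stand-in for a 512-dim CLIP-like embedding (toy values, not a real CLIP output).
emb = [math.sin(i * 0.37) for i in range(512)]

q = quantize(emb)
bits_total = entropy_bits(q) * len(q)
# Bitrate in bpp for a 224x224 image whose semantics the embedding represents.
bpp = bits_total / (224 * 224)
```

Even this naive scalar scheme lands in the 10⁻² bpp range; the paper's reported ~2–3×10⁻³ bpp implies a substantially more aggressive learned encoder than uniform quantization.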

📝 Abstract
Recent deep learning-based methods for lossy image compression achieve competitive rate-distortion performance through extensive end-to-end training and advanced architectures. However, emerging applications increasingly prioritize semantic preservation over pixel-level reconstruction and demand robust performance across diverse data distributions and downstream tasks. These challenges call for advanced semantic compression paradigms. Motivated by the zero-shot and representational capabilities of multimodal foundation models, we propose a novel semantic compression method based on the contrastive language-image pretraining (CLIP) model. Rather than compressing images for reconstruction, we propose compressing the CLIP feature embeddings into minimal bits while preserving semantic information across different tasks. Experiments show that our method maintains semantic integrity across benchmark datasets, achieving an average bit rate of approximately 2–3 × 10⁻³ bits per pixel. This is less than 5% of the bitrate required by mainstream image compression approaches for comparable performance. Remarkably, even under extreme compression, the proposed approach exhibits zero-shot robustness across diverse data distributions and downstream tasks.
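The reported bitrate implies a very small bit budget per image. A quick sanity check, assuming a 224×224 input (typical for CLIP) and a 512-dimensional float32 embedding (CLIP ViT-B/32); these sizes are assumptions, not figures from the paper:

```python
# Bit budget at the paper's reported bitrate (midpoint of 2-3e-3 bpp).
pixels = 224 * 224            # 50176 pixels per image (assumed input size)
bpp = 2.5e-3                  # reported average bitrate, midpoint
budget_bits = pixels * bpp    # total bits available for the semantic code

# Uncompressed CLIP ViT-B/32 embedding: 512 floats at 32 bits each.
raw_bits = 512 * 32
ratio = raw_bits / budget_bits  # required compression of the embedding itself
```

Under these assumptions the whole image must fit in roughly 125 bits, so the raw embedding itself must be compressed by a factor of over a hundred, which is why quantization and entropy coding of the embedding, rather than pixel reconstruction, is the core of the method.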
Problem

Research questions and friction points this paper is trying to address.

Compressing images for semantic preservation over pixel reconstruction
Maintaining semantic integrity across diverse data distributions
Achieving extreme compression rates while preserving task performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

CLIP feature embeddings compression
Preserving semantic information across tasks
Achieving ultra-low bit rates
Ruiqi Shen
Department of Electrical and Electronic Engineering, Imperial College London
Haotian Wu
Department of Electrical and Electronic Engineering, Imperial College London
Wenjing Zhang
Department of Electrical and Electronic Engineering, Imperial College London
Jiangjing Hu
Department of Electrical and Electronic Engineering, Imperial College London
Deniz Gunduz
Professor of Information Processing, Imperial College London
Wireless Communications · Information Theory · Privacy · Machine Learning