ComNeck: Bridging Compressed Image Latents and Multimodal LLMs via Universal Transform-Neck

📅 2024-07-29
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To bridge the semantic gap between neural image compression and multimodal large language model (MLLM) visual understanding on resource-constrained edge devices, this paper proposes a general, lightweight adaptation framework that requires no fine-tuning of the downstream MLLMs. Methodologically, it introduces: (1) a transform-neck module, designed for cross-MLLM reusability, that decouples training from the downstream models; and (2) a surrogate loss targeting latent-space alignment, enabling an efficient mapping from compressed image latents to the feature space of the MLLM's visual encoder. The framework supports three application scenarios for the neural codec (pre-trained for human perception without updating, fully updated for joint human and machine perception, or fully updated for machine perception only) and demonstrates consistent rate-accuracy improvements across diverse neural codecs and MLLM-based vision tasks, at substantially lower complexity, while generalizing across MLLMs that share the same visual encoder.

📝 Abstract
This paper presents the first-ever study of adapting compressed image latents to suit the needs of downstream vision tasks that adopt Multimodal Large Language Models (MLLMs). MLLMs have extended the success of large language models to modalities (e.g. images) beyond text, but their billion scale hinders deployment on resource-constrained end devices. While cloud-hosted MLLMs could be available, transmitting raw, uncompressed images captured by end devices to the cloud requires an efficient image compression system. To address this, we focus on emerging neural image compression and propose a novel framework with a lightweight transform-neck and a surrogate loss to adapt compressed image latents for MLLM-based vision tasks. The proposed framework is generic and applicable to multiple application scenarios, where the neural image codec can be (1) pre-trained for human perception without updating, (2) fully updated for joint human and machine perception, or (3) fully updated for only machine perception. The transform-neck trained with the surrogate loss is universal, for it can serve various downstream vision tasks enabled by a variety of MLLMs that share the same visual encoder. Our framework has the striking feature of excluding the downstream MLLMs from training the transform-neck, and potentially the neural image codec as well. This stands out from most existing coding for machine approaches that involve downstream networks in training and thus could be impractical when the networks are MLLMs. Extensive experiments on different neural image codecs and various MLLM-based vision tasks show that our method achieves great rate-accuracy performance with much less complexity, demonstrating its effectiveness.
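The core idea in the abstract, training a lightweight transform-neck to map compressed image latents into the feature space of the MLLM's frozen visual encoder via a surrogate loss, without involving the downstream MLLMs, can be illustrated with a minimal sketch. This is not the paper's actual architecture: the real transform-neck is a small neural module and the alignment targets come from a visual encoder such as the one shared by the MLLMs; here the neck is reduced to a single linear map, the targets are synthetic, and all names and dimensions are illustrative.

```python
import numpy as np

# Hypothetical dimensions: d_z = codec latent size, d_e = visual-encoder feature size.
rng = np.random.default_rng(0)
d_z, d_e, n = 8, 4, 256

# Stand-ins: Z plays the role of compressed image latents; E_target plays the
# role of features from the frozen visual encoder (here generated synthetically).
Z = rng.normal(size=(n, d_z))
W_true = rng.normal(size=(d_z, d_e))
E_target = Z @ W_true

# Transform-neck reduced to one linear layer; trained with a mean-squared-error
# surrogate loss against the encoder features, with the "MLLM" never in the loop.
W = np.zeros((d_z, d_e))
lr = 0.01
for _ in range(500):
    pred = Z @ W
    grad = (2.0 / n) * Z.T @ (pred - E_target)  # gradient of the MSE surrogate loss
    W -= lr * grad

surrogate_loss = float(np.mean((Z @ W - E_target) ** 2))
```

The sketch shows the training signal the paper relies on: because the surrogate loss is computed purely in latent space, the billion-scale downstream MLLMs never need to be loaded or back-propagated through, which is what makes the approach practical when the downstream networks are MLLMs.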
Problem

Research questions and friction points this paper is trying to address.

Adapting compressed image latents for MLLMs
Efficient image transmission to cloud
Lightweight framework for MLLM vision tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adapts compressed image latents for MLLM vision tasks
Lightweight, reusable transform-neck framework
Excludes downstream MLLMs from transform-neck training