VLIC: Vision-Language Models As Perceptual Judges for Human-Aligned Image Compression

📅 2025-12-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional image compression evaluation relies on distortion metrics such as MSE, which are poorly aligned with human perceptual judgment. To address this, the paper proposes VLIC, a perception-aligned compression framework that combines a diffusion-based codec with frozen vision-language models (VLMs, e.g., CLIP or LLaVA) acting as zero-shot preference discriminators. Crucially, VLIC requires no fine-tuning or distillation of the VLM: the VLM performs binary two-alternative forced choice (2AFC) comparisons on pairs of compressed images, and the resulting preferences serve as reward signals for post-training the diffusion model. This establishes the first native, parameter-free VLM-driven perceptual guidance for compression. Extensive experiments show that VLIC achieves state-of-the-art performance on perceptual metrics, including LPIPS and DISTS, as well as in large-scale user studies, significantly outperforming both classical and learned compression methods.
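The 2AFC judging step described above can be sketched as follows. The prompt wording, answer parsing, and reward mapping here are illustrative assumptions, not the paper's exact implementation; a real system would send the prompt plus the image pair to an actual VLM.

```python
def make_2afc_prompt(criterion: str = "fidelity to the original image") -> str:
    # Hypothetical 2AFC prompt template; the paper's exact wording is not given here.
    return (
        "You are shown a reference image and two compressed versions, A and B. "
        f"Which version better preserves {criterion}? Answer with exactly 'A' or 'B'."
    )

def preference_reward(vlm_answer: str) -> tuple[float, float]:
    """Map a binary 2AFC answer to (reward_A, reward_B) for preference post-training."""
    answer = vlm_answer.strip().upper()
    if answer.startswith("A"):
        return 1.0, 0.0
    if answer.startswith("B"):
        return 0.0, 1.0
    # Unparseable answer: treat as a tie so neither sample is reinforced.
    return 0.5, 0.5
```

Because the VLM stays frozen, all learning pressure flows through these scalar rewards into the diffusion codec, rather than into a trained perceptual-loss network.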

📝 Abstract
Evaluations of image compression performance which include human preferences have generally found that naive distortion functions such as MSE are insufficiently aligned to human perception. In order to align compression models to human perception, prior work has employed differentiable perceptual losses consisting of neural networks calibrated on large-scale datasets of human psycho-visual judgments. We show that, surprisingly, state-of-the-art vision-language models (VLMs) can replicate binary human two-alternative forced choice (2AFC) judgments zero-shot when asked to reason about the differences between pairs of images. Motivated to exploit the powerful zero-shot visual reasoning capabilities of VLMs, we propose Vision-Language Models for Image Compression (VLIC), a diffusion-based image compression system designed to be post-trained with binary VLM judgments. VLIC leverages existing techniques for diffusion model post-training with preferences, rather than distilling the VLM judgments into a separate perceptual loss network. We show that calibrating this system on VLM judgments produces competitive or state-of-the-art performance on human-aligned visual compression depending on the dataset, according to perceptual metrics and large-scale user studies. We additionally conduct an extensive analysis of the VLM-based reward design and training procedure and share important insights. More visuals are available at https://kylesargent.github.io/vlic
Problem

Research questions and friction points this paper is trying to address.

Distortion metrics such as MSE are insufficiently aligned with human perception of compression quality
Can vision-language models replicate human 2AFC perceptual judgments zero-shot?
How to train a diffusion-based codec directly from binary VLM preferences
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses frozen vision-language models as zero-shot judges of perceptual image quality
Post-trains a diffusion-based codec with binary VLM preference feedback
Replaces learned perceptual-loss networks with direct VLM reward signals