SLIM: Semantic-based Low-bitrate Image compression for Machines by leveraging diffusion

📅 2025-12-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing image compression models are designed for human visual perception and preserve perceptually redundant details, which reduces bit-rate efficiency for machine vision tasks. To address this, we propose a semantic-driven low-bitrate image compression framework that integrates region-of-interest (RoI)-aware compression with text-guided latent diffusion model (LDM) reconstruction, marking the first such integration. Our approach leverages a pretrained LDM to implicitly attend to semantically critical regions without requiring explicit RoI annotations or inference-time masks. We further introduce an RoI-aware encoder and a text-conditioned U-Net denoiser, in which semantic captions guide the reconstruction process. The method retains perceptual detail for human viewers while improving downstream classification accuracy. At equal bits per pixel (bpp), it achieves higher classification accuracy than existing machine-vision-oriented compression methods.

📝 Abstract

In recent years, the demand for image compression models for machine vision has increased dramatically. However, training frameworks for image compression still target human vision and retain excessive perceptual detail, which limits how far the bits per pixel can be reduced when the goal is a machine vision task. In this paper, we propose Semantic-based Low-bitrate Image compression for Machines by leveraging diffusion, termed SLIM. SLIM is a new, effective training framework for machine-vision image compression that uses a pretrained latent diffusion model. The compressor in our method focuses only on the Region-of-Interest (RoI) areas for machine vision in the image latent, so that the latent is compressed compactly. The pretrained U-Net then enhances the decompressed latent using an RoI-focused text caption that contains semantic information about the image. As a result, SLIM focuses on the RoI areas of the image without any guide mask at the inference stage, achieving a low bitrate. SLIM also enhances the decompressed latent through denoising steps, so the final image reconstructed from the enhanced latent is optimized for the machine vision task while still containing perceptual details for human vision. Experimental results show that SLIM achieves higher classification accuracy at the same bits per pixel than conventional image compression models for machines. Code will be released upon acceptance.
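
The comparison "at the same bits per pixel" rests on a simple ratio: the total number of bits in the compressed representation divided by the number of pixels in the image. A minimal sketch of that calculation follows; the function name and the example file size are illustrative assumptions, not values from the paper.

```python
def bits_per_pixel(num_bytes: int, height: int, width: int) -> float:
    """Return bits per pixel: total compressed bits / number of pixels."""
    if height <= 0 or width <= 0:
        raise ValueError("height and width must be positive")
    return 8.0 * num_bytes / (height * width)


# Hypothetical example: a 512x512 image whose bitstream occupies 1,024 bytes.
print(bits_per_pixel(1024, 512, 512))  # 0.03125
```

Two codecs are compared fairly for machine vision when their downstream accuracy is measured at matching values of this ratio.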

Problem

Research questions and friction points this paper is trying to address.

Develops low-bitrate image compression for machine vision tasks

Focuses on Region-of-Interest areas without needing inference masks

Enhances decompressed images for both machine and human vision
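
The summary does not say how the compressor learns to concentrate on RoI areas. One plausible reading, given that no mask is needed at inference, is that a mask is used only during training to restrict the distortion term to RoI positions. The sketch below illustrates that idea on a tiny 2x2 "latent"; the function name, the binary mask, and the choice of mean squared error are all assumptions for illustration and are not taken from the paper.

```python
def roi_weighted_mse(original, reconstructed, roi_mask):
    """Mean squared error counted only at positions where roi_mask is 1.

    Assumption: a binary training-time mask; the paper's actual loss is not
    described in this summary.
    """
    total, count = 0.0, 0
    for x_row, y_row, m_row in zip(original, reconstructed, roi_mask):
        for x, y, m in zip(x_row, y_row, m_row):
            if m:
                total += (x - y) ** 2
                count += 1
    return total / count if count else 0.0


original = [[1.0, 2.0], [3.0, 4.0]]
reconstructed = [[1.0, 0.0], [2.0, 4.0]]
roi_mask = [[1, 0], [1, 1]]  # the top-right position is background

# The large background error (2.0 vs 0.0) is ignored by the masked loss.
roi_error = roi_weighted_mse(original, reconstructed, roi_mask)
```

Under such a loss, the compressor is free to spend few or no bits on background positions, which is one way a model could reach a low bitrate without a mask at inference time.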

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses a pretrained latent diffusion model for compression

Focuses compression on Region-of-Interest areas without masks

Enhances decompressed latent with semantic text captions
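
The inference-time data flow described in the abstract (decompress the latent, enhance it with caption-guided denoising steps, decode to an image) can be sketched as a small driver function. Every callable below is a hypothetical stand-in passed in by the caller; none of these names come from the paper or from any real library, and the toy lambdas only make the sketch runnable.

```python
from typing import Callable, List

Latent = List[float]


def reconstruct(
    bitstream: bytes,
    caption: str,
    decompress: Callable[[bytes], Latent],
    denoise_step: Callable[[Latent, str, int], Latent],
    decode_latent: Callable[[Latent], Latent],
    num_steps: int = 4,
) -> Latent:
    """Decompress a latent, refine it with caption-guided steps, then decode.

    Note that no RoI mask is passed in: only the bitstream and the caption.
    """
    latent = decompress(bitstream)
    for t in reversed(range(num_steps)):  # timesteps num_steps-1, ..., 0
        latent = denoise_step(latent, caption, t)
    return decode_latent(latent)


# Toy stand-ins (assumptions): bytes become floats, each step halves the
# values, and decoding is the identity.
result = reconstruct(
    bytes([8, 16]),
    "a dog",  # hypothetical RoI-focused caption
    decompress=lambda b: [float(v) for v in b],
    denoise_step=lambda z, caption, t: [v * 0.5 for v in z],
    decode_latent=lambda z: z,
    num_steps=2,
)
print(result)  # [2.0, 4.0]
```

In a real system the three callables would be the learned entropy decoder, the pretrained text-conditioned U-Net, and the latent diffusion model's image decoder, respectively.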
