🤖 AI Summary
Existing low-bitrate face image compression methods suffer from poor reconstruction quality, loss of high-frequency details, and degraded performance on downstream tasks (e.g., face recognition). To address these issues, we propose FaSDiff, a frequency-domain consistent compression framework leveraging Stable Diffusion priors. FaSDiff employs a frequency-aware compressor to decouple low- and high-frequency components, and integrates a hybrid low-frequency enhancement module with a frequency-domain modulation mechanism to jointly optimize perceptual fidelity and machine-readable semantic consistency. The method is trained end-to-end without post-processing. Extensive experiments demonstrate that FaSDiff significantly outperforms state-of-the-art approaches across multiple benchmarks: at ultra-low bitrates (0.1–0.5 bpp), it achieves PSNR/SSIM gains of 1.2–2.8 dB and improves face recognition accuracy by 3.5–7.1%. To our knowledge, FaSDiff is the first method to achieve a unified balance between visual quality and semantic usability in low-bitrate face compression.
📝 Abstract
With the widespread application of facial image data across various domains, the efficient storage and transmission of facial images has garnered significant attention. However, the existing learned face image compression methods often produce unsatisfactory reconstructed image quality at low bit rates. Simply adapting diffusion-based compression methods to facial compression tasks results in reconstructed images that perform poorly in downstream applications due to insufficient preservation of high-frequency information. To further explore the diffusion prior in facial image compression, we propose Facial Image Compression with a Stable Diffusion Prior (FaSDiff), a method that preserves consistency through frequency enhancement. FaSDiff employs a high-frequency-sensitive compressor in an end-to-end framework to capture fine image details and produce robust visual prompts. Additionally, we introduce a hybrid low-frequency enhancement module that disentangles low-frequency facial semantics and stably modulates the diffusion prior alongside visual prompts. The proposed modules allow FaSDiff to leverage diffusion priors for superior human visual perception while minimizing performance loss in machine vision due to semantic inconsistency. Extensive experiments show that FaSDiff outperforms state-of-the-art methods in balancing human visual quality and machine vision accuracy. The code will be released after the paper is accepted.