🤖 AI Summary
Existing training-free perceptual image codecs rely on diffusion inversion or sample communication, resulting in per-image decoding times on the order of minutes—severely limiting practicality. This paper proposes a training-free, encoder-agnostic perceptual enhancement method operating solely at the decoder side, compatible with arbitrary (including non-differentiable) codecs such as VTM, ELIC, and MS-ILLM. Leveraging a pre-trained unconditional generative model, the approach employs gradient approximation and iterative refinement to achieve millisecond-to-second decoding latency, while providing theoretically grounded perception-distortion trade-offs. Under decoding budgets of 0.1-10 seconds, the method matches the Fréchet Inception Distance (FID) of minute-scale alternatives and significantly outperforms conditional generative approaches including HiFiC and MS-ILLM. Moreover, it substantially improves the perceptual quality of mainstream standard codecs without modifying their encoding pipelines.
📝 Abstract
Training-free perceptual image codecs adopt a pre-trained unconditional generative model during decoding to avoid training a new conditional generative model. However, they rely heavily on diffusion inversion or sample communication, which takes from about one minute to an intractable amount of time to decode a single image. In this paper, we propose a training-free algorithm that improves the perceptual quality of any existing codec with a theoretical guarantee. We further propose different implementations for optimal perceptual quality under decoding time budgets of $\approx 0.1$ s, $0.1$-$10$ s, and $\geq 10$ s. Our approach: 1) reduces the decoding time of training-free codecs from about one minute to $0.1$-$10$ s with comparable perceptual quality; 2) can be applied to non-differentiable codecs such as VTM; 3) can improve previous perceptual codecs, such as MS-ILLM; and 4) easily achieves a perception-distortion trade-off. Empirically, we show that our approach improves the perceptual quality of ELIC, VTM, and MS-ILLM with fast decoding. It achieves FID comparable to previous training-free codecs with significantly less decoding time, and it still outperforms previous codecs based on conditional generative models, such as HiFiC and MS-ILLM, in terms of FID. The source code is provided in the supplementary material.
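To make the mechanism described above concrete, the sketch below illustrates one way a decoder-side refinement loop with an approximated gradient could look. This is not the paper's implementation: the SPSA-style finite-difference estimator, the `distortion` and `score` callables, and all step sizes are illustrative assumptions. The key idea it demonstrates is that the codec's distortion term only needs black-box evaluations, so a non-differentiable codec such as VTM can be plugged in.

```python
import numpy as np

rng = np.random.default_rng(0)

def spsa_gradient(f, x, eps=1e-3):
    # Simultaneous-perturbation (SPSA-style) gradient estimate of f at x.
    # Needs only two evaluations of f, so f may wrap a non-differentiable
    # black-box codec. (Illustrative choice; not necessarily the paper's.)
    delta = rng.choice([-1.0, 1.0], size=x.shape)
    return (f(x + eps * delta) - f(x - eps * delta)) / (2.0 * eps) * delta

def refine(x0, distortion, score, steps=200, lr=0.05):
    # Decoder-side iterative refinement (hypothetical sketch): move the
    # decoded image x toward higher likelihood under a pre-trained
    # unconditional prior (its score function) while a distortion term
    # keeps it near the codec output. The distortion gradient is
    # approximated, so the codec itself is never differentiated.
    x = x0.copy()
    for _ in range(steps):
        x = x + lr * (score(x) - spsa_gradient(distortion, x))
    return x
```

A smaller `lr` or fewer `steps` trades perceptual refinement for decoding speed, mirroring the time-budget regimes discussed in the abstract.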