🤖 AI Summary
Vector quantization (VQ)-based image reconstruction has long suffered from low codebook utilization, which ties reconstruction fidelity closely to codebook size. To address this, we propose a global-local dual-codebook co-design: a lightweight Transformer dynamically updates the global codebook to model semantic consistency, while a deterministic feature selection strategy builds a local codebook that captures fine-grained structural details. The two codebooks are optimized jointly, end to end. The method requires no pretraining and achieves, for the first time, efficient VQ reconstruction trained from scratch. With a modest codebook size of 512, it reaches significantly lower FID than state-of-the-art methods that use codebooks with a thousand or more entries, and it performs especially well on face and complex-scene reconstruction. It also reduces computational overhead by over 40%, combining high fidelity with high efficiency.
📝 Abstract
Vector Quantization (VQ) techniques face significant challenges in codebook utilization, which limits reconstruction fidelity in image modeling. We introduce a Dual Codebook mechanism that addresses this limitation by partitioning the representation into complementary global and local components. The global codebook employs a lightweight transformer to update all code vectors concurrently, while the local codebook maintains precise feature representation through deterministic selection. Both codebooks are trained from scratch, without requiring pre-trained knowledge. Experiments on multiple standard benchmark datasets demonstrate state-of-the-art reconstruction quality with a compact codebook of size 512, half the size used by previous methods that require pre-training. Our approach achieves significant FID improvements across diverse image domains and performs particularly well on scene and face reconstruction. These results establish Dual Codebook VQ as an efficient paradigm for high-fidelity image reconstruction with significantly reduced computational requirements.
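To make the terms "dual codebook" and "codebook utilization" concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the abstract does not say how the representation is partitioned, so splitting each feature vector into a first ("global") half and a second ("local") half is an assumption, as are the function names `nearest_code`, `dual_quantize`, and `utilization`. The transformer-based update of the global codebook is omitted; both codebooks are treated as fixed lookup tables, and only the nearest-neighbor lookup and the utilization measurement are shown.

```python
import random


def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vec, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def dual_quantize(features, global_cb, local_cb):
    """Quantize each feature vector against two separate codebooks.

    Assumption (not from the paper): the first half of every vector is
    the "global" part and the second half is the "local" part.
    """
    half = len(features[0]) // 2
    global_idx, local_idx, recon = [], [], []
    for f in features:
        gi = nearest_code(f[:half], global_cb)
        li = nearest_code(f[half:], local_cb)
        global_idx.append(gi)
        local_idx.append(li)
        # The reconstruction concatenates the two selected code vectors.
        recon.append(list(global_cb[gi]) + list(local_cb[li]))
    return global_idx, local_idx, recon


def utilization(indices, codebook_size):
    """Fraction of codebook entries selected at least once."""
    return len(set(indices)) / codebook_size


if __name__ == "__main__":
    rng = random.Random(0)
    dim, n_codes, n_feats = 8, 16, 200  # arbitrary illustrative sizes
    global_cb = [[rng.gauss(0, 1) for _ in range(dim // 2)] for _ in range(n_codes)]
    local_cb = [[rng.gauss(0, 1) for _ in range(dim // 2)] for _ in range(n_codes)]
    feats = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_feats)]
    g, l, _ = dual_quantize(feats, global_cb, local_cb)
    print("global utilization:", utilization(g, n_codes))
    print("local utilization:", utilization(l, n_codes))
```

A utilization near 1.0 means almost every code vector is selected at least once; low values correspond to the under-used codebooks the abstract describes as limiting reconstruction fidelity.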