🤖 AI Summary
Existing multi-codebook vector quantization methods rely on static codebooks, which limits their expressiveness on data with heterogeneous geometry, while dynamic quantizers such as QINCo suffer from strictly sequential decoding. This work proposes RQ-MoE (Residual Quantization via Mixture of Experts), a framework that brings the mixture-of-experts (MoE) mechanism into residual quantization. By combining a two-level MoE with dual-stream quantization, RQ-MoE constructs codebooks dynamically per input and decouples instruction from quantization, which enables parallel decoding. The authors show that standard residual quantization and QINCo are recovered as constrained special cases of the framework and derive a guideline for setting expert dimensionality. Experiments show that RQ-MoE achieves state-of-the-art or on-par performance in reconstruction and retrieval while decoding 6x-14x faster than prior vector quantization methods.
📝 Abstract
Vector quantization is a fundamental tool for compressing high-dimensional embeddings, yet existing multi-codebook methods rely on static codebooks that limit expressiveness under heterogeneous data geometry. While recent dynamic quantizers like QINCo adapt codebooks to individual inputs and improve expressiveness, their strict sequential dependencies create decoding bottlenecks. We propose Residual Quantization via Mixture of Experts (RQ-MoE), a framework combining a two-level MoE with dual-stream quantization to enable input-dependent codebook adaptation for efficient vector quantization. RQ-MoE enables dynamic codebook construction and decouples instruction from quantization, facilitating parallel decoding. Theoretically, we show that standard Residual Quantization and QINCo can be recovered as constrained special cases of RQ-MoE, and derive a guideline for setting expert dimensionality in RQ-MoE. Extensive experiments show that RQ-MoE achieves state-of-the-art or on-par performance in reconstruction and retrieval, while providing 6x-14x faster decoding than prior vector quantization methods. The implementation is available at https://github.com/KDEGroup/RQ-MoE.
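For background, the following is a minimal sketch of standard residual quantization with static codebooks, the baseline that the abstract describes as a constrained special case of RQ-MoE. It is not the RQ-MoE implementation: the function names (`nearest`, `rq_encode`, `rq_decode`) and the tiny hand-written codebooks are illustrative assumptions, and a real quantizer would learn its codebooks from data.

```python
def nearest(codebook, vec):
    """Return the index of the codeword closest to vec (squared L2 distance)."""
    def dist(codeword):
        return sum((a - b) ** 2 for a, b in zip(codeword, vec))
    return min(range(len(codebook)), key=lambda k: dist(codebook[k]))


def rq_encode(x, codebooks):
    """Greedy residual quantization: each stage quantizes what the previous stages left over."""
    residual = list(x)
    codes = []
    for codebook in codebooks:
        k = nearest(codebook, residual)
        codes.append(k)
        # Subtract the chosen codeword so the next stage sees the remaining error.
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return codes


def rq_decode(codes, codebooks):
    """Reconstruct by summing one codeword per stage."""
    dim = len(codebooks[0][0])
    out = [0.0] * dim
    for k, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[k])]
    return out


# Hypothetical 2-stage, 2-dimensional codebooks chosen by hand for illustration.
toy_codebooks = [
    [[0.0, 0.0], [1.0, 0.0]],   # stage 1: coarse codewords
    [[0.0, 0.0], [0.0, 0.5]],   # stage 2: finer corrections
]

x = [1.1, 0.4]
codes = rq_encode(x, toy_codebooks)
x_hat = rq_decode(codes, toy_codebooks)
print(codes, x_hat)
```

Because the codebooks here are fixed, decoding is a plain sum of looked-up codewords and the stages can be evaluated in any order. In a dynamic quantizer the codebook at a given stage depends on the input or on earlier stages, which is the source of the sequential decoding bottleneck the abstract attributes to QINCo.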