🤖 AI Summary
To address the high storage overhead and slow comparison of high-dimensional floating-point embeddings in similarity search, this paper proposes an ultra-low-bit quantization method. Using geometric modeling with high-dimensional convex polytopes, it outlines a framework that compresses each embedding dimension into a 1.58-bit ternary symbol (−1, 0, +1) while still supporting approximate similarity and distance estimation. The approach combines this symbolic encoding with convex polytope constraints so that similarities computed on the quantized vectors stay strongly correlated with the original measures, while memory footprint and comparison cost drop sharply. The reported results indicate that retrieval accuracy stays close to full-precision floating-point baselines. The work presents a compression approach for high-dimensional neural embeddings that is grounded in theory and practical to deploy, reaching very high bit efficiency with only a small loss in search quality.
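The 1.58-bit figure is the information content of one three-valued symbol. The short calculation below spells this out. The byte-packing remark and the comparison against 32-bit floats are illustrative assumptions rather than figures taken from the paper.

```latex
\[
\log_2 3 \approx 1.585 \ \text{bits per dimension},
\qquad
3^5 = 243 \le 256 = 2^8
\;\Rightarrow\;
\tfrac{8}{5} = 1.6 \ \text{bits per dimension when five symbols share one byte}
\]
\[
\text{Assuming 32-bit floats as the starting point:}\quad
\frac{32}{\log_2 3} \approx 20.2\times \ \text{ideal size reduction},
\qquad
\frac{32}{1.6} = 20\times \ \text{with byte packing}
\]
```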
📝 Abstract
Many modern search domains comprise high-dimensional vectors of floating-point numbers derived from neural networks, in the form of embeddings. Typical embeddings range in size from hundreds to thousands of dimensions, so both their storage size and the speed of comparison are significant issues. Quantisation is a class of mechanisms that replace the floating-point values with a smaller representation, for example a short integer. This gives an approximation of the embedding space in return for a smaller data representation and a faster comparison function. Here we take this idea almost to its extreme: we show how vectors of arbitrary-precision floating-point values can be replaced by vectors whose elements are drawn from the set {−1, 0, 1}. This yields very significant savings in space and metric evaluation cost, while maintaining a strong correlation for similarity measurements. This is achieved by way of a class of convex polytopes which exist in the high-dimensional space. In this article we give an outline description of these objects, and show how they can be used as the basis of such radical quantisation while maintaining a surprising degree of accuracy.
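To make the idea of ternary quantisation concrete, the following is a minimal sketch in pure Python. It does not implement the polytope-based construction, which the abstract does not specify. Instead it swaps in a simple magnitude threshold as the ternarisation rule, and the helper names (`ternarise`, `cosine`, `pearson`), the `scale` value of 0.75, and the synthetic Gaussian data are all assumptions chosen for illustration. It measures how well cosine similarity on the ternary vectors tracks cosine similarity on the original floats.

```python
import math
import random


def ternarise(vec, scale=0.75):
    """Map each component to -1, 0 or +1.

    Assumed rule (not the paper's construction): components whose magnitude
    is at most `scale` times the vector's mean absolute value become 0,
    and the rest keep only their sign.
    """
    threshold = scale * sum(abs(x) for x in vec) / len(vec)
    return [0 if abs(x) <= threshold else (1 if x > 0 else -1) for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    norm = math.sqrt(dot(a, a) * dot(b, b))
    return dot(a, b) / norm if norm else 0.0


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Synthetic stand-in for real embeddings: one query and a set of candidates
# whose similarity to the query is swept from roughly 0 up to exactly 1.
rng = random.Random(0)
dim, n_candidates = 512, 50

query = [rng.gauss(0.0, 1.0) for _ in range(dim)]
candidates = []
for k in range(n_candidates):
    a = k / (n_candidates - 1)      # weight of the query component
    b = math.sqrt(1.0 - a * a)      # weight of the independent noise
    candidates.append([a * q + b * rng.gauss(0.0, 1.0) for q in query])

ternary_query = ternarise(query)
ternary_candidates = [ternarise(c) for c in candidates]

float_sims = [cosine(query, c) for c in candidates]
ternary_sims = [cosine(ternary_query, t) for t in ternary_candidates]

correlation = pearson(float_sims, ternary_sims)
print(f"Pearson correlation, float vs ternary cosine: {correlation:.3f}")
```

Because every product in a ternary dot product is −1, 0 or +1, the comparison reduces to integer additions and subtractions, which is where the saving in metric evaluation cost comes from. Results on real embeddings, and with the polytope-based scheme, may differ from this synthetic test.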