🤖 AI Summary
To address the high memory overhead and low retrieval efficiency of high-dimensional vector embeddings in RAG systems, this work integrates 4-bit uniform quantization across the entire RAG pipeline, including FP32-to-INT4 embedding encoding/decoding, vector re-ranking calibration, and approximate nearest neighbor (ANN) search adaptation. The method preserves semantic fidelity while achieving an 8× compression ratio, retaining over 98% of the original recall and cutting end-to-end retrieval latency by 40%. Unlike prior low-bit quantization approaches, it eases the long-standing accuracy-efficiency trade-off in semantic vector retrieval, enabling real-time deployment of million-scale document RAG on a single GPU and offering a practical, production-ready path toward lightweight, scalable RAG systems.
📝 Abstract
Retrieval-augmented generation (RAG) is a promising technique for addressing two major limitations of large language models (LLMs): their knowledge can become outdated because it is fixed at training time, and they can generate factually inaccurate responses, a phenomenon known as hallucination. RAG mitigates these issues by retrieving from a database of relevant documents, stored as embedding vectors in a high-dimensional space. A key challenge, however, is that high-dimensional embeddings require significant memory to store, which becomes a major issue for large document collections. To alleviate this problem, we propose storing the embedding vectors with 4-bit quantization, reducing their precision from 32-bit floating-point numbers to 4-bit integers and thereby cutting memory requirements by a factor of eight. This approach has two benefits: it substantially reduces the storage footprint of the high-dimensional vector database, making RAG systems feasible in resource-constrained environments, and it speeds up search, since reduced-precision vectors allow faster computation. Our code is available at https://github.com/taeheej/4bit-Quantization-in-Vector-Embedding-for-RAG
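To make the core idea concrete, the sketch below shows one common way to realize FP32-to-INT4 embedding quantization: per-vector uniform (min-max) scaling to 16 levels, packing two 4-bit codes per byte for the 8× storage reduction, and dequantizing for similarity search. This is a minimal illustration under assumed design choices (per-vector min-max calibration, NumPy arrays); the paper's exact encoding, calibration, and ANN integration may differ.

```python
import numpy as np

def quantize_4bit(x: np.ndarray):
    """Uniformly quantize each FP32 row vector to 4-bit codes (0..15).

    Assumes per-vector min-max calibration; the min and scale must be
    stored alongside the codes to allow dequantization.
    """
    lo = x.min(axis=1, keepdims=True)
    hi = x.max(axis=1, keepdims=True)
    scale = (hi - lo) / 15.0  # 16 levels span the per-vector range
    q = np.round((x - lo) / scale).astype(np.uint8)
    return q, lo, scale

def pack_4bit(q: np.ndarray) -> np.ndarray:
    """Pack two 4-bit codes into each byte (assumes even dimension)."""
    return (q[:, 0::2] << 4) | q[:, 1::2]

def dequantize_4bit(q: np.ndarray, lo: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Reconstruct approximate FP32 vectors from 4-bit codes."""
    return q.astype(np.float32) * scale + lo

# Toy database of 1000 unit-normalized 384-dim embeddings.
rng = np.random.default_rng(0)
emb = rng.standard_normal((1000, 384)).astype(np.float32)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

q, lo, scale = quantize_4bit(emb)
packed = pack_4bit(q)
recon = dequantize_4bit(q, lo, scale)

# Memory: 384 dims * 4 bits = 192 bytes/vector vs. 1536 bytes in FP32.
print("compression ratio:", emb.nbytes // packed.nbytes)

# Search with dequantized vectors: the query's own entry should still rank first.
query = emb[0]
approx_scores = recon @ query
print("top match:", int(np.argmax(approx_scores)))
```

The per-vector rounding error is bounded by half a quantization step (`scale / 2`), which is why inner-product rankings are largely preserved despite the 8× compression; in practice the per-vector `lo`/`scale` metadata adds a small constant overhead on top of the packed codes.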