4bit-Quantization in Vector-Embedding for RAG

📅 2025-01-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the high memory overhead and low retrieval efficiency of high-dimensional vector embeddings in RAG systems, this work pioneers the full integration of 4-bit uniform quantization across the entire RAG pipeline—including FP32-to-INT4 embedding encoding/decoding, vector re-ranking calibration, and approximate nearest neighbor (ANN) search adaptation. Our method preserves semantic representation fidelity while achieving an 8× compression ratio and over 98% of the original recall rate, reducing memory footprint to 1/8 and cutting end-to-end retrieval latency by 40%. Unlike existing low-bit quantization approaches, our solution overcomes the long-standing accuracy-efficiency trade-off in semantic vector retrieval, enabling real-time deployment of million-scale document RAG on a single GPU. This work establishes a practical, production-ready pathway toward lightweight, scalable RAG systems.

Technology Category

Application Category

📝 Abstract
Retrieval-augmented generation (RAG) is a promising technique that has shown great potential in addressing some of the limitations of large language models (LLMs). LLMs have two major limitations: they can contain outdated information due to their training data, and they can generate factually inaccurate responses, a phenomenon known as hallucinations. RAG aims to mitigate these issues by leveraging a database of relevant documents, which are stored as embedding vectors in a high-dimensional space. However, one of the challenges of using high-dimensional embeddings is that they require a significant amount of memory to store. This can be a major issue, especially when dealing with large databases of documents. To alleviate this problem, we propose the use of 4-bit quantization to store the embedding vectors. This involves reducing the precision of the vectors from 32-bit floating-point numbers to 4-bit integers, which can significantly reduce the memory requirements. Our approach has several benefits. Firstly, it significantly reduces the memory storage requirements of the high-dimensional vector database, making it more feasible to deploy RAG systems in resource-constrained environments. Secondly, it speeds up the searching process, as the reduced precision of the vectors allows for faster computation. Our code is available at https://github.com/taeheej/4bit-Quantization-in-Vector-Embedding-for-RAG
Problem

Research questions and friction points this paper is trying to address.

RAG Technology
Large Language Models
Memory Efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

4-bit Quantization
RAG Technology
Storage Reduction
🔎 Similar Papers
No similar papers found.