🤖 AI Summary
This work addresses the representational gap between high-dimensional, high-precision embeddings used in retrieval-augmented generation (RAG) and the low-dimensional, low-bit memory arrays of compute-in-memory (CiM) hardware, which hinders RAG deployment on edge devices. To bridge this gap, the authors propose a hardware-aware joint compression and quantization (CQ) framework that introduces, for the first time, a unified CiM-oriented data reshaping methodology tailored for RAG tasks. By enabling end-to-end co-optimization of dimensionality reduction and precision quantization, the framework adapts seamlessly to diverse CiM architectures—including SRAM, ReRAM, and FeFET—significantly improving both retrieval accuracy and hardware compatibility. This approach provides an efficient, standardized embedding input scheme for a wide range of CiM systems.
📝 Abstract
Deploying Retrieval-Augmented Generation (RAG) on edge devices is in high demand but is hindered by the latency of massive data movement and computation on traditional architectures. Compute-in-Memory (CiM) architectures address this bottleneck by performing vector search directly within their crossbar structures. However, CiM adoption for RAG is limited by a fundamental "representation gap": high-precision, high-dimension embeddings are incompatible with CiM's low-precision, low-dimension array constraints. This gap is compounded by the diversity of CiM implementations (e.g., SRAM, ReRAM, FeFET), each with unique design choices (e.g., 2-bit cells, 512×512 arrays). Consequently, RAG data must be naively reshaped to fit each target implementation. Current data-shaping methods handle dimension and precision disjointly, which degrades data fidelity. This not only negates the advantages of CiM for RAG but also confuses hardware designers, making it unclear whether a failure stems from the circuit design or from degraded input data. As a result, CiM adoption remains limited. In this paper, we introduce CQ-CiM, a unified, hardware-aware data-shaping framework that jointly learns Compression and Quantization to produce CiM-compatible low-bit embeddings for diverse CiM designs. To the best of our knowledge, this is the first work to shape data for comprehensive CiM usage in RAG.
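To make the "representation gap" concrete, here is a minimal sketch (in NumPy, not from the paper) of the naive *disjoint* data shaping the abstract critiques: high-precision 768-d embeddings are first compressed to fit a hypothetical 512-column crossbar, then independently quantized to 2-bit codes to match 2-bit cells. The names `ARRAY_DIM`, `CELL_BITS`, and `reshape_for_cim` are illustrative assumptions; CQ-CiM's contribution is to learn the two steps jointly rather than applying them disjointly as done below.

```python
import numpy as np

# Hypothetical CiM array constraints (illustrative, not from the paper):
# a 512x512 crossbar with 2-bit cells, so each stored vector may have at
# most 512 dimensions and 2**2 = 4 quantization levels per dimension.
ARRAY_DIM = 512
CELL_BITS = 2

rng = np.random.default_rng(0)

def reshape_for_cim(emb, out_dim=ARRAY_DIM, bits=CELL_BITS, proj=None):
    """Naive disjoint data shaping: dimensionality reduction by random
    projection, followed by uniform quantization to 2**bits levels.
    This is the disjoint baseline whose fidelity loss motivates joint
    learning of compression and quantization."""
    if proj is None:
        proj = rng.standard_normal((emb.shape[-1], out_dim)) / np.sqrt(out_dim)
    low_dim = emb @ proj                                   # compress: d -> out_dim
    lo, hi = low_dim.min(), low_dim.max()
    levels = 2 ** bits - 1
    codes = np.round((low_dim - lo) / (hi - lo) * levels)  # quantize to 2-bit grid
    return codes.astype(np.uint8)                          # codes in {0, 1, 2, 3}

emb = rng.standard_normal((4, 768))   # four high-precision 768-d embeddings
codes = reshape_for_cim(emb)
assert codes.shape == (4, ARRAY_DIM) and codes.max() <= 3
```

Because the projection and the quantizer are chosen independently here, each step discards information without regard to the other, which is precisely the fidelity degradation the abstract attributes to current disjoint methods.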