🤖 AI Summary
With the end of Moore's Law and growing hardware heterogeneity, vector-centric AI systems (including RAG, vector search, and recommendation engines) face severe efficiency bottlenecks. This work proposes a full-stack co-optimization framework spanning algorithms, systems, and hardware. It introduces PipeRAG, RAGO, and Chameleon to raise RAG inference throughput and reduce latency; FANNS and Falcon to co-optimize quantization-based and graph-based vector search with heterogeneous accelerators; and MicroRec and FleetRec, which leverage embedding table compression, data layout restructuring, and pipelined scheduling to minimize memory overhead. The techniques span quantization-aware search, adaptation to heterogeneous architectures, and cross-layer co-design. Extensive evaluations across diverse platforms demonstrate 2.1–5.8× lower inference latency, 3.4–7.2× higher throughput, and 37%–64% reduced resource consumption, validating both effectiveness and generalizability.
📝 Abstract
Today, two major trends are shaping the evolution of ML systems. First, modern AI systems are becoming increasingly complex, often integrating components beyond the model itself. A notable example is Retrieval-Augmented Generation (RAG), which incorporates not only multiple models but also vector databases, leading to heterogeneity in both system components and underlying hardware. Second, with the end of Moore's Law, achieving high system efficiency is no longer feasible without accounting for the rapid evolution of the hardware landscape.
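To make the RAG structure described above concrete, here is a minimal sketch of a retrieval-augmented pipeline: a brute-force nearest-neighbor scan over a small vector database, followed by prompt assembly for a generator. All names (`retrieve`, `rag_prompt`, the record layout) are illustrative assumptions, not APIs from the systems discussed in this thesis, and a real deployment would use an approximate index rather than an exhaustive scan.

```python
import math

def cosine(a, b):
    # cosine similarity between two dense vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, db, k=2):
    # exhaustive scan over the vector database; approximate indexes
    # (quantization- or graph-based) replace this step at scale
    scored = sorted(db, key=lambda item: cosine(query_vec, item["vec"]),
                    reverse=True)
    return [item["text"] for item in scored[:k]]

def rag_prompt(query_text, query_vec, db, k=2):
    # prepend the retrieved passages to the user question,
    # forming the augmented prompt handed to the generator model
    context = retrieve(query_vec, db, k)
    return "Context:\n" + "\n".join(context) + f"\nQuestion: {query_text}"
```

Even this toy version shows why RAG serving is heterogeneous: the retrieval step is a memory-bound similarity search, while the generation step that consumes the prompt is compute-bound model inference, so the two favor different hardware.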
Building on the observations above, this thesis adopts a cross-stack approach to improving ML system efficiency, presenting solutions that span algorithms, systems, and hardware. First, it introduces several pioneering works on RAG serving efficiency across the computing stack: PipeRAG focuses on algorithm-level improvements, RAGO introduces system-level optimizations, and Chameleon explores heterogeneous accelerator systems for RAG. Second, this thesis investigates algorithm-hardware co-design for vector search. Specifically, FANNS and Falcon optimize quantization-based and graph-based vector search, respectively, the two most popular paradigms of retrieval algorithms. Third, this thesis addresses the serving efficiency of recommender systems, another example of vector-centric ML systems, in which memory-intensive lookups on embedding vector tables often represent a major performance bottleneck. MicroRec and FleetRec propose solutions at the hardware and system levels, respectively, optimizing both data movement and computation to enhance the efficiency of large-scale recommender models.
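The embedding-table bottleneck mentioned above can be sketched in a few lines: each sparse input triggers random row reads from large tables, so the operation is dominated by memory traffic rather than arithmetic. The function below is a hypothetical illustration of this access pattern (not code from MicroRec or FleetRec), using sum-pooling within each table and concatenation across tables, as in typical deep recommender models.

```python
def embedding_lookup(tables, indices_per_table):
    """Gather and pool embedding rows for one recommendation request.

    tables           -- list of embedding tables, each a list of rows
    indices_per_table -- for each table, the sparse feature IDs to look up
    """
    features = []
    for table, indices in zip(tables, indices_per_table):
        rows = [table[i] for i in indices]          # random-access gathers
        pooled = [sum(col) for col in zip(*rows)]   # sum-pool within a table
        features.extend(pooled)
    # concatenated dense feature vector, fed to the downstream MLP
    return features
```

Because the gathers touch scattered rows across tables that can reach hundreds of gigabytes, cache hit rates are low and the step is memory-bandwidth-bound, which is precisely the data-movement cost that hardware- and system-level redesign targets.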