🤖 AI Summary
Deploying large language models (LLMs) on edge and resource-constrained devices remains challenging because existing solutions rely on specialized hardware (e.g., GPUs) and dedicated inference frameworks. Method: This paper introduces SQL-LLM, a novel compiler that automatically translates LLM inference computation graphs into standard SQL, enabling end-to-end Transformer inference to run natively inside general-purpose relational databases. Contribution/Results: SQL-LLM is the first system to express neural operators, including matrix multiplication and attention, entirely in SQL; it introduces in-database KV caching and disk-aware inference state management; and it formalizes attention in terms of relational algebra. Experiments show that, under CPU-only and memory-constrained settings, SQL-LLM achieves up to 30× higher token-generation throughput than competitive CPU-based LLM frameworks for Llama3 inference, without any GPU dependency. It supports mainstream databases, including PostgreSQL and SQLite, significantly improving the accessibility, portability, and hardware agnosticism of LLM deployment.
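The central mapping (neural operators expressed as SQL) can be illustrated with a minimal sketch. The snippet below, which uses illustrative table and column names rather than the paper's actual schema, stores two matrices in coordinate form (one row per nonzero-style entry) and computes their product as a join on the shared index followed by a grouped aggregation:

```python
import sqlite3

# Hypothetical sketch: matrix multiplication as join + aggregation.
# Each matrix is stored in coordinate form: (row index, column index, value).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (i INTEGER, k INTEGER, val REAL);
    CREATE TABLE B (k INTEGER, j INTEGER, val REAL);
""")

# A = [[1, 2], [3, 4]], B = [[5, 6], [7, 8]]
conn.executemany("INSERT INTO A VALUES (?, ?, ?)",
                 [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 3.0), (1, 1, 4.0)])
conn.executemany("INSERT INTO B VALUES (?, ?, ?)",
                 [(0, 0, 5.0), (0, 1, 6.0), (1, 0, 7.0), (1, 1, 8.0)])

# C[i][j] = sum_k A[i][k] * B[k][j]: join on the shared index k,
# multiply the paired values, then SUM per (i, j) group.
rows = conn.execute("""
    SELECT A.i, B.j, SUM(A.val * B.val) AS val
    FROM A JOIN B ON A.k = B.k
    GROUP BY A.i, B.j
    ORDER BY A.i, B.j
""").fetchall()
print(rows)  # [(0, 0, 19.0), (0, 1, 22.0), (1, 0, 43.0), (1, 1, 50.0)]
```

Because the operator is now an ordinary query, the database's own optimizer, buffer manager, and disk paging apply to it for free, which is the property the paper exploits for memory-constrained inference.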
📝 Abstract
Serving large language models (LLMs) often demands specialized hardware, dedicated frameworks, and substantial development effort, which restricts their accessibility, especially for edge devices and organizations with limited technical resources. We propose a novel compiler that translates LLM inference graphs into SQL queries, enabling relational databases, one of the most widely used and mature classes of software systems, to serve as the runtime. By mapping neural operators such as matrix multiplication and attention onto relational primitives like joins and aggregations, our approach leverages database capabilities including disk-based data management and native caching. Supporting key transformer components, such as attention mechanisms and key-value caching, our system generates SQL pipelines for end-to-end LLM inference. Using the Llama3 family as a case study, we demonstrate up to a 30x speedup in token generation in memory-constrained scenarios compared to competitive CPU-based frameworks. Our work offers an accessible, portable, and efficient solution, facilitating the serving of LLMs across diverse deployment environments.
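The attention mapping described above can also be sketched relationally. The following single-head example (illustrative schema, not the paper's) computes QK^T as a join plus aggregation, applies a softmax per query position with a window function, and mixes V with a second join plus aggregation; a Python-registered EXP function is used because not every SQLite build ships the built-in math functions:

```python
import math
import sqlite3

# Hypothetical sketch of single-head attention as relational algebra.
# Q, K, V each hold one row per (position, dimension, value) triple.
conn = sqlite3.connect(":memory:")
conn.create_function("EXP", 1, math.exp)  # portable exp() for the softmax

conn.executescript("""
    CREATE TABLE Q (pos INTEGER, d INTEGER, val REAL);
    CREATE TABLE K (pos INTEGER, d INTEGER, val REAL);
    CREATE TABLE V (pos INTEGER, d INTEGER, val REAL);
""")
# Toy inputs: Q = K = V = 2x2 identity.
data = [(0, 0, 1.0), (0, 1, 0.0), (1, 0, 0.0), (1, 1, 1.0)]
for t in ("Q", "K", "V"):
    conn.executemany(f"INSERT INTO {t} VALUES (?, ?, ?)", data)

out = conn.execute("""
    WITH scores AS (
        -- Q @ K^T: dot products via join on the feature dimension
        SELECT Q.pos AS qp, K.pos AS kp, SUM(Q.val * K.val) AS s
        FROM Q JOIN K ON Q.d = K.d
        GROUP BY Q.pos, K.pos
    ),
    weights AS (
        -- row-wise softmax over each query position
        SELECT qp, kp, EXP(s) / SUM(EXP(s)) OVER (PARTITION BY qp) AS w
        FROM scores
    )
    -- weights @ V: second join + aggregation
    SELECT w.qp AS pos, V.d AS d, SUM(w.w * V.val) AS val
    FROM weights w JOIN V ON w.kp = V.pos
    GROUP BY w.qp, V.d
    ORDER BY pos, d
""").fetchall()
print(out)
```

This is a sketch under simplifying assumptions (no scaling by 1/sqrt(d), no causal mask, no KV-cache reuse), but it shows why attention decomposes cleanly into the joins and aggregations the compiler targets.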