🤖 AI Summary
This work addresses the high energy consumption and latency of conventional large language model (LLM) inference, which stem from frequent data movement across memory hierarchies. To overcome this, the authors propose a wafer-scale SRAM-based in-memory computing architecture that eliminates off-chip data transfers by performing all operations in situ on-chip. The design incorporates fine-grained token-level pipelining, distributed dynamic key-value (KV) cache management, and communication-aware task mapping to significantly enhance SRAM utilization and system fault tolerance. Experimental results demonstrate that the proposed architecture achieves up to 9.1× higher throughput and 17× better energy efficiency on a 13B-parameter model, with average improvements of 4.1× in throughput and 4.2× in energy efficiency.
📝 Abstract
Conventional LLM inference architectures suffer from high energy and latency due to frequent data movement across memory hierarchies. We propose Ouroboros, a wafer-scale SRAM-based Computing-in-Memory (CIM) architecture that executes all operations in situ, eliminating off-chip migration. To maximize its limited first-level capacity, we introduce three innovations:
- **Token-Grained Pipelining:** Replaces sequence-level pipelining to mitigate sequence-length variation, boosting utilization and reducing activation storage.
- **Distributed Dynamic KV Cache Management:** Decouples memory from compute to leverage fragmented SRAM for efficient KV storage.
- **Communication-Aware Mapping:** Optimizes core allocation for locality and fault tolerance across the wafer.
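The utilization benefit of token-grained pipelining can be illustrated with a toy flow-shop model (this is an illustrative sketch, not the paper's scheduler; the sequence lengths and stage count below are hypothetical). With sequence-level pipelining the granule is a whole sequence, so short sequences stall behind long ones; with token-grained pipelining each token is a granule of uniform cost, so pipeline bubbles shrink to the fill/drain overhead:

```python
def flowshop_makespan(granule_times, num_stages):
    """Classic flow-shop recurrence for a linear pipeline.

    finish[s] holds the finish time of the most recent granule at stage s.
    A granule may start at stage s only after (a) the previous granule has
    left stage s and (b) the same granule has finished stage s-1.
    """
    finish = [0.0] * num_stages
    for t in granule_times:
        for s in range(num_stages):
            start = max(finish[s], finish[s - 1] if s > 0 else 0.0)
            finish[s] = start + t
    return finish[-1]

# Hypothetical workload: 4 sequences of varying length, 4 pipeline stages.
seq_lens = [48, 96, 16, 64]
P = 4
total_tokens = sum(seq_lens)  # 224 units of work per stage

# Granule = whole sequence vs. granule = single token (unit cost each).
seq_level = flowshop_makespan(seq_lens, P)                 # 512
token_level = flowshop_makespan([1] * total_tokens, P)     # 224 + (P-1) = 227

util_seq = total_tokens / seq_level      # ~0.44
util_token = total_tokens / token_level  # ~0.99
```

In this toy model, token granules lift stage utilization from roughly 44% to 99%, which mirrors (but does not reproduce) the direction of the paper's claimed gains.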
Experimental results show Ouroboros achieves average gains of $4.1\times$ in throughput and $4.2\times$ in energy efficiency, peaking at $9.1\times$ and $17\times$ for the 13B model.
(*Because arXiv limits the Abstract field to 1,920 characters, the Abstract shown here is shortened. Please download the article for the full Abstract.)