Ouroboros: Wafer-Scale SRAM CIM with Token-Grained Pipelining for Large Language Model Inference

📅 2026-03-03
📈 Citations: 0 · Influential citations: 0
🤖 AI Summary
This work addresses the high energy consumption and latency of conventional large language model (LLM) inference, which stem from frequent data movement across the memory hierarchy. To overcome this, the authors propose a wafer-scale SRAM-based in-memory computing architecture that eliminates off-chip data transfers by performing all operations in situ on the wafer. The design combines token-grained pipelining, distributed dynamic key-value (KV) cache management, and communication-aware task mapping to raise SRAM utilization and improve system fault tolerance. Experimental results show up to 9.1× higher throughput and 17× better energy efficiency on a 13B-parameter model, with average improvements of 4.1× in throughput and 4.2× in energy efficiency.
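To make the pipelining contrast concrete, here is a minimal toy model, not the paper's scheduler, under the assumed uniform cost of one cycle per token per stage: sequence-level pipelining forwards a request to the next stage only after all of its tokens finish, so length variation stalls downstream stages, while token-grained pipelining advances individual tokens and keeps the pipeline full.

```python
def sequence_level_makespan(seq_lengths, num_stages):
    """Flow-shop recurrence: a stage forwards a request only after
    processing ALL of its tokens, so the stage time for request r
    equals its length (assuming 1 cycle per token per stage)."""
    finish = [0] * num_stages  # finish[s] = cycle when stage s frees up
    for length in seq_lengths:
        for s in range(num_stages):
            start = max(finish[s], finish[s - 1] if s > 0 else 0)
            finish[s] = start + length
    return finish[-1]


def token_grained_makespan(seq_lengths, num_stages):
    """Tokens advance independently: one token enters per cycle, the
    last one drains after num_stages - 1 cycles, so length variation
    between requests creates no bubbles."""
    return sum(seq_lengths) + num_stages - 1


lengths = [128, 8, 512, 32]  # a batch with large length variation
print(sequence_level_makespan(lengths, num_stages=4))  # 2216 cycles
print(token_grained_makespan(lengths, num_stages=4))   # 683 cycles
```

In this toy model the mixed-length batch finishes more than 3× faster at token granularity; that gap is the pipeline bubble (and the per-sequence activation buffering) the paper's token-grained scheme targets.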

📝 Abstract
Conventional LLM inference architectures suffer from high energy consumption and latency due to frequent data movement across memory hierarchies. We propose Ouroboros, a wafer-scale SRAM-based Computing-in-Memory (CIM) architecture that executes all operations in situ, eliminating off-chip data migration. To maximize its limited first-level capacity, we introduce three innovations:

Token-Grained Pipelining: replaces sequence-level pipelining to mitigate sequence-length variation, boosting utilization and reducing activation storage.
Distributed Dynamic KV Cache Management: decouples memory from compute to leverage fragmented SRAM for efficient KV storage.
Communication-Aware Mapping: optimizes core allocation for locality and fault tolerance across the wafer.

Experimental results show Ouroboros achieves average gains of $4.1\times$ in throughput and $4.2\times$ in energy efficiency, peaking at $9.1\times$ and $17\times$ for the 13B model. (*Because arXiv limits the Abstract field to 1,920 characters, the abstract shown here is shortened; please download the article for the full version.)
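As a rough sketch of the distributed dynamic KV cache idea, the code below pools leftover per-core SRAM into one logical, paged KV store with a page table mapping KV blocks to cores. The `Core` and `DistributedKVCache` names, the block granularity, and the greedy most-free-first placement are illustrative assumptions, not the paper's actual allocator.

```python
from dataclasses import dataclass


@dataclass
class Core:
    core_id: int
    free_kv_blocks: int  # leftover SRAM (in KV-block units) after weights


class DistributedKVCache:
    """Pools fragmented per-core SRAM into one logical KV store via a
    page table mapping (request, block) -> core. Illustrative only."""

    def __init__(self, cores):
        self.cores = {c.core_id: c for c in cores}
        self.page_table = {}  # (request_id, block_idx) -> core_id

    def alloc_block(self, request_id, block_idx):
        # Greedy placement (assumed policy): pick the core with the most
        # free blocks, spreading KV state across the wafer's fragments.
        core = max(self.cores.values(), key=lambda c: c.free_kv_blocks)
        if core.free_kv_blocks == 0:
            raise MemoryError("wafer-wide KV pool exhausted")
        core.free_kv_blocks -= 1
        self.page_table[(request_id, block_idx)] = core.core_id
        return core.core_id

    def free_request(self, request_id):
        # A finished request returns its blocks, leaving usable fragments
        # for later requests (the "dynamic" part of the management).
        for key in [k for k in self.page_table if k[0] == request_id]:
            self.cores[self.page_table.pop(key)].free_kv_blocks += 1


# Example: three cores with uneven leftover SRAM serve one request.
pool = DistributedKVCache([Core(0, 2), Core(1, 5), Core(2, 1)])
placement = [pool.alloc_block("req-A", i) for i in range(4)]
print(placement)  # [1, 1, 1, 0] -- blocks spread by free capacity
pool.free_request("req-A")
```

The point of the sketch is the decoupling: which core computes attention for a request is independent of which cores hold its KV blocks, so SRAM fragments too small for weights still contribute to KV capacity.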
Problem

Research questions and friction points this paper is trying to address.

Large Language Model Inference
Computing-in-Memory
Energy Efficiency
Data Movement
Latency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Token-Grained Pipelining
Distributed Dynamic KV Cache
Communication-Aware Mapping
Wafer-Scale SRAM CIM
In-Memory Computing
👥 Authors
Yiqi Liu
SKLP, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
Yudong Pan
SKLP, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
Mengdi Wang
Institute of Computing Technology, Chinese Academy of Sciences
accelerator architecture design · multi-core system
Shixin Zhao
SKLP, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
Haonan Zhu
SKLP, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Hangzhou, China
Yinhe Han
SKLP, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Lei Zhang
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences
Agentic Coding · Reinforcement Learning · Large Language Model
Ying Wang
Institute of Computing Technology, Chinese Academy of Sciences
Reliable Computer Architecture · VLSI design · Machine learning · Memory system