🤖 AI Summary
This work addresses the high energy consumption and latency of conventional large language model (LLM) inference, which stem from frequent data movement across memory hierarchies. To overcome this, the authors propose a wafer-scale SRAM-based in-memory computing architecture that eliminates off-chip data transfers by performing all operations in situ on-chip. The design incorporates fine-grained token-level pipelining, distributed dynamic key-value (KV) cache management, and communication-aware task mapping to significantly enhance SRAM utilization and system fault tolerance. Experimental results demonstrate that the proposed architecture achieves up to 9.1× higher throughput and 17× better energy efficiency on a 13B-parameter model, with average improvements of 4.1× in throughput and 4.2× in energy efficiency.
📝 Abstract
Conventional LLM inference architectures suffer from high energy and latency due to frequent data movement across memory hierarchies. We propose Ouroboros, a wafer-scale SRAM-based Computing-in-Memory (CIM) architecture that executes all operations in situ, eliminating off-chip migration. To maximize its limited first-level capacity, we introduce three innovations:
- **Token-Grained Pipelining:** Replaces sequence-level pipelining to mitigate sequence-length variation, boosting utilization and reducing activation storage.
- **Distributed Dynamic KV Cache Management:** Decouples memory from compute to leverage fragmented SRAM for efficient KV storage.
- **Communication-Aware Mapping:** Optimizes core allocation for locality and fault tolerance across the wafer.
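The utilization benefit of token-grained pipelining can be illustrated with a toy flow-shop model (this is an illustrative sketch, not the paper's scheduler; the sequence lengths and stage count below are hypothetical). With sequence-level pipelining the granule is a whole sequence, so short sequences stall behind long ones; with token-grained pipelining each token is a granule of uniform cost, so pipeline bubbles shrink to the fill/drain overhead:

```python
def flowshop_makespan(granule_times, num_stages):
    """Classic flow-shop recurrence for a linear pipeline.

    finish[s] holds the finish time of the most recent granule at stage s.
    A granule may start at stage s only after (a) the previous granule has
    left stage s and (b) the same granule has finished stage s-1.
    """
    finish = [0.0] * num_stages
    for t in granule_times:
        for s in range(num_stages):
            start = max(finish[s], finish[s - 1] if s > 0 else 0.0)
            finish[s] = start + t
    return finish[-1]

# Hypothetical workload: 4 sequences of varying length, 4 pipeline stages.
seq_lens = [48, 96, 16, 64]
P = 4
total_tokens = sum(seq_lens)  # 224 units of work per stage

# Granule = whole sequence vs. granule = single token (unit cost each).
seq_level = flowshop_makespan(seq_lens, P)                 # 512
token_level = flowshop_makespan([1] * total_tokens, P)     # 224 + (P-1) = 227

util_seq = total_tokens / seq_level      # ~0.44
util_token = total_tokens / token_level  # ~0.99
```

In this toy model, token granules lift stage utilization from roughly 44% to 99%, which mirrors (but does not reproduce) the direction of the paper's claimed gains.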
Experimental results show Ouroboros achieves average gains of $4.1\times$ in throughput and $4.2\times$ in energy efficiency, peaking at $9.1\times$ and $17\times$ for the 13B model.
(*Because arXiv limits the Abstract field to 1,920 characters, the Abstract shown here is shortened. Please download the article for the full Abstract.)