31.1 A 14.08-to-135.69Token/s ReRAM-on-Logic Stacked Outlier-Free Large-Language-Model Accelerator with Block-Clustered Weight-Compression and Adaptive Parallel-Speculative-Decoding

📅 2026-05-10

🤖 AI Summary

This work addresses three challenges of large language model (LLM) inference: high computational demand, memory-bandwidth bottlenecks, and outliers that limit low-bit quantization. It proposes a ReRAM-on-Logic stacked accelerator that combines a local rotation unit for outlier-free low-bit quantization, block-wise clustered weight compression co-designed with a stacking-aware processing-near-memory (PNM) architecture, and an adaptive parallel speculative decoding scheme with an out-of-order scheduler. Implemented in 55 nm CMOS technology, the prototype chip achieves an inference throughput of 14.08 to 135.69 tokens per second, a 4.46× to 7.17× speedup over vanilla speculative decoding.
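To make the idea of rotation-based outlier suppression concrete, here is a minimal sketch. It assumes the rotation is an orthonormal Walsh-Hadamard transform applied per block, which is a common choice in rotation-based quantization but is not confirmed by the summary as what the chip's local rotation unit implements. The function names, the 4-bit setting, and the example block are all hypothetical.

```python
import math

def hadamard_rotate(block):
    """Apply an orthonormal Walsh-Hadamard rotation to a power-of-two-length block.

    Assumption: a Hadamard rotation stands in for the paper's (unspecified) rotation.
    """
    n = len(block)
    assert n > 0 and n & (n - 1) == 0, "block length must be a power of two"
    out = list(block)
    h = 1
    while h < n:
        # Butterfly stage: combine pairs that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # normalization makes the transform orthonormal
    return [v * scale for v in out]

def quantize_dequantize(block, bits=4):
    """Symmetric per-block uniform quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in block)
    if peak == 0.0:
        return list(block)
    step = peak / qmax
    return [round(v / step) * step for v in block]

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Hypothetical block: moderate values plus one large outlier.
block = [0.5, -0.4, 0.3, -0.5, 0.4, -0.3, 0.5, 8.0]

# Quantizing directly: the outlier sets the step size and the other values collapse to zero.
direct_error = mse(block, quantize_dequantize(block))

# Quantizing after rotation: the outlier's energy is spread across the block first.
rotated = hadamard_rotate(block)
# The normalized Hadamard matrix is its own inverse, so rotating again undoes the rotation.
recovered = hadamard_rotate(quantize_dequantize(rotated))
rotated_error = mse(block, recovered)

print(f"peak before rotation: {max(abs(v) for v in block):.3f}")
print(f"peak after rotation:  {max(abs(v) for v in rotated):.3f}")
print(f"4-bit MSE, direct:  {direct_error:.4f}")
print(f"4-bit MSE, rotated: {rotated_error:.4f}")
```

In this toy block the rotation lowers the peak magnitude, so the quantization step shrinks and the reconstruction error drops. Real accelerators would apply the rotation to activations or weights in hardware, and the block size and bit width here are illustrative only.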

📝 Abstract

This work presents a 55 nm speculative decoding-based LLM accelerator with bumping-based face-to-face ReRAM-on-logic stacking technology. It features a local rotation unit for outlier-free low-bit quantization, a stacking-aware PNM architecture co-designed with blockwise vector quantization to reduce the overhead of weight external memory access (EMA), and an adaptive parallel speculative decoding scheme with an out-of-order scheduler for high resource and bandwidth utilization. Our chip achieves 14.08-to-135.69 token/s and a 4.46-to-7.17× speedup over vanilla speculative decoding.
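For readers unfamiliar with the baseline, the following is a minimal sketch of vanilla (greedy) speculative decoding, the scheme the reported speedup is measured against. It does not model the paper's adaptive parallel variant or its out-of-order scheduler. The toy `target_next` and `draft_next` models and all parameter names are hypothetical stand-ins.

```python
def speculative_decode(target_next, draft_next, prompt, max_new_tokens, draft_len=4):
    """Greedy speculative decoding: a cheap draft model proposes, the target verifies.

    target_next and draft_next are hypothetical callables that map a token list
    to the next token. Returns the generated sequence and the number of
    target verification passes.
    """
    tokens = list(prompt)
    target_passes = 0
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # 1. The draft model proposes draft_len tokens autoregressively.
        proposal = []
        for _ in range(draft_len):
            proposal.append(draft_next(tokens + proposal))
        # 2. The target model checks every proposed position.
        #    On real hardware this is one batched pass; here it is a loop.
        target_passes += 1
        accepted = []
        for guess in proposal:
            expected = target_next(tokens + accepted)
            if guess != expected:
                accepted.append(expected)  # keep the target's token and stop
                break
            accepted.append(guess)
        else:
            # Every guess matched, so the same pass also yields one bonus token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:goal], target_passes

# Toy models (assumptions for illustration): the target counts modulo 10,
# and the draft agrees except right after the token 6.
def target_next(tokens):
    return (tokens[-1] + 1) % 10

def draft_next(tokens):
    return 0 if tokens[-1] == 6 else (tokens[-1] + 1) % 10

tokens, passes = speculative_decode(target_next, draft_next, [0], max_new_tokens=12)
print(tokens)  # matches what the target alone would generate greedily
print(passes)  # 3 verification passes instead of 12 sequential target steps
```

The output is identical to plain greedy decoding with the target model, but the target is invoked once per accepted run of tokens rather than once per token. The gain depends on how often the draft agrees with the target, which is why utilization-oriented refinements such as the ones described in the abstract matter.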

Problem

Research questions and friction points this paper is trying to address.

Large Language Model

Outlier-Free Quantization

Weight Compression

Speculative Decoding

ReRAM-on-Logic

Innovation

Methods, ideas, or system contributions that make the work stand out.

ReRAM-on-Logic stacking

outlier-free quantization

blockwise vector quantization