31.1 A 14.08-to-135.69Token/s ReRAM-on-Logic Stacked Outlier-Free Large-Language-Model Accelerator with Block-Clustered Weight-Compression and Adaptive Parallel-Speculative-Decoding

📅 2026-05-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work targets three obstacles to large language model (LLM) inference: high computational demand, memory-bandwidth bottlenecks, and outliers that limit low-bit quantization. It proposes a ReRAM-on-logic stacked accelerator that combines a local rotation unit for outlier-free low-bit quantization, a stacking-aware processing-near-memory (PNM) architecture co-designed with blockwise vector quantization of the weights, and an adaptive parallel speculative decoding scheme with an out-of-order scheduler. Implemented in a 55 nm process, the chip achieves 14.08 to 135.69 tokens per second and a 4.46× to 7.17× speedup over vanilla speculative decoding.
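The idea behind rotating data before low-bit quantization can be illustrated with a small sketch. It assumes a Hadamard rotation and symmetric 4-bit per-vector quantization; the source does not say which rotation the chip's local rotation unit applies, and the input vector, function names, and bit width below are hypothetical. The rotation spreads a single outlier's energy across all elements, so the quantization step no longer has to be sized for one extreme value.

```python
import math

def hadamard(n):
    """Build the n x n orthonormal Hadamard matrix (n must be a power of two)."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    scale = 1.0 / math.sqrt(n)
    return [[v * scale for v in row] for row in h]

def matvec(m, x):
    """Multiply matrix m (list of rows) by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]

def fake_quantize(x, bits=4):
    """Symmetric per-vector quantization to `bits` bits, then dequantization."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in x) / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in x]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

# Hypothetical vector with one large outlier (32.0); not data from the paper.
x = [1.8, -2.1, 1.5, 32.0, -1.7, 2.0, -1.9, 1.6]
H = hadamard(len(x))

# Baseline: quantize directly, so the outlier dictates the step size.
direct_err = mse(x, fake_quantize(x))

# Rotate, quantize, rotate back. This H is symmetric and orthonormal,
# so applying it a second time inverts the rotation.
rotated = matvec(H, x)
restored = matvec(H, fake_quantize(rotated))
rotated_err = mse(x, restored)

print(f"max |value| before rotation: {max(abs(v) for v in x):.2f}")
print(f"max |value| after rotation:  {max(abs(v) for v in rotated):.2f}")
print(f"4-bit MSE without rotation:  {direct_err:.3f}")
print(f"4-bit MSE with rotation:     {rotated_err:.3f}")
```

In this toy case the small entries all round to zero without rotation, whereas after rotation every element sits in a similar range and the reconstruction error drops.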
📝 Abstract
This work presents a 55 nm speculative-decoding-based LLM accelerator with bumping-based face-to-face ReRAM-on-logic stacking technology. It features a local rotation unit for outlier-free low-bit quantization, a stacking-aware processing-near-memory (PNM) architecture co-designed with blockwise vector quantization to reduce weight external memory access (EMA) overheads, and an adaptive parallel speculative decoding scheme with an out-of-order scheduler for high resource and bandwidth utilization. Our chip achieves 14.08-to-135.69 token/s and a 4.46-to-7.17× speedup over vanilla speculative decoding.
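How blockwise vector quantization cuts weight traffic can be sketched as follows. This is a minimal illustration, not the paper's algorithm: it assumes a plain k-means codebook, 2-element blocks, 4 codebook entries, and 16-bit original weights, and the weight values are hypothetical. Each block of weights is replaced by a short index into a shared codebook, so only the indices (plus the codebook, once) need to be fetched.

```python
import math

def kmeans(vectors, k, iters=10):
    """Plain Lloyd's k-means, initialized deterministically from the first k vectors."""
    centroids = [list(v) for v in vectors[:k]]
    assign = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: map each block to its nearest centroid.
        for i, v in enumerate(vectors):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign

BLOCK = 2         # weights per block (assumed)
K = 4             # codebook entries (assumed)
WEIGHT_BITS = 16  # bits per uncompressed weight (assumed)

# Hypothetical weights; not data from the paper.
weights = [1.0, 1.1, -1.0, -0.9, 0.9, -1.0, -1.1, 1.0,
           1.1, 0.9, -0.9, -1.1, 1.0, -0.9, -1.0, 1.1]

blocks = [weights[i:i + BLOCK] for i in range(0, len(weights), BLOCK)]
codebook, indices = kmeans(blocks, K)

# Decoding is a table lookup: each index expands to one codebook block.
decoded = [v for idx in indices for v in codebook[idx]]
max_err = max(abs(a - b) for a, b in zip(weights, decoded))

original_bits = len(weights) * WEIGHT_BITS
index_bits = len(blocks) * math.ceil(math.log2(K))
codebook_bits = K * BLOCK * WEIGHT_BITS

print("indices:", indices)
print(f"max reconstruction error: {max_err:.3f}")
print(f"original: {original_bits} bits, indices: {index_bits} bits, "
      f"codebook: {codebook_bits} bits")
```

In a toy example this small the codebook dominates the compressed size; in a real layer the same codebook would be shared by far more blocks, so the per-block index cost is what matters for memory traffic.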
Problem

Research questions and friction points this paper is trying to address.

Large Language Model
Outlier-Free Quantization
Weight Compression
Speculative Decoding
ReRAM-on-Logic
Innovation

Methods, ideas, or system contributions that make the work stand out.

ReRAM-on-Logic stacking
outlier-free quantization
blockwise vector quantization
adaptive parallel speculative decoding
out-of-order scheduler
Pingcheng Dong
Hong Kong University of Science and Technology
AI Chip, Model Compression, HW/SW Co-Design
Yonghao Tan
The Hong Kong University of Science and Technology
AI Accelerator, Computer Vision, VLSI
Xuejiao Liu
AI Chip Center for Emerging Smart System, Hong Kong, China
Peng Luo
MIT
Spatial Data Science, Spatial Statistics, Spatial Analysis, GeoAI, GIScience
Yu Liu
Assistant Professor, Department of Computing, Hong Kong Polytechnic University
Edge AI, Distributed Quantum Computing
Di Pang
AI Chip Center for Emerging Smart System, Hong Kong, China
Songchen Ma
Hong Kong University of Science and Technology, Hong Kong, China; AI Chip Center for Emerging Smart System, Hong Kong, China
Xijie Huang
Hong Kong University of Science and Technology
Efficient Deep Learning, Model Compression
Shih-Yang Liu
PhD Student @ HKUST, NVIDIA Research
Efficient Deep Learning
Dong Zhang
Hong Kong University of Science and Technology, Hong Kong, China; AI Chip Center for Emerging Smart System, Hong Kong, China
Zhichao Lu
City University of Hong Kong
Evolutionary Computation, Bilevel Optimization, Neural Architecture Search
Luhong Liang
AI Chip Center for Emerging Smart System, Hong Kong, China
Chi-Ying Tsui
Hong Kong University of Science and Technology, Hong Kong, China; AI Chip Center for Emerging Smart System, Hong Kong, China
Fengbin Tu
Assistant Professor at HKUST
AI Chip, Computing-in-Memory, Computer Architecture, Reconfigurable Computing
Liang Zhao
Zhejiang University, Hangzhou, China
Kwang-Ting Cheng
Hong Kong University of Science and Technology, Hong Kong, China; AI Chip Center for Emerging Smart System, Hong Kong, China