PRISM: Parametrically Refactoring Inference for Speculative Sampling Draft Models

📅 2026-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high inference latency of large language models caused by autoregressive decoding, and the substantial computational overhead that existing speculative decoding methods incur when they enlarge draft models to improve draft quality. The authors propose PRISM, an architecture that decouples model parameters and restructures computational pathways so that each prediction step runs on a distinct parameter subset, disentangling model capacity from inference cost. PRISM significantly increases accepted sequence length while keeping draft latency low, and scales more effectively as training data grows. Integrated into a highly optimized inference engine, PRISM improves end-to-end decoding throughput by more than 2.6×, outperforming all existing draft model architectures.
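
As a rough illustration of the idea in the summary, the sketch below gives each drafted position its own small weight vector, so adding lookahead steps adds capacity without increasing the work done per step. All names here (`PerStepDraft`, `step_weights`, the toy scoring rule) are hypothetical stand-ins for illustration, not PRISM's actual design, which the paper defines.

```python
import random

class PerStepDraft:
    """Toy draft model where prediction step i consults only parameter subset i."""

    def __init__(self, vocab: int, steps: int, dim: int = 8, seed: int = 0):
        rng = random.Random(seed)
        # One independent weight vector per lookahead step: total capacity
        # grows with `steps`, but each step touches only `dim` parameters,
        # so per-step draft latency stays flat.
        self.step_weights = [
            [rng.uniform(-1, 1) for _ in range(dim)] for _ in range(steps)
        ]
        self.vocab = vocab

    def predict(self, hidden: list, step: int) -> int:
        w = self.step_weights[step]  # distinct parameter subset per step
        score = sum(h * wi for h, wi in zip(hidden, w))
        return int(abs(score) * 1000) % self.vocab  # toy token projection

# Usage: four drafted tokens, each produced by its own parameter subset.
draft = PerStepDraft(vocab=50_000, steps=4)
hidden = [0.1, -0.3, 0.5, 0.2, -0.1, 0.4, 0.0, 0.25]
print([draft.predict(hidden, step=i) for i in range(4)])
```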

📝 Abstract
Large Language Models (LLMs), constrained by their auto-regressive nature, suffer from slow decoding. Speculative decoding methods have emerged as a promising solution to accelerate LLM decoding, attracting attention from both systems and AI research communities. Recently, the pursuit of better draft quality has driven a trend toward parametrically larger draft models, which inevitably introduces substantial computational overhead. While existing work attempts to balance the trade-off between prediction accuracy and compute latency, we address this fundamental dilemma through architectural innovation. We propose PRISM, which disaggregates the computation of each predictive step across different parameter sets, refactoring the computational pathways of draft models to successfully decouple model capacity from inference cost. Through extensive experiments, we demonstrate that PRISM outperforms all existing draft architectures, achieving exceptional acceptance lengths while maintaining minimal draft latency for superior end-to-end speedup. We also re-examine scaling laws with PRISM, revealing that PRISM scales more effectively with expanding data volumes than other draft architectures. Through rigorous and fair comparison, we show that PRISM boosts the decoding throughput of an already highly optimized inference engine by more than 2.6x.
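
For context on the accepted-length metric the abstract optimizes, below is a minimal greedy-verification sketch of the draft-then-verify loop that speculative decoding methods share. `draft_model` and `target_model` are stand-in callables (assumptions for illustration, not an API from the paper), and a real engine would score all drafted positions in one batched target pass rather than looping.

```python
from typing import Callable, List

def speculative_step(
    prefix: List[int],
    draft_model: Callable[[List[int]], int],   # cheap next-token predictor
    target_model: Callable[[List[int]], int],  # expensive reference model
    k: int = 4,                                # number of drafted tokens
) -> List[int]:
    """Draft k tokens cheaply, then keep the longest prefix the target agrees with."""
    # 1. Autoregressively draft k candidate tokens with the cheap model.
    seq = list(prefix)
    for _ in range(k):
        seq.append(draft_model(seq))
    candidates = seq[len(prefix):]

    # 2. Verify: accept drafted tokens until the target disagrees; on a
    #    mismatch, emit the target's token instead and stop.
    accepted: List[int] = []
    for tok in candidates:
        if target_model(prefix + accepted) == tok:
            accepted.append(tok)
        else:
            accepted.append(target_model(prefix + accepted))
            break
    else:
        # All k drafts accepted: the target's pass yields one bonus token.
        accepted.append(target_model(prefix + accepted))
    return accepted  # len(accepted) is the per-step accepted length

# Toy usage: a draft model that agrees with the target most of the time.
target = lambda seq: (seq[-1] + 1) % 100
drafter = lambda seq: (seq[-1] + 1) % 100 if len(seq) % 5 else 0
print(speculative_step([1, 2, 3], drafter, target, k=4))  # -> [4, 5, 6]
```

The speedup hinges on the accepted length staying large while each draft call stays cheap, which is exactly the trade-off the abstract says PRISM's decoupling targets.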
Problem

Research questions and friction points this paper is trying to address.

speculative decoding
large language models
decoding acceleration
computational overhead
draft models
Innovation

Methods, ideas, or system contributions that make the work stand out.

speculative decoding
draft model
computational disaggregation
inference acceleration
scaling laws
👥 Authors

Xuliang Wang
David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada; Central China Institute of Artificial Intelligence, Zhengzhou, Henan, China

Yuetao Chen
The Chinese University of Hong Kong, Hong Kong SAR, China

Maochan Zhen
Central China Institute of Artificial Intelligence, Zhengzhou, Henan, China

Fang Liu
Central China Institute of Artificial Intelligence, Zhengzhou, Henan, China

Xinzhou Zheng
University of Science and Technology of China, Hefei, Anhui, China

Xingwu Liu
Dalian University of Technology, Dalian, Liaoning, China; Central China Institute of Artificial Intelligence, Zhengzhou, Henan, China

Hong Xu
Associate Professor, Computer Science and Engineering, Chinese University of Hong Kong
Research areas: Systems; Networking

Ming Li
University Professor, University of Waterloo
Research areas: Bioinformatics; machine learning; Kolmogorov complexity; few-shot learning; information distance