NEZHA: A Zero-sacrifice and Hyperspeed Decoding Architecture for Generative Recommendations

πŸ“… 2025-11-24
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Generative recommender systems (GRS) suffer from high inference latency due to large language model (LLM) autoregressive decoding, hindering real-time, high-throughput deployment. Method: We propose an ultra-fast decoding architecture that requires no auxiliary models: (1) integrating a lightweight autoregressive draft head *within* the main LLM for zero-overhead β€œself-drafting”; (2) designing a model-free verifier based on hash sets to suppress hallucinations and eliminate dependencies on separate draft or verification models; and (3) adopting a sequence-to-sequence prompting structure to ensure generation completeness. Contribution/Results: Our approach achieves significant inference speedup on public benchmarks while preserving recommendation quality. Deployed on Taobao in October 2025, it serves hundreds of millions of users daily in real time and drives billions of RMB in advertising revenue.
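The model-free verifier described above can be sketched as a plain hash-set membership check over drafted item IDs: any drafted item absent from the catalog is treated as a hallucination and truncates the accepted prefix. This is a minimal illustrative sketch; the function and item names are hypothetical, not the paper's actual implementation.

```python
def verify_draft(drafted_items, valid_item_ids):
    """Accept the longest prefix of drafted items that exist in the catalog.

    Hash-set membership is O(1) per item, so verification needs no extra
    model forward pass. (Illustrative sketch; names are hypothetical.)
    """
    accepted = []
    for item in drafted_items:
        if item not in valid_item_ids:
            break  # first out-of-catalog (hallucinated) item stops acceptance
        accepted.append(item)
    return accepted


# Toy catalog of valid item IDs
catalog = {"item_42", "item_7", "item_99"}

verify_draft(["item_42", "item_7", "item_X"], catalog)
# β†’ ["item_42", "item_7"]  ("item_X" is not in the catalog)
```

Because the verifier is a data structure rather than a model, it adds negligible latency and requires no training, which is the stated motivation for replacing model-based verification.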

πŸ“ Abstract
Generative Recommendation (GR), powered by Large Language Models (LLMs), represents a promising new paradigm for industrial recommender systems. However, its practical application is severely hindered by high inference latency, which makes it infeasible for high-throughput, real-time services and limits its overall business impact. While Speculative Decoding (SD) has been proposed to accelerate the autoregressive generation process, existing implementations introduce new bottlenecks: they typically require separate draft models and model-based verifiers, which demand additional training and add latency overhead. In this paper, we address these challenges with NEZHA, a novel architecture that achieves hyperspeed decoding for GR systems without sacrificing recommendation quality. Specifically, NEZHA integrates a nimble autoregressive draft head directly into the primary model, enabling efficient self-drafting. This design, combined with a specialized input prompt structure, preserves the integrity of sequence-to-sequence generation. Furthermore, to tackle the critical problem of hallucination, a major source of performance degradation, we introduce an efficient, model-free verifier based on a hash set. We demonstrate the effectiveness of NEZHA through extensive experiments on public datasets and have successfully deployed the system on Taobao since October 2025, where it drives billions of RMB in advertising revenue and serves hundreds of millions of daily active users.
Problem

Research questions and friction points this paper is trying to address.

Reducing high inference latency in generative recommendation systems
Eliminating separate draft models and model-based verifiers
Addressing hallucination issues in speculative decoding methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates autoregressive draft head into primary model
Uses specialized input prompt for sequence generation
Implements model-free hash set verifier against hallucination
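The self-drafting idea behind the first two innovations can be illustrated with a toy speculative-decoding loop: a cheap draft head proposes several next tokens, and the main model then checks them, keeping the matching prefix and substituting its own token at the first mismatch. The `main_next`/`draft_next` functions below are stand-in toys, assumed for illustration only; they do not reflect NEZHA's actual models.

```python
def main_next(seq):
    """Toy 'main model': deterministically predicts last token + 1."""
    return seq[-1] + 1


def draft_next(seq):
    """Toy 'draft head': cheap and mostly right, but errs when
    the correct next token would be 3 (simulating a draft mistake)."""
    n = seq[-1] + 1
    return 99 if n == 3 else n


def speculative_step(prefix, k=4):
    """One speculative-decoding step: draft k tokens, then verify.

    Illustrative sketch only. The main model's parallel scoring of all
    drafted positions is simulated here by per-position greedy calls.
    """
    # Self-drafting: the draft head extends the context autoregressively.
    drafted, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        drafted.append(t)
        ctx.append(t)
    # Verification: accept drafted tokens while they match the main model;
    # at the first mismatch, emit the main model's token and stop.
    accepted = []
    for t in drafted:
        target = main_next(prefix + accepted)
        if t != target:
            accepted.append(target)
            break
        accepted.append(t)
    return accepted


speculative_step([0])
# β†’ [1, 2, 3]  (drafts [1, 2, 99, ...]; 99 is rejected and corrected to 3)
```

Note the guarantee that makes SD "zero-sacrifice": every emitted token either matches or is produced by the main model, so output quality is unchanged while multiple tokens can be accepted per main-model pass.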
Yejing Wang, City University of Hong Kong
Shengyu Zhou, Alibaba Group
Jinyu Lu, Alibaba Group
Ziwei Liu, Associate Professor, Nanyang Technological University (Computer Vision, Machine Learning, Computer Graphics)
Langming Liu, PhD, City University of Hong Kong (Recommendation, Large Language Models, Federated Learning)
Maolin Wang, City University of Hong Kong
Wenlin Zhang, City University of Hong Kong
Feng Li, Alibaba Group
Wenbo Su, Alibaba Group
Pengjie Wang, Alibaba Group
Jian Xu, Alibaba Group
Xiangyu Zhao, City University of Hong Kong