🤖 AI Summary
Autoregressive models suffer from high real-time inference latency due to their sequential dependency, and conventional compression techniques, such as pruning and quantization, often incur significant accuracy degradation. To address this, we propose a unified generation-refinement decoding framework. Methodologically, we first establish a taxonomy of generation strategies (including n-gram matching and draft-model-based approaches) and refinement mechanisms (spanning single-step verification and iterative optimization). The framework integrates speculative decoding, multi-round verification, knowledge distillation from draft models, and hardware-aware scheduling for efficient deployment across heterogeneous platforms. Evaluated on text, image, and speech generation tasks, our approach achieves an average 2.1× end-to-end speedup while sustaining minimal accuracy loss (<0.5% in BLEU, CLIP, and FID metrics). This work provides both scalable theoretical foundations and system-level implementation strategies for real-time large language and multimodal model applications.
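The generation-refinement pattern above can be sketched as a draft-then-verify loop: a cheap draft model proposes several tokens at once, and the target model verifies them left to right, keeping the accepted prefix and substituting its own token at the first mismatch. The sketch below is a minimal illustration with toy deterministic "models" (`draft_model`, `target_accepts`, and the +1/skip-multiples-of-7 rules are hypothetical stand-ins, not part of the survey); a real implementation would verify all drafted positions in one batched forward pass of the target model.

```python
def draft_model(prefix, k):
    """Hypothetical cheap draft: propose the next k tokens greedily.
    Toy rule: each proposed token is (previous token + 1) mod 100."""
    out, last = [], prefix[-1]
    for _ in range(k):
        last = (last + 1) % 100
        out.append(last)
    return out

def target_accepts(prefix, token):
    """Toy stand-in for target-model verification: accept a drafted token
    iff it matches the target's own greedy choice. The target follows the
    same +1 rule, except it 'disagrees' at multiples of 7, forcing
    occasional rejections so the refinement path is exercised."""
    expected = (prefix[-1] + 1) % 100
    if expected % 7 == 0:
        expected = (expected + 1) % 100
    return token == expected, expected

def speculative_decode(prefix, n_tokens, k=4):
    """Generation-refinement loop: draft k tokens, verify left to right,
    keep the accepted prefix, and on the first mismatch take the target's
    token and discard the rest of the draft."""
    seq = list(prefix)
    while len(seq) - len(prefix) < n_tokens:
        for tok in draft_model(seq, k):
            ok, corrected = target_accepts(seq, tok)
            seq.append(tok if ok else corrected)
            if not ok:
                break  # rest of the draft is invalid after a rejection
            if len(seq) - len(prefix) >= n_tokens:
                break
    return seq[len(prefix):len(prefix) + n_tokens]
```

Because every emitted token is either verified or produced by the target itself, the output is identical to the target's own greedy decode; the speedup comes from accepting several drafted tokens per (batched) verification step.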
📝 Abstract
Sequential dependencies present a fundamental bottleneck in deploying large-scale autoregressive models, particularly for real-time applications. While traditional optimization approaches such as pruning and quantization often compromise model quality, recent advances in generation-refinement frameworks show that this trade-off can be substantially mitigated. This survey presents a comprehensive taxonomy of generation-refinement frameworks across autoregressive sequence tasks, categorizing methods by their generation strategies (from simple n-gram prediction to sophisticated draft models) and refinement mechanisms (from single-pass verification to iterative approaches). Through a systematic analysis of both algorithmic innovations and system-level implementations, we examine deployment strategies across computing environments and explore applications spanning text, image, and speech generation. This examination of both theoretical frameworks and practical implementations provides a foundation for future research in efficient autoregressive decoding.