🤖 AI Summary
To address the imbalance between inference efficiency and generation fidelity caused by static residual transformations in autoregressive large language model (LLM) decoding, this paper proposes a mixture of multi-rate residuals (M2R2) mechanism. First, it introduces the novel dimension of "residual evolution velocity" to model dynamic residual behavior, moving beyond conventional static assumptions based on layer distance. Second, it designs token-level velocity prediction and multi-rate gating to enable early residual alignment. Third, it pioneers joint optimization of residual evolution velocity with MoE expert preloading, effectively mitigating expert-switching bottlenecks. On MT-Bench, the method achieves up to a 2.8× decoding speedup, outperforming state-of-the-art approaches such as Medusa; under MoE configurations, acceleration reaches 2.9×. It significantly improves the latency-quality trade-off on benchmarks including Koala and WizardLM.
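As a rough illustration of the second point (token-level velocity prediction with multi-rate gating), the PyTorch sketch below scales each token's residual update by a predicted per-token velocity. The module name, the discrete rate set, and the soft-gating formulation are assumptions made here for illustration, not the paper's exact parameterization:

```python
import torch
import torch.nn as nn

class MultiRateResidualBlock(nn.Module):
    """Transformer sub-block whose residual update is scaled by a learned,
    per-token velocity. Illustrative only: the layer shapes, the discrete
    rate set, and the soft gating are assumptions, not the paper's design."""

    def __init__(self, d_model: int, rates=(0.5, 1.0, 2.0)):
        super().__init__()
        self.transform = nn.Linear(d_model, d_model)     # stand-in for attention/MLP
        self.rate_head = nn.Linear(d_model, len(rates))  # token-level velocity predictor
        self.register_buffer("rates", torch.tensor(rates))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, d_model)
        gate = torch.softmax(self.rate_head(h), dim=-1)           # (B, S, n_rates)
        velocity = (gate * self.rates).sum(dim=-1, keepdim=True)  # (B, S, 1)
        # "Fast" tokens take larger residual steps, so their hidden state
        # aligns with its final value in fewer layers; "slow" tokens keep
        # evolving gradually through the full depth.
        return h + velocity * self.transform(h)
```

Soft gating over a small set of discrete rates keeps the velocity choice differentiable during training while still letting inference round to a single rate per token.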
📝 Abstract
Residual transformations enhance the representational depth and expressive power of large language models (LLMs). However, applying static residual transformations across all tokens in auto-regressive generation leads to a suboptimal trade-off between inference efficiency and generation fidelity. Existing methods, including Early Exiting, Skip Decoding, and Mixture-of-Depths, address this by modulating the residual transformation based on token-level complexity. Nevertheless, these approaches predominantly consider the distance traversed by tokens through the model layers, neglecting the underlying velocity of residual evolution. We introduce Mixture of Multi-rate Residuals (M2R2), a framework that dynamically modulates residual velocity to improve early alignment, enhancing inference efficiency. Evaluations on reasoning-oriented tasks such as Koala, Self-Instruct, WizardLM, and MT-Bench show that M2R2 surpasses state-of-the-art distance-based strategies, balancing generation quality and speedup. In a self-speculative decoding setup, M2R2 achieves up to 2.8× speedups on MT-Bench, outperforming methods like 2-model speculative decoding, Medusa, LookAhead Decoding, and DEED. In Mixture-of-Experts (MoE) architectures, integrating early residual alignment with ahead-of-time expert loading into high-bandwidth memory (HBM) accelerates decoding, reduces expert-switching bottlenecks, and achieves a 2.9× speedup, making it highly effective in resource-constrained environments.
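The MoE result hinges on overlapping expert movement with computation: because residuals align early under M2R2, the router's expert choice can be predicted before the MoE layer is reached, and the chosen experts staged into HBM ahead of time. Below is a minimal sketch of that overlap; the `early_layers`/`late_layers` split, the top-2 routing, and the `expert_cache.stage` API are hypothetical stand-ins, not the paper's implementation:

```python
import torch

def decode_step_with_prefetch(model, h, router, expert_cache):
    # Early layers: under M2R2, the residual here is already close to its
    # final state, so the MoE routing decision can be guessed early.
    h_early = model.early_layers(h)
    predicted = router(h_early).topk(k=2, dim=-1).indices  # top-2 routing assumed

    # Stage the predicted experts into HBM on a side CUDA stream so the
    # weight copies overlap with the remaining transformer layers.
    prefetch = torch.cuda.Stream()
    with torch.cuda.stream(prefetch):
        for e in predicted.unique().tolist():
            expert_cache.stage(e, non_blocking=True)  # hypothetical cache API

    h_late = model.late_layers(h_early)  # compute runs while weights copy
    torch.cuda.current_stream().wait_stream(prefetch)  # sync before expert use
    return model.moe_layer(h_late, expert_cache)
```

If the early routing guess is wrong, the affected experts are simply loaded on demand as in a standard MoE step, so the prefetch only ever removes latency rather than changing outputs.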