🤖 AI Summary
To address the prohibitive computational overhead of unified multimodal understanding and generation, where ever-longer contexts of interleaved multimodal tokens inflate the cost of both diffusion denoising and autoregressive decoding, this paper proposes Hyper-Bagel, a unified acceleration framework that speeds up understanding and generation jointly. The method follows a divide-and-conquer strategy: speculative decoding accelerates next-token prediction, while a multi-stage knowledge distillation pipeline, including adversarial distillation and human-feedback-driven optimization, accelerates diffusion denoising. Experiments demonstrate over a 2× speedup on multimodal understanding and up to a 22× speedup on generation: a lossless 6-NFE (number of function evaluations) diffusion model preserves the original model's output quality, and a 1-NFE model enables near real-time interactive editing and generation. This work establishes a scalable, general-purpose acceleration paradigm for efficient multimodal systems in the era of large foundation models.
📝 Abstract
Unified multimodal models have recently attracted considerable attention for their remarkable abilities in jointly understanding and generating diverse content. However, as contexts integrate increasingly numerous interleaved multimodal tokens, the iterative processes of diffusion denoising and autoregressive decoding impose significant computational overhead. To address this, we propose Hyper-Bagel, a unified acceleration framework designed to simultaneously speed up both multimodal understanding and generation tasks. Our approach uses a divide-and-conquer strategy, employing speculative decoding for next-token prediction and a multi-stage distillation process for diffusion denoising. The framework delivers substantial performance gains, achieving over a 2x speedup in multimodal understanding. For generative tasks, our resulting lossless 6-NFE model yields a 16.67x speedup in text-to-image generation and a 22x speedup in image editing, all while preserving the high-quality output of the original model. We further develop a highly efficient 1-NFE model that enables near real-time interactive editing and generation. By combining advanced adversarial distillation with human feedback learning, this model achieves ultimate cost-effectiveness and responsiveness, making complex multimodal interactions seamless and instantaneous.
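The speculative-decoding half of the divide-and-conquer strategy can be illustrated with a minimal greedy sketch (an assumption-laden toy, not the paper's implementation): a cheap draft model proposes a few tokens, and the expensive target model verifies them, accepting the longest matching prefix. Under greedy decoding this is lossless, since the output is identical to running the target model alone; the toy `target` and `draft` functions below are hypothetical stand-ins for real models.

```python
def speculative_decode(target_next, draft_next, prompt, max_new, k=4):
    """Greedy speculative decoding sketch.

    target_next / draft_next: fn(token_list) -> next token (greedy).
    The draft model proposes k tokens; the target model verifies them
    (one parallel pass in practice; emulated sequentially here).
    """
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        # 1) Draft model proposes k tokens autoregressively (cheap).
        draft, ctx = [], list(seq)
        for _ in range(k):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) Target model checks each proposed position.
        accepted = 0
        for i in range(k):
            expect = target_next(seq + draft[:i])
            if draft[i] == expect:
                accepted += 1
            else:
                seq.extend(draft[:accepted])
                seq.append(expect)  # keep the target's correction token
                break
        else:
            seq.extend(draft)  # all k proposals accepted
        seq[:] = seq[:len(prompt) + max_new]
    return seq

# Toy deterministic "models": the target emits (last + 1) mod 10; the
# draft agrees everywhere except after token 5, where it guesses wrong.
target = lambda s: (s[-1] + 1) % 10
draft = lambda s: 0 if s[-1] == 5 else (s[-1] + 1) % 10

print(speculative_decode(target, draft, [1], max_new=6))
# -> [1, 2, 3, 4, 5, 6, 7], matching the target model run alone
```

When the draft model mostly agrees with the target, several tokens are committed per expensive verification step, which is the source of the understanding-side speedup described above; acceptance rate, not model quality, is what the draft model trades away.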