🤖 AI Summary
Diffusion-based multimodal large language models (MLLMs) exhibit strong non-autoregressive multimodal generation capabilities, yet their denoising process requires full bidirectional self-attention over thousands of visual tokens, resulting in O(L³) decoding complexity and severely hampering inference efficiency. To address this, we propose D$^{3}$ToM, a Decider-guided dynamic token merging method: leveraging decider tokens from the previous denoising step to construct an importance map, D$^{3}$ToM adaptively retains the most salient visual tokens while aggregating redundant ones through lightweight, similarity-based merging. The module operates within a single Transformer layer and requires no architectural modifications or parameter changes, making it plug-and-play. Its merge ratio varies dynamically across denoising steps, naturally aligning with the diffusion decoding process. Experiments demonstrate that D$^{3}$ToM significantly accelerates inference while preserving multimodal understanding performance, outperforming existing acceleration methods under equivalent computational budgets. The code is publicly available.
📝 Abstract
Diffusion-based multimodal large language models (Diffusion MLLMs) have recently demonstrated impressive non-autoregressive generative capabilities across vision-and-language tasks. However, Diffusion MLLMs exhibit substantially slower inference than autoregressive models: each denoising step applies full bidirectional self-attention over the entire sequence, resulting in cubic decoding complexity that becomes computationally impractical with thousands of visual tokens. To address this challenge, we propose D$^{3}$ToM, a Decider-guided dynamic token merging method that merges redundant visual tokens at different denoising steps to accelerate inference in Diffusion MLLMs. At each denoising step, D$^{3}$ToM uses decider tokens (the tokens generated in the previous denoising step) to build an importance map over all visual tokens. It then retains a proportion of the most salient tokens and merges the remainder through similarity-based aggregation. This plug-and-play module integrates into a single transformer layer, physically shortening the visual token sequence for all subsequent layers without altering model parameters. Moreover, D$^{3}$ToM employs a merge ratio that varies dynamically across denoising steps, aligning with the native decoding process of Diffusion MLLMs and achieving superior performance under equivalent computational budgets. Extensive experiments show that D$^{3}$ToM accelerates inference while preserving competitive performance. The code is released at https://github.com/bcmi/D3ToM-Diffusion-MLLM.
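The merging step described above (importance map from decider tokens → keep the most salient fraction → fold the rest into their most similar kept tokens) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function name `d3tom_merge`, the use of attention mass from decider tokens as the importance score, and averaging as the aggregation rule are assumptions for illustration.

```python
import numpy as np

def d3tom_merge(visual_tokens, importance, keep_ratio):
    """Shorten a visual token sequence by decider-guided merging (sketch).

    visual_tokens: (L, d) array of visual token embeddings.
    importance:    (L,) importance scores per visual token, e.g. attention
                   mass received from the decider tokens generated in the
                   previous denoising step (assumed scoring rule).
    keep_ratio:    fraction of visual tokens to retain at this denoising
                   step; in the paper this ratio varies across steps.
    Returns (merged_tokens, kept_indices) where merged_tokens has
    int(L * keep_ratio) rows: salient tokens, each averaged with the
    redundant tokens assigned to it by cosine similarity.
    """
    L, d = visual_tokens.shape
    n_keep = max(1, int(L * keep_ratio))

    # Rank tokens by importance; split into kept and to-be-merged sets.
    order = np.argsort(-importance)
    keep_idx, merge_idx = order[:n_keep], order[n_keep:]

    kept = visual_tokens[keep_idx]
    merged = kept.copy()
    counts = np.ones(n_keep)

    # Unit-normalize kept tokens once for cosine similarity.
    kept_unit = kept / np.linalg.norm(kept, axis=1, keepdims=True)
    for i in merge_idx:
        t = visual_tokens[i]
        sims = kept_unit @ (t / np.linalg.norm(t))
        j = int(np.argmax(sims))   # most similar retained token
        merged[j] += t             # aggregate the redundant token into it
        counts[j] += 1
    merged /= counts[:, None]      # average each group of merged tokens

    return merged, keep_idx
```

Because the returned sequence is physically shorter, every subsequent transformer layer attends over fewer visual tokens, which is the source of the speedup; no model weights are modified.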