D$^{3}$ToM: Decider-Guided Dynamic Token Merging for Accelerating Diffusion MLLMs

📅 2025-11-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Diffusion-based multimodal large language models (MLLMs) exhibit strong non-autoregressive multimodal generation capabilities, yet their denoising process requires full bidirectional self-attention over thousands of visual tokens, resulting in O(L³) decoding complexity that severely hampers inference efficiency. To address this, the authors propose D$^{3}$ToM, a decider-guided dynamic token merging method: leveraging decider tokens from the previous denoising step to construct an importance map, D$^{3}$ToM adaptively retains the most salient visual tokens while aggregating the redundant ones. It employs a lightweight, similarity-based token merging mechanism operating within a single Transformer layer and requires no architectural modifications. Because its merge ratio varies with the denoising step, the method aligns naturally with the diffusion process and is plug-and-play. Experiments demonstrate that D$^{3}$ToM significantly accelerates inference while preserving both image generation fidelity and multimodal understanding performance, outperforming existing acceleration methods under equivalent computational budgets. The code is publicly available.

📝 Abstract
Diffusion-based multimodal large language models (Diffusion MLLMs) have recently demonstrated impressive non-autoregressive generative capabilities across vision-and-language tasks. However, Diffusion MLLMs exhibit substantially slower inference than autoregressive models: each denoising step employs full bidirectional self-attention over the entire sequence, resulting in cubic decoding complexity that becomes computationally impractical with thousands of visual tokens. To address this challenge, we propose D$^{3}$ToM, a Decider-guided dynamic token merging method that merges redundant visual tokens at different denoising steps to accelerate inference in Diffusion MLLMs. At each denoising step, D$^{3}$ToM uses decider tokens (the tokens generated in the previous denoising step) to build an importance map over all visual tokens. It then retains a proportion of the most salient tokens and merges the remainder through similarity-based aggregation. This plug-and-play module integrates into a single Transformer layer, physically shortening the visual token sequence for all subsequent layers without altering model parameters. Moreover, D$^{3}$ToM employs a merge ratio that dynamically varies with each denoising step and aligns with the native decoding process of Diffusion MLLMs, achieving superior performance under equivalent computational budgets. Extensive experiments show that D$^{3}$ToM accelerates inference while preserving competitive performance. The code is released at https://github.com/bcmi/D3ToM-Diffusion-MLLM.
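The pipeline the abstract describes (score visual tokens against the decider tokens, retain the most salient, merge the rest by similarity) can be sketched roughly as follows. This is a minimal NumPy sketch under assumptions of our own: the function name, array shapes, and the use of dot-product scores as the importance map are illustrative, not the paper's actual implementation.

```python
import numpy as np

def d3tom_merge(visual_tokens, decider_tokens, keep_ratio):
    """Illustrative decider-guided token merging (hypothetical shapes).

    visual_tokens:  (L, d) visual token embeddings
    decider_tokens: (M, d) tokens generated at the previous denoising step
    keep_ratio:     fraction of visual tokens retained unmerged
    """
    L, _ = visual_tokens.shape
    # 1. Importance map: mean similarity of each visual token to the
    #    decider tokens (dot product as a stand-in for attention scores).
    scores = (visual_tokens @ decider_tokens.T).mean(axis=1)  # (L,)

    # 2. Keep the top-k most salient tokens; mark the rest for merging.
    k = max(1, int(round(keep_ratio * L)))
    order = np.argsort(-scores)
    keep_idx, merge_idx = order[:k], order[k:]
    kept = visual_tokens[keep_idx]
    if merge_idx.size == 0:
        return kept

    # 3. Similarity-based aggregation: each redundant token is assigned to
    #    its most cosine-similar kept token, and each kept token becomes
    #    the mean of its group, shortening the sequence to length k.
    def normalize(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

    sim = normalize(visual_tokens[merge_idx]) @ normalize(kept).T  # (L-k, k)
    assign = sim.argmax(axis=1)

    merged = kept.copy()
    for j in range(k):
        group = visual_tokens[merge_idx[assign == j]]
        if group.size:
            merged[j] = np.concatenate([kept[j:j + 1], group]).mean(axis=0)
    return merged
```

Because the merged sequence physically replaces the original visual tokens, every subsequent Transformer layer attends over only k tokens instead of L, which is where the speedup comes from.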
Problem

Research questions and friction points this paper is trying to address.

Accelerating slow inference in Diffusion MLLMs
Reducing computational complexity from visual tokens
Dynamically merging redundant tokens during denoising
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamically merges redundant visual tokens during denoising
Uses decider tokens to build importance maps for tokens
Employs dynamic merge ratio varying with each denoising step
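The dynamic merge ratio means the retained fraction is a function of the denoising step rather than a constant. The paper's actual schedule is not given here; as a purely illustrative assumption, a linear ramp in the kept fraction could look like:

```python
def keep_ratio_schedule(step, num_steps, r_min=0.25, r_max=0.9):
    """Hypothetical step-dependent keep ratio: merge aggressively at early
    denoising steps and retain more visual tokens as decoding refines.
    The linear form and the r_min/r_max endpoints are assumptions."""
    t = step / max(1, num_steps - 1)  # progress in [0, 1]
    return r_min + (r_max - r_min) * t
```

Any monotone schedule with the same endpoints would plug into the merging step the same way; the choice trades early-step speed against late-step fidelity.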
Shuochen Chang
MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University
Xiaofeng Zhang
MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University
Qingyang Liu
MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University
Li Niu
Shanghai Jiao Tong University
computer vision · machine learning · deep learning