🤖 AI Summary
This work addresses the limitations of multimodal diffusion models on structured reasoning tasks such as text-to-image generation, where continuous visual representations make it hard to transfer the recursive and latent reasoning capabilities of language models. Inspired by modular human cognition, the authors integrate a recursive sparse mixture-of-experts (MoE) framework into the joint attention layers of diffusion models. A gating network dynamically selects expert modules, so that hidden states in the diffusion latent space are refined iteratively over multiple steps. This approach is the first to integrate recursive sparse reasoning into the latent space of multimodal diffusion models. It shares parameters efficiently and computes with dynamically chosen modules, while working around the constraint that continuous visual tokens place on structured reasoning. Experiments report improvements in image generation quality and semantic consistency on class-conditional ImageNet generation and on the GenEval and DPG benchmarks, supporting the method’s effectiveness in enhancing multimodal reasoning.
📝 Abstract
Diffusion models have achieved success in high-fidelity data synthesis, yet their capacity for more complex, structured reasoning, as required by text-following tasks, remains constrained. While advances in language models have leveraged strategies such as latent reasoning and recursion to enhance text understanding, extending these strategies to multimodal text-to-image generation is challenging because visual tokens are continuous rather than discrete. To tackle this problem, we draw inspiration from modular human cognition and propose a recursive, sparse mixture-of-experts framework integrated into conventional diffusion models. Our approach introduces a recursive component within joint attention layers that iteratively refines visual tokens over multiple latent steps while efficiently sharing parameters via sparse selection of neural modules. At each step, a gating network dynamically selects specialized neural modules, conditioned on the current visual tokens, the diffusion timestep, and the conditioning information. Comprehensive evaluation on class-conditioned ImageNet image generation and additional studies on the GenEval and DPG benchmarks demonstrate the superiority of the proposed method in improving image generation performance.
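The mechanism described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `RecursiveSparseMoE`, the top-k routing rule, the single linear map per expert, the residual update, and every hyperparameter are assumptions chosen for illustration, and the joint attention computation is omitted entirely. It only shows a gate that scores experts from the current token, a timestep embedding, and a conditioning vector, with the same expert pool reused across several latent refinement steps.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def softmax(xs):
    m = max(xs)
    es = [math.exp(v - m) for v in xs]
    total = sum(es)
    return [e / total for e in es]


def timestep_embedding(t, dim):
    """Simple sinusoidal embedding (an assumed choice, not from the paper)."""
    return [math.sin(t / (10000 ** (i / dim))) for i in range(dim)]


class RecursiveSparseMoE:
    """Hypothetical sketch: one shared expert pool reused at every latent step."""

    def __init__(self, dim, num_experts=4, top_k=2, num_steps=3, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        self.num_steps = num_steps
        # Assumption: each expert is one linear map; real experts would be larger.
        self.experts = [rand_matrix(dim, dim, rng) for _ in range(num_experts)]
        # The gate sees [token, timestep embedding, condition] concatenated.
        self.gate = rand_matrix(num_experts, 3 * dim, rng)

    def route(self, token, t_emb, cond):
        """Return (expert index, weight) pairs for the top-k experts."""
        logits = matvec(self.gate, token + t_emb + cond)
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        top = order[: self.top_k]
        weights = softmax([logits[i] for i in top])
        return list(zip(top, weights))

    def step(self, token, t_emb, cond):
        """One refinement step: residual plus weighted outputs of chosen experts."""
        out = list(token)
        for idx, weight in self.route(token, t_emb, cond):
            expert_out = matvec(self.experts[idx], token)
            out = [o + weight * math.tanh(v) for o, v in zip(out, expert_out)]
        return out

    def forward(self, tokens, t_emb, cond):
        # The same experts and gate are applied at every step (parameter sharing);
        # routing is recomputed because the tokens change between steps.
        for _ in range(self.num_steps):
            tokens = [self.step(tok, t_emb, cond) for tok in tokens]
        return tokens


# Toy usage with random inputs standing in for visual tokens and conditioning.
rng = random.Random(1)
dim = 8
tokens = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(4)]
cond = [rng.gauss(0.0, 1.0) for _ in range(dim)]
t_emb = timestep_embedding(10, dim)
moe = RecursiveSparseMoE(dim)
refined = moe.forward(tokens, t_emb, cond)
routes = moe.route(tokens[0], t_emb, cond)
```

Only `top_k` of the experts run for each token at each step, which is where the sparsity comes from. Reusing one pool across steps keeps the parameter count independent of the number of refinement steps.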