🤖 AI Summary
Existing masked-generation methods struggle with alignment and training instability in multimodal settings, hindering unified generative modeling of discrete (e.g., text) and continuous (e.g., image) data. This work proposes the CoM-DAD framework, which introduces a hierarchical dual-process generative mechanism: it first models the cross-modal semantic manifold via continuous latent diffusion, then uses this semantic representation as a prior to generate concrete tokens through a discrete absorbing diffusion process with a variable-rate noise schedule. The approach integrates coupled manifold-aware discrete absorbing diffusion, adaptive noise scheduling, and a Stochastic Mixed-Modal Transport strategy, achieving efficient cross-modal alignment without relying on heavy contrastive dual encoders. Experiments show that the method significantly improves training stability and achieves superior generation quality and semantic coherence in unified text-to-image synthesis, offering a scalable new paradigm for multimodal generation.
📝 Abstract
The bifurcation of generative modeling into autoregressive approaches for discrete data (text) and diffusion approaches for continuous data (images) hinders the development of truly unified multimodal systems. While Masked Language Models (MLMs) offer efficient bidirectional context, they traditionally lack the generative fidelity of autoregressive models and the semantic continuity of diffusion models. Furthermore, extending masked generation to multimodal settings introduces severe alignment challenges and training instability. In this work, we propose \textbf{CoM-DAD} (\textbf{Co}upled \textbf{M}anifold \textbf{D}iscrete \textbf{A}bsorbing \textbf{D}iffusion), a novel probabilistic framework that reformulates multimodal generation as a hierarchical dual process. CoM-DAD decouples high-level semantic planning from low-level token synthesis. First, we model the semantic manifold via a continuous latent diffusion process; second, we treat token generation as a discrete absorbing diffusion process, regulated by a \textbf{Variable-Rate Noise Schedule}, conditioned on these evolving semantic priors. Crucially, we introduce a \textbf{Stochastic Mixed-Modal Transport} strategy that aligns disparate modalities without requiring heavy contrastive dual-encoders. Our method demonstrates superior stability over standard masked modeling, establishing a new paradigm for scalable, unified text-image generation.
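To make the discrete side of the framework concrete, the forward process of an absorbing diffusion can be sketched as follows. The abstract does not specify the exact form of the Variable-Rate Noise Schedule, so the polynomial `mask_rate` below (and the `gamma` parameter, the `forward_absorb` function name, and the `[MASK]` token) are illustrative assumptions, not the paper's definitions: each token independently transitions into the absorbing mask state with a time-dependent probability, reaching full masking at the final step.

```python
import random

MASK = "[MASK]"  # absorbing state; all tokens end up here at t = T

def mask_rate(t, T, gamma=2.0):
    # Hypothetical variable-rate schedule: masking probability grows
    # polynomially in t; gamma = 1 recovers the standard linear schedule.
    return (t / T) ** (1.0 / gamma)

def forward_absorb(tokens, t, T, rng):
    # Forward step of absorbing diffusion: each token independently
    # jumps to the absorbing MASK state with probability mask_rate(t, T).
    p = mask_rate(t, T)
    return [MASK if rng.random() < p else tok for tok in tokens]

rng = random.Random(0)
tokens = ["a", "cat", "on", "a", "mat"]
print(forward_absorb(tokens, t=2, T=10, rng=rng))   # mostly intact
print(forward_absorb(tokens, t=10, T=10, rng=rng))  # fully masked
```

In CoM-DAD the reverse (denoising) direction of this process would additionally be conditioned on the continuous semantic latent, so the model predicts original tokens for masked positions given both the partially masked sequence and the semantic prior; that conditioning is not shown here.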