The Design Space of Tri-Modal Masked Diffusion Models

📅 2026-02-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of existing diffusion models, which are largely confined to unimodal or bimodal settings and struggle to jointly model text, images, and audio. The authors propose the first tri-modal masked discrete diffusion model, pretrained from scratch with 3 billion parameters on 6.4 trillion tokens, enabling unified generation across all three modalities. Through a systematic study of multimodal scaling laws, modality mixing ratios, noise schedules, and batch-size effects, they introduce an SDE-based reparameterization that decouples physical and logical batch sizes, significantly simplifying hyperparameter tuning. They also design an efficient inference sampling strategy that achieves strong performance on text generation, text-to-image synthesis, and text-to-speech, making this the largest systematic open study of multimodal discrete diffusion models to date.
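The masked discrete diffusion described above corrupts sequences by progressively replacing tokens with a special mask token. A minimal sketch of the forward corruption step, assuming a linear noise schedule and a hypothetical mask-token id (the paper's vocabulary layout and schedule are not specified here):

```python
import random

MASK_ID = 0  # hypothetical mask-token id; not taken from the paper


def mask_tokens(tokens, t, seed=0):
    """Forward corruption for masked discrete diffusion: each token is
    independently replaced by MASK_ID with probability t (noise level),
    where t=0 leaves the sequence intact and t=1 masks everything."""
    rng = random.Random(seed)
    return [MASK_ID if rng.random() < t else tok for tok in tokens]


# At intermediate noise levels, a random subset of tokens is masked;
# the model is trained to recover the original tokens from this input.
noisy = mask_tokens([5, 9, 3, 7], t=0.5)
```

A denoiser trained on such inputs can then generate by starting from an all-mask sequence and iteratively unmasking positions, which is the basis of the inference sampling strategies the paper studies.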

📝 Abstract
Discrete diffusion models have emerged as strong alternatives to autoregressive language models, with recent work initializing from and fine-tuning a unimodal base model for bimodal generation. Diverging from previous approaches, we introduce the first tri-modal masked diffusion model pretrained from scratch on text, image-text, and audio-text data. We systematically analyze multimodal scaling laws, modality mixing ratios, noise schedules, and batch-size effects, and we provide optimized inference sampling defaults. Our batch-size analysis yields a novel stochastic differential equation (SDE)-based reparameterization that eliminates the batch-size tuning reported in recent work. This reparameterization decouples the physical batch size, typically chosen under compute constraints (GPU saturation, FLOP efficiency, wall-clock time), from the logical batch size, chosen to balance gradient variance during stochastic optimization. Finally, we pretrain a preliminary 3B-parameter tri-modal model on 6.4T tokens, demonstrating the capabilities of a unified design and achieving strong results on text generation, text-to-image, and text-to-speech tasks. Our work represents the largest-scale systematic open study of multimodal discrete diffusion models conducted to date, providing insights into scaling behaviors across multiple modalities.
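The physical/logical batch-size decoupling the abstract describes can be illustrated with plain gradient accumulation: hardware processes physical micro-batches, while the optimizer sees the gradient of the full logical batch. This sketch does not reproduce the paper's SDE reparameterization; the function names and the scalar-gradient toy are hypothetical, and only the invariance to the physical batch size is being demonstrated:

```python
def train_on_logical_batch(grad_fn, samples, physical_batch_size):
    """Split a logical batch into physical micro-batches, accumulate
    per-sample-weighted gradients, and return the single gradient the
    optimizer would apply for the whole logical batch."""
    micro_batches = [samples[i:i + physical_batch_size]
                     for i in range(0, len(samples), physical_batch_size)]
    total = 0.0
    for mb in micro_batches:
        # grad_fn returns the mean gradient over a micro-batch; weight by
        # its size so uneven final micro-batches are handled correctly.
        total += grad_fn(mb) * len(mb)
    return total / len(samples)


# Toy "gradient": the mean of the samples. The result is identical for
# any physical batch size, so the physical size can be picked purely for
# hardware efficiency while the logical size controls gradient variance.
mean_grad = lambda mb: sum(mb) / len(mb)
samples = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
g = train_on_logical_batch(mean_grad, samples, physical_batch_size=2)
```

In this framing, only the logical batch size affects the optimization trajectory; the paper's contribution is removing even that as a quantity requiring dedicated tuning.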
Problem

Research questions and friction points this paper is trying to address.

tri-modal
masked diffusion models
multimodal scaling laws
discrete diffusion
pretraining
Innovation

Methods, ideas, or system contributions that make the work stand out.

tri-modal diffusion
masked diffusion model
multimodal scaling laws
SDE reparameterization
discrete diffusion