🤖 AI Summary
Multimodal large language model pretraining faces escalating data and computational costs. To address this, we propose Mixture-of-Transformers (MoT), a modality-aware sparse multimodal Transformer architecture: it decouples non-embedding parameters (feed-forward networks, attention matrices, and layer normalization) by modality, enabling joint modeling of text, images, and speech while preserving global self-attention over the full input sequence. Tokens are routed deterministically to the parameters of their own modality, so per-token FLOPs remain essentially independent of the number of modalities. In the Chameleon 7B setting, MoT matches the dense baseline's performance using only 55.8% of the FLOPs, and when extended to speech it reaches comparable speech performance with only 37.2% of the FLOPs; in the Transfusion setting, a 7B MoT matches the dense baseline's image-generation performance at one third of the FLOPs, and a 760M MoT outperforms a 1.4B dense model on key image-generation metrics. In system profiling, MoT reaches dense-baseline image quality in 47.2% of the wall-clock time and text quality in 75.6%.
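To make the parameter decoupling concrete, below is a minimal PyTorch sketch of one such layer. The class name `MoTLayer`, the pre-norm residual layout, the GELU feed-forward, and the toy dimensions in the usage example are illustrative assumptions, not the paper's implementation; the point it shows is that layer norms, attention projections, and feed-forward networks are selected per token by modality, while attention scores are still computed over the entire mixed-modality sequence.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoTLayer(nn.Module):
    """One transformer layer with non-embedding parameters decoupled by modality (sketch)."""

    def __init__(self, d_model: int, n_heads: int, n_modalities: int = 2):
        super().__init__()
        self.n_heads = n_heads
        # One copy of each non-embedding parameter group per modality:
        # layer norms, attention projections, and the feed-forward network.
        self.ln_attn = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(n_modalities)])
        self.qkv = nn.ModuleList([nn.Linear(d_model, 3 * d_model) for _ in range(n_modalities)])
        self.proj = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_modalities)])
        self.ln_ffn = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(n_modalities)])
        self.ffn = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
            for _ in range(n_modalities)
        ])

    @staticmethod
    def _route(modules, x, modality, out_dim=None):
        """Apply modality m's module to exactly the tokens whose modality id is m."""
        out_dim = x.shape[-1] if out_dim is None else out_dim
        out = x.new_zeros(*x.shape[:-1], out_dim)
        for m, module in enumerate(modules):
            mask = modality == m  # (batch, seq) boolean mask for modality m
            if mask.any():
                out[mask] = module(x[mask])
        return out

    def forward(self, x, modality):
        # x: (batch, seq, d_model); modality: (batch, seq) integer ids (e.g. 0 = text, 1 = image).
        b, s, d = x.shape
        h = self._route(self.ln_attn, x, modality)
        q, k, v = self._route(self.qkv, h, modality, out_dim=3 * d).chunk(3, dim=-1)
        heads = lambda t: t.view(b, s, self.n_heads, d // self.n_heads).transpose(1, 2)
        # Global self-attention: every token attends over the full mixed-modality sequence.
        attn = F.scaled_dot_product_attention(heads(q), heads(k), heads(v), is_causal=True)
        attn = attn.transpose(1, 2).reshape(b, s, d)
        x = x + self._route(self.proj, attn, modality)
        x = x + self._route(self.ffn, self._route(self.ln_ffn, x, modality), modality)
        return x


# Toy usage: a mixed text/image sequence of 8 tokens.
layer = MoTLayer(d_model=64, n_heads=4, n_modalities=2)
tokens = torch.randn(1, 8, 64)
modality_ids = torch.tensor([[0, 0, 0, 1, 1, 1, 1, 0]])  # 0 = text, 1 = image
print(layer(tokens, modality_ids).shape)  # torch.Size([1, 8, 64])
```

Because each token only ever touches one modality's copy of these weights, adding a modality multiplies the parameter count of the decoupled groups without adding per-token compute.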
📝 Abstract
The development of large language models (LLMs) has expanded to multi-modal systems capable of processing text, images, and speech within a unified framework. Training these models demands significantly larger datasets and computational resources compared to text-only LLMs. To address the scaling challenges, we introduce Mixture-of-Transformers (MoT), a sparse multi-modal transformer architecture that significantly reduces pretraining computational costs. MoT decouples non-embedding parameters of the model by modality -- including feed-forward networks, attention matrices, and layer normalization -- enabling modality-specific processing with global self-attention over the full input sequence. We evaluate MoT across multiple settings and model scales. In the Chameleon 7B setting (autoregressive text-and-image generation), MoT matches the dense baseline's performance using only 55.8% of the FLOPs. When extended to include speech, MoT reaches speech performance comparable to the dense baseline with only 37.2% of the FLOPs. In the Transfusion setting, where text and image are trained with different objectives, a 7B MoT model matches the image modality performance of the dense baseline with one third of the FLOPs, and a 760M MoT model outperforms a 1.4B dense baseline across key image generation metrics. System profiling further highlights MoT's practical benefits, achieving dense baseline image quality in 47.2% of the wall-clock time and text quality in 75.6% of the wall-clock time (measured on AWS p4de.24xlarge instances with NVIDIA A100 GPUs).
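As a rough illustration of the FLOPs argument, the back-of-the-envelope sketch below uses assumed toy dimensions (a 4096-wide model with a 16384-wide FFN and three modalities); the numbers are illustrative and not taken from the paper.

```python
# Why decoupling parameters by modality leaves per-token FLOPs unchanged:
# each token is processed by exactly one modality's weights, so only the
# parameter count grows with the number of modalities.
d_model, d_ff, n_modalities = 4096, 16384, 3  # assumed toy dimensions

ffn_params_dense = 2 * d_model * d_ff              # one shared FFN (two weight matrices)
ffn_params_mot = n_modalities * ffn_params_dense   # one FFN per modality

flops_per_token_dense = 2 * ffn_params_dense       # ~2 FLOPs per weight (multiply + add)
flops_per_token_mot = 2 * ffn_params_dense         # each token still uses a single FFN

print(f"FFN parameters : dense={ffn_params_dense:,}  MoT={ffn_params_mot:,}")
print(f"FFN FLOPs/token: dense={flops_per_token_dense:,}  MoT={flops_per_token_mot:,}")
```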