🤖 AI Summary
This work addresses the challenge of modeling the joint distribution of multiple modalities in any-to-any generation (e.g., text→image, text→audio, audio→image). We propose OmniFlow, a novel framework built upon rectified flow (RF). Methodologically, we extend rectified flow to a multimodal formulation; design a plug-and-play modular architecture enabling flexible cross-modal alignment and joint generation across text, image, and audio; develop a guidance mechanism that controls cross-modal alignment; and adopt a two-stage strategy of individual pretraining followed by merged fine-tuning. Furthermore, we systematically characterize design principles for rectified flow Transformers at large scale. Experiments demonstrate that OmniFlow significantly outperforms existing any-to-any generative models on text→image and text→audio tasks, establishing an efficient, controllable, and scalable unified multimodal generation paradigm.
📝 Abstract
We introduce OmniFlow, a novel generative model designed for any-to-any generation tasks such as text-to-image, text-to-audio, and audio-to-image synthesis. OmniFlow advances the rectified flow (RF) framework used in text-to-image models to handle the joint distribution of multiple modalities, and it outperforms previous any-to-any models on a wide range of tasks, such as text-to-image and text-to-audio synthesis. Our work offers three key contributions. First, we extend RF to a multi-modal setting and introduce a novel guidance mechanism, enabling users to flexibly control the alignment between different modalities in the generated outputs. Second, we propose a novel architecture that extends the text-to-image MMDiT architecture of Stable Diffusion 3 to enable audio and text generation; the extended modules can be efficiently pretrained individually and merged with the vanilla text-to-image MMDiT for fine-tuning. Lastly, we conduct a comprehensive study of the design choices of rectified flow transformers for large-scale audio and text generation, providing valuable insights into optimizing performance across diverse modalities. The code will be available at https://github.com/jacklishufan/OmniFlows.
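To make the first contribution concrete, the sketch below illustrates what "extending rectified flow to a multi-modal setting" can look like at the training-objective level: each modality is noised along a straight path between Gaussian noise and data, and a joint model regresses the per-modality velocities. This is a minimal NumPy illustration of the general rectified-flow objective, not OmniFlow's actual implementation; the `model` interface, the per-modality timesteps, and the unweighted sum of losses are all assumptions made for clarity.

```python
import numpy as np

def rf_interpolate(x1, x0, t):
    """Straight-line interpolation x_t = t * x1 + (1 - t) * x0.

    x1 is a clean data sample, x0 is Gaussian noise, t is in [0, 1].
    Rectified flow trains a velocity field along this path.
    """
    return t * x1 + (1.0 - t) * x0

def rf_velocity_target(x1, x0):
    """Regression target: the constant velocity of the straight path."""
    return x1 - x0

def multimodal_rf_loss(model, batch, rng):
    """Hedged sketch of a multimodal rectified-flow training step.

    batch maps modality names (e.g. "image", "audio") to clean samples.
    Each modality gets its own noise and, in this sketch, its own
    timestep; `model` is a hypothetical joint velocity predictor that
    returns one velocity estimate per modality.
    """
    noised, targets, timesteps = {}, {}, {}
    for name, x1 in batch.items():
        x0 = rng.standard_normal(x1.shape)       # modality-specific noise
        t = rng.uniform()                        # modality-specific timestep
        noised[name] = rf_interpolate(x1, x0, t)
        targets[name] = rf_velocity_target(x1, x0)
        timesteps[name] = t
    preds = model(noised, timesteps)             # joint prediction across modalities
    # Unweighted sum of per-modality MSE losses (a simplifying assumption).
    return sum(np.mean((preds[m] - targets[m]) ** 2) for m in batch)
```

At `t = 0` the interpolant is pure noise and at `t = 1` it is the clean sample, so sampling amounts to integrating the learned velocity field from noise toward data, per modality.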