OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows

📅 2024-12-02
🏛️ arXiv.org
📈 Citations: 3
Influential: 0
🤖 AI Summary
This work addresses the challenge of joint distribution modeling in multimodal any-to-any generation (e.g., text→image, text→audio, audio→image). It proposes OmniFlow, a novel framework that extends rectified flow to multiple modalities. Methodologically, the authors introduce a multimodal rectified flow formulation; design a plug-and-play modular architecture enabling flexible cross-modal alignment and joint generation across text, image, and audio; develop a modality-conditional guidance mechanism; and adopt a two-stage strategy of individual pretraining followed by merged fine-tuning. They also systematically characterize design principles for rectified-flow Transformers at large scale. Experiments show that OmniFlow significantly outperforms existing any-to-any generative models on text→image and text→audio tasks, establishing an efficient, controllable, and scalable unified multimodal generation paradigm.

📝 Abstract
We introduce OmniFlow, a novel generative model designed for any-to-any generation tasks such as text-to-image, text-to-audio, and audio-to-image synthesis. OmniFlow advances the rectified flow (RF) framework used in text-to-image models to handle the joint distribution of multiple modalities. It outperforms previous any-to-any models on a wide range of tasks, such as text-to-image and text-to-audio synthesis. Our work offers three key contributions: First, we extend RF to a multi-modal setting and introduce a novel guidance mechanism, enabling users to flexibly control the alignment between different modalities in the generated outputs. Second, we propose a novel architecture that extends the text-to-image MMDiT architecture of Stable Diffusion 3 and enables audio and text generation. The extended modules can be efficiently pretrained individually and merged with the vanilla text-to-image MMDiT for fine-tuning. Lastly, we conduct a comprehensive study on the design choices of rectified flow transformers for large-scale audio and text generation, providing valuable insights into optimizing performance across diverse modalities. The Code will be available at https://github.com/jacklishufan/OmniFlows.
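The abstract's core idea is extending the rectified flow (RF) objective from a single modality to the joint distribution of several. A minimal sketch of what such a joint objective could look like is below; `predict_velocity` stands in for a hypothetical joint velocity network, and all names, shapes, and the choice of a shared timestep are illustrative assumptions, not OmniFlow's actual implementation.

```python
import numpy as np

def multimodal_rf_loss(predict_velocity, latents, rng):
    """Sketch of a rectified-flow objective over a joint distribution
    of modality latents (e.g. text, image, audio).

    latents: dict mapping modality name -> latent array, batch-first.
    predict_velocity: callable (noisy_latents, t) -> dict of velocities.
    """
    batch = next(iter(latents.values())).shape[0]
    # One timestep shared across modalities (an illustrative choice).
    t = rng.random(batch)
    noisy, targets = {}, {}
    for name, x in latents.items():
        eps = rng.standard_normal(x.shape)
        t_ = t.reshape(-1, *([1] * (x.ndim - 1)))
        # Rectified flow interpolates data and noise along a straight
        # path: x_t = (1 - t) * x + t * eps, with velocity eps - x.
        noisy[name] = (1.0 - t_) * x + t_ * eps
        targets[name] = eps - x
    v_pred = predict_velocity(noisy, t)  # dict: modality -> velocity
    # Sum per-modality MSE between predicted and straight-path velocity.
    return sum(np.mean((v_pred[n] - targets[n]) ** 2) for n in latents)
```

Because the network sees the noisy latents of all modalities jointly, any subset can act as the condition while the rest are generated, which is what makes any-to-any sampling possible in this kind of formulation.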
Problem

Research questions and friction points this paper is trying to address.

How to extend rectified flow beyond text-to-image to model the joint distribution of multiple modalities for any-to-any generation
How to design an architecture that supports text, image, and audio synthesis without training a unified model from scratch
How to choose design parameters for rectified flow transformers so they perform well across diverse modalities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends RF to a multi-modal setting with a novel guidance mechanism for controlling cross-modal alignment
Extends the Stable Diffusion 3 MMDiT architecture with audio and text generation modules that can be pretrained individually and merged for fine-tuning
Provides a comprehensive study of rectified flow transformer design choices for large-scale audio and text generation
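The guidance mechanism lets users weight how strongly the output aligns with each conditioning modality. One plausible reading, sketched below, applies a classifier-free-guidance-style combination independently per modality; the function name, the dict-based interface, and the per-modality scales are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def modality_guidance(v_uncond, v_cond, scales):
    """Combine unconditional and conditional velocity predictions
    with a separate guidance scale per modality.

    v_uncond, v_cond: dicts mapping modality name -> velocity array.
    scales: dict mapping modality name -> guidance scale (default 1.0).
    """
    guided = {}
    for name, vu in v_uncond.items():
        s = scales.get(name, 1.0)
        # Standard CFG form: v = v_uncond + s * (v_cond - v_uncond);
        # s = 1 recovers the conditional prediction, s > 1 sharpens it.
        guided[name] = vu + s * (v_cond[name] - vu)
    return guided
```

Exposing one scale per modality is what would let a user, say, trade off faithfulness to a text prompt against faithfulness to a reference audio clip in the same sampling run.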