🤖 AI Summary
Current unified multimodal models predominantly rely on autoregressive architectures, which struggle to jointly optimize perception and generation capabilities. Moreover, hybrid or decoupled designs suffer from redundancy and poor generalization, limiting performance on cross-modal retrieval and multi-turn interactive tasks. To address these limitations, we propose a unified multimodal modeling paradigm based on discrete flow matching: we construct probabilistic paths in a metric space and introduce kinetic-energy-optimal velocity estimation, enabling native bidirectional translation across arbitrary modalities and seamless multi-turn interaction. The model is jointly trained on interleaved text, image, video, and audio data. It achieves state-of-the-art performance across comprehensive multimodal understanding and generation benchmarks, with particularly notable gains in cross-modal retrieval and multi-turn dialogue. All code, data protocols, and model weights are publicly released.
📝 Abstract
Next-generation multimodal foundation models capable of any-to-any cross-modal generation and multi-turn interaction will serve as core components of artificial general intelligence systems, playing a pivotal role in human-machine interaction. However, most existing multimodal models remain constrained by autoregressive architectures, whose inherent limitations prevent a balanced integration of understanding and generation capabilities. Although hybrid and decoupling strategies have been explored to address these tasks separately within unified frameworks, their redundant, non-integrated designs limit their applicability to broader scenarios such as cross-modal retrieval. In this work, we introduce NExT-OMNI, an open-source omnimodal foundation model that achieves unified modeling through discrete flow paradigms. By leveraging metric-induced probability paths and kinetic-optimal velocities, NExT-OMNI natively supports any-to-any understanding and generation with enhanced response efficiency, while enabling broader application scenarios through concise unified representations rather than task-decoupled designs. Trained on large-scale interleaved text, image, video, and audio data, NExT-OMNI delivers competitive performance on multimodal understanding and generation benchmarks, and outperforms prior unified models in multi-turn multimodal interaction and cross-modal retrieval, highlighting its architectural advantages as a next-generation multimodal foundation model. To advance further research, we release training details and data protocols, and open-source both the code and model checkpoints.
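The abstract does not spell out the discrete flow formulation, but its general sampling style can be illustrated with a toy sketch: sequence positions start in a masked "noise" state and are progressively revealed along a probability path whose schedule controls the reveal rate at each step. Everything below — the cosine schedule `kappa`, the `dfm_sample` sampler, and `toy_model` — is a hypothetical, minimal illustration of a mask-based discrete flow sampler, not NExT-OMNI's actual implementation.

```python
import math
import random

MASK = -1  # placeholder "noise" token; assumption: mask-based corruption path


def kappa(t: float) -> float:
    """Schedule kappa(t): fraction of tokens revealed at time t (0 = noise, 1 = data).
    A cosine schedule is one common choice (an assumption, not the paper's schedule)."""
    return 1.0 - math.cos(0.5 * math.pi * t)


def dfm_sample(model, length: int, steps: int = 20, seed: int = 0):
    """Euler-style sampler along a mask-to-data probability path.
    `model(xs)` is assumed to return, per position, a distribution over clean tokens."""
    rng = random.Random(seed)
    xs = [MASK] * length  # start from the all-noise state at t = 0
    for i in range(steps):
        t, t_next = i / steps, (i + 1) / steps
        # Probability that a still-masked position is revealed in this step,
        # derived from the schedule increment relative to the remaining mass.
        p_reveal = (kappa(t_next) - kappa(t)) / max(1.0 - kappa(t), 1e-9)
        probs = model(xs)
        for j in range(length):
            if xs[j] == MASK and rng.random() < p_reveal:
                # Sample a clean token from the model's posterior at position j.
                r, acc = rng.random(), 0.0
                for tok, p in enumerate(probs[j]):
                    acc += p
                    if r < acc:
                        xs[j] = tok
                        break
    # Safety net: reveal any leftover masks at t = 1 via argmax.
    probs = model(xs)
    for j in range(length):
        if xs[j] == MASK:
            xs[j] = max(range(len(probs[j])), key=lambda k: probs[j][k])
    return xs


def toy_model(xs, vocab=5):
    """Stand-in for the learned network: predicts token (j % vocab) with high confidence."""
    out = []
    for j, _ in enumerate(xs):
        dist = [0.0001] * vocab
        dist[j % vocab] = 1.0 - 0.0001 * (vocab - 1)
        out.append(dist)
    return out


sample = dfm_sample(lambda xs: toy_model(xs, vocab=5), length=8, steps=16)
```

Because kappa(1) = 1, the final step reveals all remaining masked positions, so the sampler always terminates with a fully denoised sequence regardless of the step count.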