🤖 AI Summary
This work addresses the challenge of building open-source large language models (LLMs) that can perceive across arbitrary modalities. The authors propose OmniVinci, an omni-modal LLM whose architecture combines three innovations: OmniAlignNet, which aligns vision and audio embeddings in a shared omni-modal latent space; Temporal Embedding Grouping, which captures relative temporal alignment between vision and audio signals; and Constrained Rotary Time Embedding, which encodes absolute temporal information. Methodologically, the work couples this unified latent-space alignment with hybrid relative and absolute temporal encoding, and develops a scalable data curation and synthesis pipeline covering 24 million single-modal and omni-modal conversations. Trained on only 0.2 trillion tokens, OmniVinci surpasses Qwen2.5-Omni by +19.05, +1.7, and +3.9 points on the DailyOmni, MMAR, and Video-MME benchmarks, respectively. Extensive evaluations further demonstrate strong generalization in real-world applications, including robotics, medical AI, and smart manufacturing, validating its practical efficacy across diverse domains.
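The summary describes aligning paired vision and audio embeddings in a shared latent space. A common way to train such an alignment is a symmetric contrastive (InfoNCE-style) objective over matched pairs; the sketch below is an illustrative assumption, not OmniAlignNet's actual loss, and all function names here are hypothetical:

```python
import math

def info_nce(vision, audio, temperature=0.07):
    """Symmetric contrastive loss over paired vision/audio embeddings.

    vision, audio: equal-length lists of vectors; vision[i] and audio[i]
    are a matched pair. Hedged sketch of a CLIP-style objective; the
    paper's exact alignment loss may differ.
    """
    def norm(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]

    v = [norm(x) for x in vision]
    a = [norm(x) for x in audio]
    # Cosine-similarity logits, scaled by temperature.
    sims = [[sum(p * q for p, q in zip(vi, aj)) / temperature for aj in a]
            for vi in v]

    def ce(row, target):
        # Numerically stable cross-entropy: -log softmax(row)[target].
        m = max(row)
        logsum = m + math.log(sum(math.exp(s - m) for s in row))
        return logsum - row[target]

    n = len(v)
    loss_v2a = sum(ce(sims[i], i) for i in range(n)) / n          # vision -> audio
    loss_a2v = sum(ce([sims[j][i] for j in range(n)], i)          # audio -> vision
                   for i in range(n)) / n
    return 0.5 * (loss_v2a + loss_a2v)
```

With well-aligned embeddings (each pair identical, pairs mutually orthogonal) the loss approaches zero, which is the intended training signal.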
📝 Abstract
Advancing machine intelligence requires developing the ability to perceive across multiple modalities, much as humans sense the world. We introduce OmniVinci, an initiative to build a strong, open-source, omni-modal LLM. We carefully study the design choices across model architecture and data curation. For model architecture, we present three key innovations: (i) OmniAlignNet for strengthening alignment between vision and audio embeddings in a shared omni-modal latent space; (ii) Temporal Embedding Grouping for capturing relative temporal alignment between vision and audio signals; and (iii) Constrained Rotary Time Embedding for encoding absolute temporal information in omni-modal embeddings. We introduce a curation and synthesis pipeline that generates 24M single-modal and omni-modal conversations. We find that modalities reinforce one another in both perception and reasoning. Our model, OmniVinci, outperforms Qwen2.5-Omni with +19.05 on DailyOmni (cross-modal understanding), +1.7 on MMAR (audio), and +3.9 on Video-MME (vision), while using just 0.2T training tokens, a sixfold reduction compared to Qwen2.5-Omni's 1.2T. We finally demonstrate omni-modal advantages in downstream applications spanning robotics, medical AI, and smart factories.
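Innovation (iii) encodes absolute temporal information with a rotary scheme. The sketch below shows a generic RoPE-style rotation driven by absolute timestamps; the "constraint" here, normalizing time by a maximum clip duration so every channel pair rotates at most one full turn, is an assumption for illustration, not the paper's exact formulation:

```python
import math

def rotary_time_embedding(vec, t, max_duration=60.0, base=10000.0):
    """Rotate channel pairs of `vec` by angles derived from timestamp `t`.

    vec: embedding of even length; pairs are (vec[i], vec[half + i]).
    t: absolute timestamp in seconds.
    max_duration: assumed cap used to bound rotation angles (hypothetical
    constraint, not the paper's exact rule).
    """
    half = len(vec) // 2
    out = [0.0] * len(vec)
    for i in range(half):
        freq = base ** (-i / half)  # geometric frequency ladder, RoPE-style
        # Assumed constraint: normalize time so each pair completes at most
        # one full 2*pi rotation over max_duration.
        angle = 2 * math.pi * freq * min(t, max_duration) / max_duration
        a, b = vec[i], vec[half + i]
        out[i] = a * math.cos(angle) - b * math.sin(angle)
        out[half + i] = a * math.sin(angle) + b * math.cos(angle)
    return out
```

Because each pair is rotated rather than translated, the embedding's norm is preserved and a timestamp of zero leaves the vector unchanged, which is the usual appeal of rotary encodings over additive ones.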