OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM

📅 2025-10-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of building open-source large language models (LLMs) capable of processing arbitrary modalities. The authors propose OmniVinci, an omni-modal LLM whose architecture combines three innovations: OmniAlignNet, which aligns vision and audio embeddings in a shared omni-modal latent space; Temporal Embedding Grouping, which captures relative temporal alignment between modalities; and Constrained Rotary Time Embedding, which encodes absolute temporal information. They also develop a scalable data curation and synthesis pipeline covering 24 million single-modal and omni-modal conversations. Trained on only 0.2 trillion tokens (a 6x reduction versus Qwen2.5-Omni's 1.2T), OmniVinci surpasses Qwen2.5-Omni by +19.05 on DailyOmni (cross-modal understanding), +1.7 on MMAR (audio), and +3.9 on Video-MME (vision). Further evaluations demonstrate strong generalization in downstream applications, including robotics, medical AI, and smart manufacturing.

📝 Abstract
Advancing machine intelligence requires developing the ability to perceive across multiple modalities, much as humans sense the world. We introduce OmniVinci, an initiative to build a strong, open-source, omni-modal LLM. We carefully study the design choices across model architecture and data curation. For model architecture, we present three key innovations: (i) OmniAlignNet for strengthening alignment between vision and audio embeddings in a shared omni-modal latent space; (ii) Temporal Embedding Grouping for capturing relative temporal alignment between vision and audio signals; and (iii) Constrained Rotary Time Embedding for encoding absolute temporal information in omni-modal embeddings. We introduce a curation and synthesis pipeline that generates 24M single-modal and omni-modal conversations. We find that modalities reinforce one another in both perception and reasoning. Our model, OmniVinci, outperforms Qwen2.5-Omni with +19.05 on DailyOmni (cross-modal understanding), +1.7 on MMAR (audio), and +3.9 on Video-MME (vision), while using just 0.2T training tokens - a 6 times reduction compared to Qwen2.5-Omni's 1.2T. We finally demonstrate omni-modal advantages in downstream applications spanning robotics, medical AI, and smart factory.
Problem

Research questions and friction points this paper is trying to address.

How to enhance omni-modal understanding jointly across vision and audio
How to design architectures that achieve temporal alignment in multimodal data
How to train efficiently, improving performance while using far fewer tokens
Innovation

Methods, ideas, or system contributions that make the work stand out.

OmniAlignNet aligns vision and audio embeddings in shared space
Temporal Embedding Grouping captures relative temporal alignment
Constrained Rotary Time Embedding encodes absolute temporal information
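The three mechanisms above can be illustrated with a minimal sketch. This is not the paper's implementation: the window size, the clipping of timestamps to a maximum range (my reading of "constrained"), and the rotary base are all assumptions made for illustration.

```python
import math

def constrained_rotary_time_embedding(embedding, t, t_max, base=10000.0):
    """Rotate consecutive dimension pairs by an angle proportional to the
    absolute timestamp t. The timestamp is clipped to [0, t_max]
    (hypothetical reading of the 'constrained' part)."""
    t = min(max(t, 0.0), t_max)  # constrain absolute time
    d = len(embedding)
    out = list(embedding)
    for i in range(0, d - 1, 2):
        theta = t / (base ** (i / d))  # lower frequency for later dims
        c, s = math.cos(theta), math.sin(theta)
        x, y = embedding[i], embedding[i + 1]
        out[i] = x * c - y * s
        out[i + 1] = x * s + y * c
    return out

def group_by_time(tokens, window):
    """Temporal grouping sketch: bucket (timestamp, embedding) pairs into
    fixed windows so vision and audio tokens from the same interval are
    placed together, giving relative temporal alignment."""
    groups = {}
    for t, emb in tokens:
        groups.setdefault(int(t // window), []).append(emb)
    return [groups[k] for k in sorted(groups)]
```

Because the rotary transform is a pure rotation, it injects time information without changing embedding norms, so downstream attention sees position-aware but magnitude-preserved features; the grouping step simply interleaves modalities by shared time bucket.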