Nexus-O: An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision

📅 2025-02-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Building AGI-oriented multimodal perception models that unify auditory, visual, and linguistic capabilities, support arbitrary combinations of audio, image, video, and text inputs, and generate either audio or text end-to-end remains a fundamental challenge. Method: We pretrain a unified multimodal architecture on a vision-language model (VLM) foundation rather than a language-model one; introduce a ViT-LLM hybrid backbone, multi-stage modality-alignment distillation, tri-modal joint masked modeling, and a scene-driven speech synthesis pipeline; and construct Nexus-O-audio, a dedicated audio evaluation benchmark. Contribution/Results: The model aligns audio, vision, and language in a shared latent space, significantly outperforms state-of-the-art methods on multimodal understanding, cross-modal generation, and robust ASR, generalizes well in tri-modal reasoning, and supports industrial-grade real-time interactive applications.
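The summary mentions cross-modal latent-space alignment for joint audio-visual-language representation. The paper's actual training objective is not given here; as a purely illustrative sketch, pairwise alignment across three modalities is often written as a symmetric InfoNCE-style contrastive loss over paired embeddings. All function names, the temperature value, and the equal weighting of the three modality pairs below are assumptions, not the paper's method.

```python
import numpy as np

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of paired embeddings.

    a, b: (batch, dim) arrays where row i of `a` is paired with row i of `b`.
    Returns the mean of the a->b and b->a contrastive cross-entropy losses.
    """
    # L2-normalise so the dot product is cosine similarity.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature   # (batch, batch); positives on the diagonal
    labels = np.arange(len(a))

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

def tri_modal_alignment_loss(audio, vision, text):
    """Average the alignment loss over the three modality pairs (assumed weighting)."""
    return (info_nce(audio, text)
            + info_nce(vision, text)
            + info_nce(audio, vision)) / 3.0
```

Correctly paired embeddings (diagonal positives) yield a low loss, while mismatched pairs yield a high one, which is what drives the modalities toward a shared latent space.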

📝 Abstract
Human beings perceive the real world through a spectrum of sensory modalities, encompassing auditory, visual, and linguistic faculties. The journey towards achieving Artificial General Intelligence (AGI) necessitates the development of models that can emulate these multifaceted perceptual capabilities and comprehensively understand such diversified data. To this end, we introduce Nexus-O, an industry-level omni-perceptive and -interactive model capable of efficiently processing audio, image, video, and text data in any combination and outputting audio or text in an end-to-end way. We systematically investigate Nexus-O by addressing three key research questions. First, how can models be efficiently designed and trained to achieve tri-modal alignment, understanding, and reasoning capabilities across multiple modalities? Second, what approaches can be implemented to evaluate tri-modal model robustness, ensuring reliable performance and applicability in real-world scenarios? Third, what strategies can be employed to curate and obtain high-quality, real-life speech datasets? For the first question, we design and pre-train Nexus-O on top of a vision-language model rather than a language model. By pre-training the model on high-quality synthetic audio data, it becomes capable of tri-modal perception and interaction. For the second question, we introduce a new audio testbed, Nexus-O-audio, comprising diverse Automatic Speech Recognition (ASR) samples spanning various real-world scenarios, such as corporate meetings and live streaming. For the third question, we design a speech data synthesis pipeline to obtain high-quality speech training datasets covering various real-world scenarios. Comprehensive experimentation and an in-depth analysis of tri-modal alignment in the latent space demonstrate the advantages of our model on downstream tasks.
Problem

Research questions and friction points this paper is trying to address.

Develop a model for tri-modal alignment across audio, image, and text.
Evaluate model robustness in real-world scenarios using diverse datasets.
Create high-quality speech datasets for training in various real-life contexts.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Omni-perceptive model for audio, image, video, text.
Pre-trained on synthetic audio for tri-modal perception.
New audio testbed for real-world scenario robustness.
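One of the listed contributions is tri-modal joint masked modeling. The paper's concrete masking scheme is not described here; the sketch below shows one common way such an objective is set up, masking a fixed fraction of tokens within each modality segment independently so that no single modality dominates the reconstruction targets. The function name, the mask ratio, and the `-100` ignore-index convention are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_tri_modal(tokens, modality_ids, mask_ratio=0.15, mask_token=-1):
    """Mask a fraction of tokens per modality segment (hypothetical sketch).

    tokens:       1-D int array, the concatenated audio/vision/text token sequence
    modality_ids: 1-D int array of the same length, tagging each token's modality
    Returns (masked_tokens, targets); targets hold the original token at masked
    positions and -100 (a common "ignore" index) everywhere else.
    """
    masked = tokens.copy()
    targets = np.full_like(tokens, -100)
    for m in np.unique(modality_ids):
        idx = np.flatnonzero(modality_ids == m)
        n_mask = max(1, int(len(idx) * mask_ratio))  # at least one per modality
        chosen = rng.choice(idx, size=n_mask, replace=False)
        targets[chosen] = tokens[chosen]   # reconstruction targets
        masked[chosen] = mask_token        # replace input with the mask token
    return masked, targets
```

Masking each modality separately guarantees that every training step carries reconstruction signal for audio, vision, and text alike, rather than letting the longest segment absorb most of the masks.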
👥 Authors
Che Liu, Imperial College London (Multimodal Learning · AI4Medicine)
Yingji Zhang, University of Manchester (Computational Linguistics · Representation Learning · Disentanglement · Multi-modal Learning)
Dong Zhang, HiThink Research, China
Weijie Zhang, University of Kansas Medical Center (Inverse Planning · Particle Therapy)
Chenggong Gong, HiThink Research, China
Haohan Li, HiThink Research, China
Yu Lu, HiThink Research, China
Shilin Zhou, School of Computer Science and Technology, Soochow University (Machine Learning · Natural Language Processing)
Yue Lu, HiThink Research, China
Ziliang Gan, HiThink Research, China
Ziao Wang, Baptist University, HK
Junwei Liao, Microsoft, USA
Haipang Wu, HiThink Research, China
Ji Liu, HiThink Research, China
André Freitas, University of Manchester, UK; Idiap Research Institute, Switzerland
Qifan Wang, Meta AI, USA
Zenglin Xu, Fudan University (Machine Learning · Trustworthy AI · Federated Learning · Large Language Models · Time Series Analysis)
Rongjuncheng Zhang, HiThink Research, China; Zhejiang University, China
Yong Dai, HiThink Research, China; Fudan University, China