Agent-Omni: Test-Time Multimodal Reasoning via Model Coordination for Understanding Anything

📅 2025-11-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing multimodal large language models (MLLMs) are constrained by fixed modality combinations and a reliance on large amounts of aligned training data, which limits unified understanding and complex cross-modal reasoning across text, images, audio, and video. To address this, the paper proposes Agent-Omni, a fine-tuning-free, multi-model collaboration framework built around a master agent. The master agent dynamically decomposes each task, orchestrates modality-specific agents (e.g., LLMs, vision-language models, audio-language models), and fuses their outputs, enabling end-to-end, interpretable, and scalable joint multimodal inference without rigid modality-pair constraints. Evaluated across text, image, audio, video, and omni benchmarks, the approach achieves state-of-the-art performance, particularly on tasks requiring complex cross-modal reasoning, and generalizes well across diverse modalities and tasks.

📝 Abstract
Multimodal large language models (MLLMs) have shown strong capabilities but remain limited to fixed modality pairs and require costly fine-tuning with large aligned datasets. Building fully omni-capable models that can integrate text, images, audio, and video remains impractical and lacks robust reasoning support. In this paper, we propose an Agent-Omni framework that coordinates existing foundation models through a master-agent system, enabling flexible multimodal reasoning without retraining. The master agent interprets user intent, delegates subtasks to modality-specific agents, and integrates their outputs into coherent responses. Extensive experiments across text, image, audio, video, and omni benchmarks show that Agent-Omni consistently achieves state-of-the-art performance, particularly on tasks requiring complex cross-modal reasoning. Its agent-based design enables seamless integration of specialized foundation models, ensuring adaptability to diverse inputs while maintaining transparency and interpretability. In addition, the framework is modular and easily extensible, allowing future improvements as stronger models become available.
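
To make the coordination pattern concrete, below is a minimal Python sketch of a master agent that interprets a query, delegates per-modality subtasks, and fuses the partial answers. The class names, routing logic, and stub agents are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Subtask:
    modality: str           # e.g. "text", "image", "audio", "video"
    instruction: str        # natural-language instruction for the specialist agent
    payload: object = None  # the raw input handed to that agent

class MasterAgent:
    def __init__(self, agents: Dict[str, Callable[[Subtask], str]]):
        # agents maps a modality name to a callable wrapping a foundation model
        self.agents = agents

    def plan(self, query: str, inputs: Dict[str, object]) -> List[Subtask]:
        # Interpret user intent and decompose it into modality-specific subtasks.
        # A real system would use an LLM here; this stub fans out one subtask
        # per provided modality.
        return [Subtask(m, f"Answer with respect to the {m} input: {query}", x)
                for m, x in inputs.items()]

    def fuse(self, query: str, partial_answers: Dict[str, str]) -> str:
        # Integrate per-modality findings into one coherent response.
        # In practice this step is itself an LLM call; concatenation stands in here.
        evidence = "\n".join(f"[{m}] {a}" for m, a in partial_answers.items())
        return f"Question: {query}\nEvidence:\n{evidence}"

    def answer(self, query: str, inputs: Dict[str, object]) -> str:
        subtasks = self.plan(query, inputs)
        partial = {t.modality: self.agents[t.modality](t)
                   for t in subtasks if t.modality in self.agents}
        return self.fuse(query, partial)

# Usage with stub agents standing in for vision-, audio-, and text-capable models.
master = MasterAgent({
    "image": lambda t: "a dog catching a frisbee",            # vision-language model stub
    "audio": lambda t: "crowd cheering; a whistle near the end",  # audio-language model stub
    "text":  lambda t: "the caption mentions a park",          # text LLM stub
})
print(master.answer("What is happening in this clip?",
                    {"image": "frame.png", "audio": "clip.wav", "text": "caption text"}))
```

In the framework described by the abstract, the planning and fusion steps would themselves be handled by an LLM, and each stub would wrap a specialized foundation model, so stronger models can be slotted in without retraining.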
Problem

Research questions and friction points this paper is trying to address.

Enabling flexible multimodal reasoning without model retraining
Overcoming the fixed modality-pair constraints of existing MLLMs
Integrating diverse inputs such as text, images, audio, and video
Innovation

Methods, ideas, or system contributions that make the work stand out.

Coordinates existing foundation models through a master-agent system
Enables flexible multimodal reasoning without retraining (see the registry sketch after this list)
Integrates specialized foundation models for cross-modal tasks
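
The modularity claim can be illustrated with another small, hypothetical sketch: modality agents sit in a plain registry, so a stronger model can replace an existing one, or a new modality can be added, without retraining anything upstream. The registry layout and agent names below are assumptions for illustration, not part of the paper.

```python
from typing import Callable, Dict

# An agent takes a natural-language instruction and returns a textual finding.
AgentFn = Callable[[str], str]

# Hypothetical registry of wrapped foundation models, keyed by modality.
registry: Dict[str, AgentFn] = {
    "image": lambda instr: "finding from vision-language model v1",
    "audio": lambda instr: "finding from audio-language model v1",
}

# Swap in a stronger image model and add a video agent; the master agent's
# routing only looks modalities up in this registry, so nothing else changes.
registry["image"] = lambda instr: "finding from a stronger vision-language model v2"
registry["video"] = lambda instr: "finding from a newly added video-language model"

for modality, agent in registry.items():
    print(modality, "->", agent("describe the key moment"))
```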
Huawei Lin
Ph.D. Student, Rochester Institute of Technology
Generative AI, LLMs, DMs, Scalable ML, Trustworthy ML
Yunzhi Shi
Amazon
Tong Geng
University of Rochester
Weijie Zhao
Rochester Institute of Technology
Wei Wang
Amazon
Ravender Pal Singh
Amazon