🤖 AI Summary
Existing multimodal large language models (MLLMs) are constrained by fixed modality combinations and by their reliance on extensive aligned training data, limiting unified understanding and complex cross-modal reasoning across text, images, audio, and video. To address this, we propose Agent-Omni, a fine-tuning-free, multi-model collaboration framework built around a master-agent architecture. The master agent dynamically decomposes tasks, orchestrates modality-specific agents (e.g., LLMs, vision-language models, audio-language models), and fuses their outputs, enabling end-to-end, interpretable, and scalable joint multimodal inference. This design removes rigid modality-pair constraints and substantially strengthens cross-modal comprehension and generation. Evaluated on comprehensive multimodal benchmarks, including MMBench, VideoMME, and AudioMM, our method achieves state-of-the-art performance, demonstrating both effectiveness and strong generalization across diverse modalities and tasks.
📝 Abstract
Multimodal large language models (MLLMs) have shown strong capabilities but remain limited to fixed modality pairs and require costly fine-tuning with large aligned datasets. Building fully omni-capable models that can integrate text, images, audio, and video remains impractical, and such models still lack robust reasoning support. In this paper, we propose an Agent-Omni framework that coordinates existing foundation models through a master-agent system, enabling flexible multimodal reasoning without retraining. The master agent interprets user intent, delegates subtasks to modality-specific agents, and integrates their outputs into coherent responses. Extensive experiments across text, image, audio, video, and omni benchmarks show that Agent-Omni consistently achieves state-of-the-art performance, particularly on tasks requiring complex cross-modal reasoning. Its agent-based design enables seamless integration of specialized foundation models, ensuring adaptability to diverse inputs while maintaining transparency and interpretability. In addition, the framework is modular and easily extensible, allowing future improvements as stronger models become available.
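The decompose-delegate-fuse loop the abstract describes can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the `MasterAgent` class, its methods, and the stub modality agents are all hypothetical names, and real agents would wrap foundation models rather than lambdas.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class MasterAgent:
    # Maps a modality name to a specialist agent (any text-in, text-out callable).
    # In the actual framework these would be vision-, audio-, or video-language models.
    agents: Dict[str, Callable[[str], str]] = field(default_factory=dict)

    def register(self, modality: str, agent: Callable[[str], str]) -> None:
        self.agents[modality] = agent

    def decompose(self, query: str, modalities: List[str]) -> Dict[str, str]:
        # Naive task decomposition: one subtask per input modality.
        # The real master agent would interpret user intent with an LLM.
        return {m: f"Answer '{query}' using the {m} input" for m in modalities}

    def run(self, query: str, modalities: List[str]) -> str:
        subtasks = self.decompose(query, modalities)
        # Delegate each subtask to its modality-specific agent.
        partial = {m: self.agents[m](task) for m, task in subtasks.items()}
        # Fuse the partial answers into one response (here: simple concatenation;
        # the paper's fusion step would instead synthesize a coherent answer).
        return " | ".join(f"[{m}] {ans}" for m, ans in partial.items())

master = MasterAgent()
master.register("image", lambda task: "a dog on a beach")
master.register("audio", lambda task: "waves and barking")
print(master.run("What is happening?", ["image", "audio"]))
```

Because agents are registered dynamically, adding a new modality requires no retraining, which mirrors the modularity claim above: swapping in a stronger vision or audio model is one `register` call.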