🤖 AI Summary
Existing multimodal large language models (MLLMs) are constrained by fixed modality combinations and by their reliance on extensive aligned training data, limiting unified understanding and complex cross-modal reasoning across text, images, audio, and video. To address this, we propose Agent-Omni, a fine-tuning-free, multi-model collaboration framework built around a master-agent architecture. The master agent dynamically decomposes tasks, orchestrates modality-specific agents (e.g., LLMs, vision-language models, audio-language models), and fuses their outputs, enabling end-to-end, interpretable, and scalable joint multimodal inference. This design removes rigid modality-pair constraints and substantially strengthens cross-modal comprehension and generation. Evaluated on comprehensive multimodal benchmarks, including MMBench, VideoMME, and AudioMM, our method achieves state-of-the-art performance, demonstrating both effectiveness and strong generalization across diverse modalities and tasks.
📝 Abstract
Multimodal large language models (MLLMs) have shown strong capabilities but remain limited to fixed modality pairs and require costly fine-tuning with large aligned datasets. Building fully omni-capable models that can integrate text, images, audio, and video remains impractical, and such models still lack robust reasoning support. In this paper, we propose an Agent-Omni framework that coordinates existing foundation models through a master-agent system, enabling flexible multimodal reasoning without retraining. The master agent interprets user intent, delegates subtasks to modality-specific agents, and integrates their outputs into coherent responses. Extensive experiments across text, image, audio, video, and omni benchmarks show that Agent-Omni consistently achieves state-of-the-art performance, particularly on tasks requiring complex cross-modal reasoning. Its agent-based design enables seamless integration of specialized foundation models, ensuring adaptability to diverse inputs while maintaining transparency and interpretability. In addition, the framework is modular and easily extensible, allowing future improvements as stronger models become available.
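The decompose-delegate-fuse loop the abstract describes can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the `MasterAgent` class, its methods, and the stub modality agents are all hypothetical names, and real agents would wrap foundation models rather than lambdas.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class MasterAgent:
    # Maps a modality name to a specialist agent (any text-in, text-out callable).
    # In the actual framework these would be vision-, audio-, or video-language models.
    agents: Dict[str, Callable[[str], str]] = field(default_factory=dict)

    def register(self, modality: str, agent: Callable[[str], str]) -> None:
        self.agents[modality] = agent

    def decompose(self, query: str, modalities: List[str]) -> Dict[str, str]:
        # Naive task decomposition: one subtask per input modality.
        # The real master agent would interpret user intent with an LLM.
        return {m: f"Answer '{query}' using the {m} input" for m in modalities}

    def run(self, query: str, modalities: List[str]) -> str:
        subtasks = self.decompose(query, modalities)
        # Delegate each subtask to its modality-specific agent.
        partial = {m: self.agents[m](task) for m, task in subtasks.items()}
        # Fuse the partial answers into one response (here: simple concatenation;
        # the paper's fusion step would instead synthesize a coherent answer).
        return " | ".join(f"[{m}] {ans}" for m, ans in partial.items())

master = MasterAgent()
master.register("image", lambda task: "a dog on a beach")
master.register("audio", lambda task: "waves and barking")
print(master.run("What is happening?", ["image", "audio"]))
```

Because agents are registered dynamically, adding a new modality requires no retraining, which mirrors the modularity claim above: swapping in a stronger vision or audio model is one `register` call.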