OmniGAIA: Towards Native Omni-Modal AI Agents

📅 2026-02-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language models are largely confined to bi-modal interactions and lack unified understanding across video, audio, images, and language, as well as advanced reasoning and tool-use capabilities. To address this limitation, this work introduces OmniGAIA, a comprehensive evaluation benchmark, alongside OmniAtlas, the first natively designed omni-modal agent framework. The benchmark synthesizes multi-hop cross-modal tasks from omni-modal event graphs, while the agent combines a tool-integrated reasoning paradigm, a hindsight-guided tree exploration strategy, and OmniDPO, a fine-grained error-correction technique, to enable active perception and multi-turn collaborative tool execution. Experimental results demonstrate that the proposed approach significantly improves the reasoning and tool-use performance of open-source models on complex cross-modal tasks, advancing the practical deployment of omni-modal AI assistants in real-world scenarios.
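
The summary only names the event-graph mechanism, so here is a minimal sketch of how multi-hop cross-modal queries could be sampled from such a graph. Every name in it (the `Event` schema, `sample_multi_hop_query`, the question template) is a hypothetical illustration, not the paper's released implementation.

```python
# Hypothetical sketch: sampling a multi-hop query from an omni-modal
# event graph. Node schema and sampling logic are assumptions, not
# taken from the OmniGAIA paper.
import random
from dataclasses import dataclass, field

@dataclass
class Event:
    eid: str
    modality: str                              # "video" | "audio" | "image" | "text"
    description: str
    links: list = field(default_factory=list)  # eids of related events

def sample_multi_hop_query(graph: dict, hops: int = 3) -> dict:
    """Walk up to `hops` edges, preferring cross-modal links,
    and compose a compound question over the visited events."""
    path = [random.choice(list(graph.values()))]
    while len(path) < hops + 1:
        # Prefer edges that cross a modality boundary, so the
        # composed query cannot be answered from one modality alone.
        candidates = [graph[e] for e in path[-1].links
                      if graph[e].modality != path[-1].modality]
        if not candidates:
            break
        path.append(random.choice(candidates))
    question = " then ".join(
        f"locate the {ev.modality} event ({ev.description})" for ev in path[:-1]
    ) + f", and answer using the final {path[-1].modality} clue."
    return {"question": question, "gold_path": [ev.eid for ev in path]}
```

The property mirrored here is that each hop crosses a modality boundary, so answering the composed question requires chaining evidence across video, audio, and image rather than retrieving a single fact.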

📝 Abstract
Human intelligence naturally intertwines omni-modal perception -- spanning vision, audio, and language -- with complex reasoning and tool usage to interact with the world. However, current multi-modal LLMs are primarily confined to bi-modal interactions (e.g., vision-language), lacking the unified cognitive capabilities required for general AI assistants. To bridge this gap, we introduce OmniGAIA, a comprehensive benchmark designed to evaluate omni-modal agents on tasks necessitating deep reasoning and multi-turn tool execution across video, audio, and image modalities. Constructed via a novel omni-modal event graph approach, OmniGAIA synthesizes complex, multi-hop queries derived from real-world data that require cross-modal reasoning and external tool integration. Furthermore, we propose OmniAtlas, a native omni-modal foundation agent under a tool-integrated reasoning paradigm with active omni-modal perception. Trained on trajectories synthesized via a hindsight-guided tree exploration strategy and refined with OmniDPO for fine-grained error correction, OmniAtlas effectively enhances the tool-use capabilities of existing open-source models. This work marks a step towards next-generation native omni-modal AI assistants for real-world scenarios.
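
The abstract describes OmniDPO only as a fine-grained error-correction method. Assuming it builds on standard Direct Preference Optimization (DPO; Rafailov et al., 2023), the underlying contrastive objective looks roughly like the sketch below; this is vanilla sequence-level DPO and does not reflect the paper's actual formulation.

```python
# Sketch of a DPO-style preference loss. OmniDPO's exact objective is
# not specified here; this is standard DPO over whole trajectories.
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen: torch.Tensor,        # policy log-probs, shape (B,)
             logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor,    # frozen reference-model log-probs
             ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Implicit rewards are log-ratios against the reference model.
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    # Maximize the margin between preferred and dispreferred trajectories.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```

A "fine-grained" variant would presumably apply this contrast at the level of individual erroneous tool calls or reasoning steps rather than whole trajectories, which is where OmniDPO would depart from this sketch.
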
Problem

Research questions and friction points this paper is trying to address.

omni-modal
multi-modal LLMs
complex reasoning
tool usage
cross-modal reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

omni-modal
tool-integrated reasoning
event graph
OmniDPO
multi-hop reasoning