OmniRAG-Agent: Agentic Omnimodal Reasoning for Low-Resource Long Audio-Video Question Answering

📅 2026-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses long-form audio-video question answering under low-resource settings, where dense encoding is costly, fine-grained retrieval is weak, proactive planning is limited, and there is no clear end-to-end optimization. The authors propose OmniRAG-Agent, an agentic omnimodal framework that combines image-audio retrieval-augmented generation, a multi-turn agent loop, and cross-modal tool calls. They also apply group relative policy optimization (GRPO) to jointly improve tool use and answer quality. On the OmniVideoBench, WorldSense, and Daily-Omni benchmarks, the method consistently outperforms prior approaches under low-resource settings, and ablation studies validate each component.
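As a rough illustration of the group-relative idea behind GRPO, the sketch below samples several rollouts for one question, scores each, and normalizes every score against its own group. The reward shaping (`combined_reward`, `call_penalty`) is a hypothetical stand-in; the summary does not state how the paper weighs tool use against answer quality.

```python
import statistics


def combined_reward(answer_correct, num_tool_calls, call_penalty=0.1):
    # Hypothetical reward: 1 for a correct answer, minus a small cost per tool call.
    # The paper's actual reward definition is not given in this summary.
    return (1.0 if answer_correct else 0.0) - call_penalty * num_tool_calls


def group_relative_advantages(rewards, eps=1e-8):
    # Normalize each reward by the mean and standard deviation of its own group.
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four illustrative rollouts for one question: (answer_correct, num_tool_calls).
rollouts = [(True, 2), (True, 5), (False, 1), (False, 4)]
rewards = [combined_reward(ok, calls) for ok, calls in rollouts]
advantages = group_relative_advantages(rewards)
```

Under this assumed reward, a correct answer reached with fewer tool calls gets the largest advantage in its group, which is one way a single signal could push on both tool use and answer quality.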

📝 Abstract
Long-horizon omnimodal question answering requires reasoning over text, images, audio, and video. Despite recent progress in OmniLLMs, low-resource long audio-video QA still suffers from costly dense encoding, weak fine-grained retrieval, limited proactive planning, and no clear end-to-end optimization. To address these issues, we propose OmniRAG-Agent, an agentic omnimodal QA method for budgeted long audio-video reasoning. It builds an image-audio retrieval-augmented generation module that lets an OmniLLM fetch short, relevant frames and audio snippets from external banks. It also uses an agent loop that plans, calls tools across turns, and merges retrieved evidence to answer complex queries. Finally, we apply group relative policy optimization to jointly improve tool use and answer quality over time. Experiments on OmniVideoBench, WorldSense, and Daily-Omni show that OmniRAG-Agent consistently outperforms prior methods under low-resource settings and achieves strong results, with ablations validating each component.
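The retrieval-plus-agent-loop pattern described above can be sketched in a few lines. This is a toy illustration, not the authors' implementation: the bank layout, the two-dimensional embeddings, and the `scripted_plan` function are all assumptions, and in the real system the OmniLLM itself would decide which tool to call on each turn.

```python
import math


def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(bank, query_vec, k=1):
    # Return the k bank entries whose embeddings are closest to the query.
    return sorted(bank, key=lambda e: cosine(e["vec"], query_vec), reverse=True)[:k]


def agent_loop(query_vec, banks, plan, max_turns=3):
    # Each turn: the planner picks a tool (a bank name) or stops; evidence accumulates.
    evidence = []
    for turn in range(max_turns):
        tool = plan(turn, evidence)
        if tool is None:
            break
        evidence.extend(retrieve(banks[tool], query_vec))
    return evidence


def scripted_plan(turn, evidence):
    # Hypothetical planner: look at frames first, then audio, then stop.
    return ["image", "audio"][turn] if turn < 2 else None


# Made-up external banks with toy 2-D embeddings.
banks = {
    "image": [{"id": "frame_a", "vec": [0.0, 1.0]}, {"id": "frame_b", "vec": [1.0, 0.1]}],
    "audio": [{"id": "clip_a", "vec": [0.9, 0.2]}, {"id": "clip_b", "vec": [0.1, 1.0]}],
}
evidence = agent_loop([1.0, 0.0], banks, scripted_plan)
evidence_ids = [e["id"] for e in evidence]
```

Only the short retrieved items reach the model in this sketch, which reflects the budgeted setting: the full video and audio stay in the external banks rather than being densely encoded.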
Problem

Research questions and friction points this paper is trying to address.

low-resource
long audio-video question answering
omnimodal reasoning
retrieval-augmented generation
end-to-end optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

OmniRAG-Agent
agentic reasoning
retrieval-augmented generation
multimodal QA
low-resource optimization