🤖 AI Summary
This work addresses the challenges of long-form audio-visual question answering under low-resource settings, including high encoding costs, weak fine-grained retrieval, and the absence of proactive planning and end-to-end optimization. To tackle these issues, the authors propose an agent-based, fully multimodal reasoning framework that integrates image-audio retrieval-augmented generation, a multimodal agent recurrence mechanism, and cross-modal tool invocation. Furthermore, they introduce group relative policy optimization to jointly enhance both tool utilization efficiency and answer quality. The proposed method significantly outperforms existing approaches on the OmniVideoBench, WorldSense, and Daily-Omni benchmarks, and ablation studies confirm the effectiveness of each component in the framework.
📝 Abstract
Long-horizon omnimodal question answering requires reasoning over text, images, audio, and video. Despite recent progress on OmniLLMs, low-resource long audio-video QA still suffers from costly dense encoding, weak fine-grained retrieval, limited proactive planning, and no clear end-to-end optimization. To address these issues, we propose OmniRAG-Agent, an agentic omnimodal QA method for budgeted long audio-video reasoning. It builds an image-audio retrieval-augmented generation module that lets an OmniLLM fetch short, relevant frames and audio snippets from external banks. Moreover, it uses an agent loop that plans, calls tools across turns, and merges retrieved evidence to answer complex queries. Furthermore, we apply group relative policy optimization to jointly improve tool use and answer quality over time. Experiments on OmniVideoBench, WorldSense, and Daily-Omni show that OmniRAG-Agent consistently outperforms prior methods under low-resource settings and achieves strong results, with ablations validating each component.
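The agent loop described in the abstract — plan a tool call, retrieve a short clip of evidence, merge it, and repeat until no new evidence arrives — can be sketched at a toy scale. Everything here is an illustrative assumption, not the authors' implementation: the `FRAME_BANK`/`AUDIO_BANK` dictionaries stand in for the external frame and audio banks, keyword matching stands in for cross-modal retrieval, and `AgentLoop` is a hypothetical name.

```python
from dataclasses import dataclass, field

# Toy evidence banks standing in for the paper's external frame/audio banks.
# Keys are scene keywords; values are short captions for retrieved snippets.
FRAME_BANK = {
    "kitchen": ["frame_012: person opens fridge", "frame_034: pan on stove"],
    "street": ["frame_101: car passes by"],
}
AUDIO_BANK = {
    "kitchen": ["audio_012: sizzling sound"],
    "street": ["audio_101: engine noise"],
}

def retrieve_frames(query: str, k: int = 2) -> list[str]:
    # Keyword lookup as a stand-in for image retrieval over the bank.
    hits = [c for key, caps in FRAME_BANK.items() if key in query for c in caps]
    return hits[:k]

def retrieve_audio(query: str, k: int = 2) -> list[str]:
    # Keyword lookup as a stand-in for audio retrieval over the bank.
    hits = [c for key, caps in AUDIO_BANK.items() if key in query for c in caps]
    return hits[:k]

@dataclass
class AgentLoop:
    """Minimal multi-turn loop: plan a tool, call it, merge new evidence."""
    max_turns: int = 3
    evidence: list = field(default_factory=list)

    def answer(self, question: str) -> str:
        for turn in range(self.max_turns):
            # Trivial "plan": alternate between the two retrieval tools.
            tool = retrieve_frames if turn % 2 == 0 else retrieve_audio
            new = [e for e in tool(question) if e not in self.evidence]
            if not new:
                break  # no fresh evidence: stop early and answer
            self.evidence.extend(new)
        return f"answer grounded in {len(self.evidence)} evidence snippets"

agent = AgentLoop()
print(agent.answer("what is happening in the kitchen?"))
# Turn 0 fetches 2 frame captions, turn 1 fetches 1 audio caption,
# turn 2 finds nothing new and stops.
```

In the paper's setting, the planner and the merge step would be the OmniLLM itself, and group relative policy optimization would train it to stop retrieving once the budget is spent or the evidence suffices; the early-exit `break` above is a crude stand-in for that learned behavior.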