A Benchmark and Agentic Framework for Omni-Modal Reasoning and Tool Use in Long Videos

📅 2025-12-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current long-video multimodal understanding lacks diagnostic benchmarks that jointly address temporal length and modality richness, and existing evaluation protocols are often too coarse to reveal failure modes. To bridge this gap, we introduce LongShOTBench, a diagnostic benchmark for long-video understanding comprising over one thousand human-verified samples and supporting open-ended intent recognition, multi-turn dialogue, and cross-modal tool invocation. We further propose LongShOTAgent, an agentic framework that combines joint video-audio-speech preprocessing, semantic retrieval, and iterative reasoning, evaluated with a traceable, rubric-based grading scheme. Experiments show LongShOTAgent achieves 44.66% on LongShOTBench, significantly outperforming leading open-source multimodal large language models (MLLMs), all of which score below 30%, thereby exposing fundamental bottlenecks in long-horizon multimodal reasoning.

📝 Abstract
Long-form multimodal video understanding requires integrating vision, speech, and ambient audio with coherent long-range reasoning. Existing benchmarks emphasize either temporal length or multimodal richness, but rarely both; while some incorporate open-ended questions and advanced metrics, most rely on single-score accuracy, obscuring failure modes. We introduce LongShOTBench, a diagnostic benchmark with open-ended, intent-driven questions; single- and multi-turn dialogues; and tasks requiring multimodal reasoning and agentic tool use across video, audio, and speech. Each item includes a reference answer and a graded rubric for interpretable, traceable evaluation. LongShOTBench is produced via a scalable, human-validated pipeline to ensure coverage and reproducibility, and every sample is human-verified and corrected. Furthermore, we present LongShOTAgent, an agentic system that analyzes long videos via preprocessing, search, and iterative refinement. On LongShOTBench, state-of-the-art MLLMs show large gaps: Gemini-2.5-Flash achieves 52.95%, open-source models remain below 30%, and LongShOTAgent attains 44.66%. These results underscore the difficulty of real-world long-form video understanding. LongShOTBench provides a practical, reproducible foundation for evaluating and improving MLLMs. All resources are available on GitHub: https://github.com/mbzuai-oryx/longshot.
Problem

Research questions and friction points this paper is trying to address.

Benchmarking multimodal reasoning and tool use in long videos.
Addressing gaps in existing video understanding benchmarks and metrics.
Evaluating agentic systems for long-form multimodal video analysis.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diagnostic benchmark with open-ended questions and dialogues
Agentic system using preprocessing, search, and iterative refinement
Scalable human-validated pipeline ensuring coverage and reproducibility
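The preprocess–search–refine loop described above can be sketched in miniature. This is a hedged illustration only, not the paper's implementation: `Segment`, `preprocess`, `search`, and `answer` are hypothetical names, token overlap stands in for real semantic retrieval, and the "refinement" is simply widening the retrieved context each round.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    start: float      # clip start time in seconds
    end: float        # clip end time in seconds
    transcript: str   # speech/audio text for this clip
    caption: str      # visual description for this clip

def preprocess(segments):
    """Index each segment by a bag of lowercase tokens (a stand-in for
    joint video-audio-speech preprocessing and embedding)."""
    return [(s, set((s.transcript + " " + s.caption).lower().split()))
            for s in segments]

def search(index, query, top_k=2):
    """Rank segments by token overlap with the query (a stand-in for
    semantic retrieval over the preprocessed index)."""
    q = set(query.lower().split())
    scored = sorted(index, key=lambda pair: len(q & pair[1]), reverse=True)
    return [s for s, _ in scored[:top_k]]

def answer(segments, query, max_rounds=3):
    """Iteratively widen the retrieved context until supporting evidence
    is found, mirroring a preprocess -> search -> refine loop."""
    index = preprocess(segments)
    for k in range(1, max_rounds + 1):
        context = search(index, query, top_k=k)
        evidence = [s for s in context
                    if any(w in s.transcript.lower()
                           for w in query.lower().split())]
        if evidence:
            return f"Found at {evidence[0].start:.0f}s: {evidence[0].transcript}"
    return "No supporting evidence found."
```

A real agent would replace the token-overlap retrieval with learned embeddings and the evidence check with an MLLM call, but the control flow, retrieving more context on each failed round, is the same shape.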
Mohammed Irfan Kurpath
Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI)
Jaseel Muhammad Kaithakkodan
Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI)
Jinxing Zhou
Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI)
Sahal Shaji Mullappilly
PhD Computer Vision Student, MBZUAI
Vision Language Models, Computer Vision, Object Detection, Real-time Models
Mohammad Almansoori
Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI)
Noor Ahsan
Research Engineer
Remote Sensing, Geospatial Data, Machine Learning Research
Beknur Kalmakhanbet
Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI)
Sambal Shikhar
Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI)
Rishabh Lalla
Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI)
Jean Lahoud
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
Computer Vision
Mariette Awad
American University of Beirut
Fahad Shahbaz Khan
MBZUAI, Linköping University Sweden
Computer Vision, Object Recognition, Generative AI, AI for Science
Salman Khan
Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI)
Rao Muhammad Anwer
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
Computer Vision, Object Recognition
Hisham Cholakkal
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
Computer Vision, Large Multimodal Models, LLM, Healthcare Foundation Model, Conversational Assistant