A Skill-augmented Agentic Framework and Benchmark for Multi-Video Understanding

📅 2026-03-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing approaches to multi-video understanding suffer from training-inference inconsistency, information loss due to frame compression, and a lack of explicit cross-video coordination. Moreover, current evaluation benchmarks are limited to event-level comparisons, which are insufficient for tasks requiring identity matching, fine-grained discrimination, and multi-step structured reasoning. To address these limitations, this work introduces MVX-Bench, a unified multi-video question-answering benchmark that reformulates eleven classical vision tasks into a multi-video QA format for the first time, and proposes SAMA, a skill-augmented agentic framework. SAMA integrates a multimodal large language model, visual tools, task-specific skill modules, and a conflict-aware verification mechanism to enable iterative structured reasoning. Experiments demonstrate that SAMA significantly outperforms both open-source baselines and GPT on MVX-Bench, with ablation studies confirming the effectiveness of its skill design and conflict-resolution mechanisms.
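To make the described architecture concrete, the sketch below shows one plausible shape of a skill-augmented agent loop with conflict-aware verification: a planner selects skill modules, their evidence is checked for contradictions, and the loop iterates. All names (`Evidence`, `resolve_conflicts`, the stub planner and skills) are illustrative assumptions, not SAMA's actual API.

```python
# Hypothetical sketch of a skill-augmented agent loop with conflict-aware
# verification; names and control flow are assumptions, not SAMA's code.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Evidence:
    skill: str    # skill module that produced the claim
    claim: str    # structured claim, e.g. "id(v1) == id(v2)"
    score: float  # skill-reported confidence in [0, 1]

SkillFn = Callable[[str, List[str]], Evidence]

def resolve_conflicts(evidence: List[Evidence]) -> List[Evidence]:
    """Conflict-aware verification stand-in: when two skills disagree
    about the same topic (left-hand entity of the claim), keep the
    higher-confidence claim."""
    best: Dict[str, Evidence] = {}
    for e in evidence:
        topic = e.claim.split("==")[0].strip()
        if topic not in best or e.score > best[topic].score:
            best[topic] = e
    return list(best.values())

def agent_answer(question: str, videos: List[str],
                 skills: Dict[str, SkillFn],
                 plan: Callable[[str, List[Evidence]], List[str]],
                 answer: Callable[[str, List[Evidence]], str],
                 max_rounds: int = 3) -> str:
    """Iterative structured reasoning: plan skill calls, gather evidence,
    verify it for conflicts, repeat until the planner is satisfied."""
    evidence: List[Evidence] = []
    for _ in range(max_rounds):
        todo = plan(question, evidence)  # planner (an MLLM in the paper)
        if not todo:
            break                        # planner needs no further skills
        for name in todo:
            evidence.append(skills[name](question, videos))
        evidence = resolve_conflicts(evidence)
    return answer(question, evidence)

# Toy usage with stubs; a real system would back these with an MLLM
# and visual tools (detectors, trackers, re-identification models).
skills: Dict[str, SkillFn] = {
    "reid": lambda q, v: Evidence("reid", "id(v1) == id(v2)", 0.9),
}
plan = lambda q, ev: [] if ev else ["reid"]
answer = lambda q, ev: "yes" if ev and max(e.score for e in ev) > 0.5 else "unsure"
print(agent_answer("Same person across both videos?",
                   ["v1.mp4", "v2.mp4"], skills, plan, answer))  # -> yes
```

Keeping only the highest-confidence claim per topic is the simplest possible resolution rule; the paper's actual mechanism reportedly re-verifies conflicting evidence rather than discarding it outright.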

📝 Abstract
Multimodal Large Language Models have achieved strong performance in single-video understanding, yet their ability to reason across multiple videos remains limited. Existing approaches typically concatenate multiple videos into a single input and perform direct inference, which introduces training-inference mismatch, information loss from frame compression, and a lack of explicit cross-video coordination. Meanwhile, current multi-video benchmarks primarily emphasize event-level comparison, leaving identity-level matching, fine-grained discrimination, and structured multi-step reasoning underexplored. To address these gaps, we introduce MVX-Bench, a Multi-Video Cross-Dimension Benchmark that reformulates 11 classical computer vision tasks into a unified multi-video question-answering framework, comprising 1,442 questions over 4,255 videos from diverse real-world datasets. We further propose SAMA, a Skill-Augmented Agentic Framework for Multi-Video Understanding, which integrates visual tools, task-specific skills, and a conflict-aware verification mechanism to enable iterative and structured reasoning. Experimental results show that SAMA outperforms strong open-source baselines and GPT on MVX-Bench, and ablations validate the effectiveness of skill design and conflict resolution.
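The abstract's "unified multi-video question-answering framework" suggests each benchmark item bundles several source videos, a task label drawn from the 11 reformulated vision tasks, and a question with candidate answers. The schema below is an assumed illustration of that shape, not the released MVX-Bench format; all field names and values are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MVQAItem:
    """Hypothetical shape of one multi-video QA item; field names are
    assumptions, not the released MVX-Bench schema."""
    task: str          # one of the 11 reformulated vision tasks,
                       # e.g. "re-identification" or "tracking"
    videos: List[str]  # IDs/paths of the videos the question spans
    question: str      # natural-language question across the videos
    options: List[str] # multiple-choice candidates
    answer: str        # gold option label, e.g. "A"

item = MVQAItem(
    task="re-identification",
    videos=["clip_0001.mp4", "clip_0042.mp4"],
    question="Does the person in red in the first video appear in the second?",
    options=["A. Yes", "B. No"],
    answer="A",
)
```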
Problem

Research questions and friction points this paper is trying to address.

multi-video understanding
cross-video reasoning
multimodal large language models
video benchmark
fine-grained discrimination
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-video understanding
skill-augmented agentic framework
structured reasoning
conflict-aware verification
multimodal benchmark