SeViCES: Unifying Semantic-Visual Evidence Consensus for Long Video Understanding

📅 2025-10-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Long-video understanding faces content complexity, temporal dispersion, and high computational overhead, and existing frame selection methods often neglect temporal dependencies or rely on a single modality, leading to unfocused and inconsistent Video-LLM inference. To address this, we propose a training-free, generalizable semantic-visual evidence consensus framework. It performs dual-branch keyframe selection via LLM-driven, temporal-aware semantic reasoning and mutual-information-guided visual embedding alignment, and it introduces answer-space constraints and evidence fusion to enforce consensus optimization, mitigating cross-modal prediction bias. The method achieves significant improvements over state-of-the-art approaches across multiple long-video understanding benchmarks, demonstrating both higher accuracy and robustness. This work is the first to empirically validate that multimodal evidence consensus, rather than isolated modality cues, can effectively enhance long-video reasoning quality through principled frame selection.
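The dual-branch selection idea can be illustrated with a toy sketch. Everything here is a hypothetical stand-in, not the paper's implementation: `semantic_scores_stub` replaces LLM reasoning over captions with word overlap, the 1-D "embeddings" and tiny k-means replace real visual features and clustering, and the mutual information between the semantic top-k labels and cluster assignments acts as a global agreement weight when fusing the two branches.

```python
import math
from collections import Counter

def semantic_scores_stub(captions, query):
    # Stand-in for the LLM semantic branch: score each frame caption
    # by how many query words it contains.
    q = set(query.lower().split())
    return [len(q & set(c.lower().split())) for c in captions]

def cluster_frames(embeddings, k=2, iters=10):
    # Tiny k-means on 1-D toy embeddings (real systems would use
    # high-dimensional visual features).
    centroids = [embeddings[i * len(embeddings) // k] for i in range(k)]
    labels = [0] * len(embeddings)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: abs(e - centroids[j]))
                  for e in embeddings]
        for j in range(k):
            members = [e for e, l in zip(embeddings, labels) if l == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return labels

def mutual_information(xs, ys):
    # Empirical I(X;Y) from joint counts, in nats.
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def select_frames(captions, embeddings, query, top_k=2):
    sem = semantic_scores_stub(captions, query)
    clusters = cluster_frames(embeddings)
    # Binary "semantically relevant" labels from the top-k semantic scores.
    cutoff = sorted(sem, reverse=True)[top_k - 1]
    relevant = [int(s >= cutoff) for s in sem]
    # How strongly the visual clustering agrees with semantic evidence.
    mi = mutual_information(relevant, clusters)
    # Per-cluster relevance: fraction of semantically relevant members.
    cluster_rel = {c: sum(r for r, cl in zip(relevant, clusters) if cl == c)
                      / clusters.count(c)
                   for c in set(clusters)}
    combined = [s + mi * cluster_rel[c] for s, c in zip(sem, clusters)]
    picked = sorted(range(len(combined)), key=lambda i: -combined[i])[:top_k]
    return sorted(picked)
```

In this sketch a frame is kept when its semantic score is high and its visual cluster also contains semantically relevant frames, with the MI term damping the visual bonus whenever the two branches disagree.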

📝 Abstract
Long video understanding remains challenging due to its complex, diverse, and temporally scattered content. Although video large language models (Video-LLMs) can process videos lasting tens of minutes, applying them to truly long sequences is computationally prohibitive and often leads to unfocused or inconsistent reasoning. A promising solution is to select only the most informative frames, yet existing approaches typically ignore temporal dependencies or rely on unimodal evidence, limiting their ability to provide complete and query-relevant context. We propose a Semantic-Visual Consensus Evidence Selection (SeViCES) framework for effective and reliable long video understanding. SeViCES is training-free and model-agnostic, and introduces two key components. The Semantic-Visual Consensus Frame Selection (SVCFS) module selects frames through (1) a temporal-aware semantic branch that leverages LLM reasoning over captions, and (2) a cluster-guided visual branch that aligns embeddings with semantic scores via mutual information. The Answer Consensus Refinement (ACR) module further resolves inconsistencies between semantic- and visual-based predictions by fusing evidence and constraining the answer space. Extensive experiments on long video understanding benchmarks show that SeViCES consistently outperforms state-of-the-art methods in both accuracy and robustness, demonstrating the importance of consensus-driven evidence selection for Video-LLMs.
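The Answer Consensus Refinement step described above can be sketched as a small fusion routine. This is an illustrative assumption about one way to "constrain the answer space and fuse evidence", not the paper's actual ACR module: each branch contributes a probability distribution over answer options, the answer space is restricted to options both branches rank highly (falling back to the union if they fully disagree), and the constrained distributions are averaged and renormalized.

```python
def refine_answer(sem_probs, vis_probs, alpha=0.5, top_m=2):
    """Fuse semantic- and visual-branch answer distributions.

    sem_probs / vis_probs: dicts mapping answer option -> probability.
    alpha: weight on the semantic branch; top_m: per-branch shortlist size.
    """
    def shortlist(probs):
        return set(sorted(probs, key=probs.get, reverse=True)[:top_m])

    # Constrain the answer space to options both branches shortlist;
    # if the branches share nothing, fall back to the union.
    allowed = shortlist(sem_probs) & shortlist(vis_probs)
    if not allowed:
        allowed = shortlist(sem_probs) | shortlist(vis_probs)

    fused = {a: alpha * sem_probs.get(a, 0.0)
                + (1 - alpha) * vis_probs.get(a, 0.0)
             for a in allowed}
    z = sum(fused.values())
    fused = {a: p / z for a, p in fused.items()}
    return max(fused, key=fused.get), fused
```

With `sem = {"A": 0.5, "B": 0.3, "C": 0.2}` and `vis = {"B": 0.6, "A": 0.25, "C": 0.15}`, both shortlists contain A and B, and the fused scores favor B: the visual branch's stronger evidence overrules the semantic branch's weak preference for A, which is the kind of cross-modal bias the consensus step is meant to resolve.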
Problem

Research questions and friction points this paper is trying to address.

Addresses long-video understanding challenges posed by complex, temporally scattered content
Selects the most informative frames via semantic-visual consensus evidence selection
Resolves inconsistencies between semantic- and visual-based predictions through evidence fusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free, model-agnostic framework for long video understanding
Dual-branch semantic-visual consensus selects keyframes
Fuses cross-modal evidence to refine the answer by consensus