Logic-in-Frames: Dynamic Keyframe Search via Visual Semantic-Logical Verification for Long Video Understanding

📅 2025-03-17

📈 Citations: 0

✨ Influential: 0

career value

184K/year

🤖 AI Summary

To address the challenge of aligning sparse keyframe sampling with textual query logic in long-video understanding, this paper proposes the Visual-Semantic Logical Search (VSLS) framework. VSLS is the first to incorporate a formal logical dependency system into keyframe retrieval, dynamically modeling four types of logical relations—spatial co-occurrence, temporal proximity, attribute dependence, and causal ordering—and establishing an iterative semantic-logical joint verification mechanism to bridge the logical gap between textual queries and visual temporal reasoning. The method integrates visual-semantic embedding, dynamic distribution reweighting, and iterative refinement, and is designed for multi-granularity video question answering. On a human-annotated benchmark, VSLS achieves new state-of-the-art performance in keyframe selection. Downstream evaluation on LongVideoBench and Video-MME demonstrates significant improvements over existing approaches. The code is publicly available.

Technology Category

Application Category

📝 Abstract

Understanding long video content is a complex endeavor that often relies on densely sampled frame captions or end-to-end feature selectors, yet these techniques commonly overlook the logical relationships between textual queries and visual elements. In practice, computational constraints necessitate coarse frame subsampling, a challenge analogous to ``finding a needle in a haystack.'' To address this issue, we introduce a semantics-driven search framework that reformulates keyframe selection under the paradigm of Visual Semantic-Logical Search. Specifically, we systematically define four fundamental logical dependencies: 1) spatial co-occurrence, 2) temporal proximity, 3) attribute dependency, and 4) causal order. These relations dynamically update frame sampling distributions through an iterative refinement process, enabling context-aware identification of semantically critical frames tailored to specific query requirements. Our method establishes new SOTA performance on the manually annotated benchmark in key-frame selection metrics. Furthermore, when applied to downstream video question-answering tasks, the proposed approach demonstrates the best performance gains over existing methods on LongVideoBench and Video-MME, validating its effectiveness in bridging the logical gap between textual queries and visual-temporal reasoning. The code will be publicly available.

Problem

Research questions and friction points this paper is trying to address.

Addresses logical relationships in long video understanding

Introduces semantics-driven keyframe selection framework

Improves video question-answering with logical dependencies

Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual Semantic-Logical Search framework

Dynamic keyframe selection via logical dependencies

Iterative refinement for context-aware frame identification

🔎 Similar Papers

Too Many Frames, not all Useful: Efficient Strategies for Long-Form Video QA