NeuS-QA: Grounding Long-Form Video Understanding in Temporal Logic and Neuro-Symbolic Reasoning

📅 2025-09-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Long-form video question answering (LVQA) requires multi-step temporal reasoning and causal understanding. Existing vision-language models (VLMs) rely on uniform frame sampling, which drops critical frames and temporal cues, and they lack explicit temporal modeling and logical verification mechanisms. This paper proposes a training-free neuro-symbolic pipeline: it formalizes a natural language question as a temporal logic formula, extracts frame-level semantic propositions, constructs a video automaton, and applies model checking to enable verifiable temporal-logic reasoning. The method requires no fine-tuning and works as a plug-in with off-the-shelf VLMs for zero-shot LVQA. Evaluated on LongVideoBench and CinePile, it achieves over 10% absolute improvement in overall performance, with particularly large gains, in both accuracy and interpretability, on questions demanding event ordering, causal inference, and compositional temporal reasoning.

📝 Abstract
Long-Form Video Question Answering (LVQA) poses challenges beyond traditional visual question answering (VQA), which is often limited to static images or short video clips. While current vision-language models (VLMs) perform well in those settings, they struggle with complex queries in LVQA over long videos involving multi-step temporal reasoning and causality. Vanilla approaches, which sample frames uniformly and feed them to a VLM with the question, incur significant token overhead, forcing severe downsampling. As a result, the model often misses fine-grained visual structure, subtle event transitions, or key temporal cues, ultimately leading to incorrect answers. To address these limitations, recent works have explored query-adaptive frame sampling, hierarchical keyframe selection, and agent-based iterative querying. However, these methods remain fundamentally heuristic: they lack explicit temporal representations and cannot enforce or verify logical event relationships. As a result, there are no formal guarantees that the sampled context actually encodes the compositional or causal logic demanded by the question. To address these foundational gaps, we introduce NeuS-QA, a training-free, plug-and-play neuro-symbolic pipeline for LVQA. NeuS-QA translates a natural language question into a formal temporal logic expression, constructs a video automaton from frame-level semantic propositions, and applies model checking to rigorously identify video segments satisfying the question's logical requirements. Only these logic-verified segments are submitted to the VLM, thus improving interpretability, reducing hallucinations, and enabling compositional reasoning without modifying or fine-tuning the model. Experiments on LongVideoBench and CinePile show NeuS-QA improves performance by over 10%, especially on questions involving event ordering, causality, and multi-step compositional reasoning.
Problem

Research questions and friction points this paper is trying to address.

Long-form video QA struggles with multi-step temporal reasoning and causality
Existing methods lack explicit temporal representations for logical event relationships
Current approaches miss fine-grained visual structure due to severe downsampling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Translates questions into temporal logic expressions
Constructs video automaton from semantic propositions
Applies model checking to identify verified segments
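The three steps above can be illustrated with a minimal sketch. The proposition names, function, and two-state construction below are hypothetical simplifications, not the paper's actual formalism: we check a simple sequencing formula F(p AND F q) ("eventually p, then later q") against per-frame boolean propositions, and return the logic-verified segment that would be handed to the VLM.

```python
# Hypothetical sketch of the NeuS-QA pipeline idea (names and structure are
# illustrative, not the paper's implementation). Each frame carries boolean
# propositions, e.g. as extracted by a VLM; a tiny two-state automaton checks
# the ordering formula F(p AND F q).

def check_sequence(frames, p, q):
    """frames: list of dicts mapping proposition names to booleans.
    Returns (start, end) frame indices of the first segment where p holds
    and q holds at some later frame, or None if the formula is unsatisfied."""
    state, start = 0, None        # state 0: waiting for p; state 1: waiting for q
    for i, props in enumerate(frames):
        if state == 0 and props.get(p):
            state, start = 1, i   # p observed: remember the segment start
        elif state == 1 and props.get(q):
            return (start, i)     # q observed after p: formula satisfied
    return None

# Example: per-frame propositions for a question like
# "Did a person exit after the door opened?"
frames = [
    {"door_opens": False, "person_exits": False},
    {"door_opens": True,  "person_exits": False},
    {"door_opens": False, "person_exits": False},
    {"door_opens": False, "person_exits": True},
]
segment = check_sequence(frames, "door_opens", "person_exits")
# segment == (1, 3): only this verified span would be submitted to the VLM
```

In the actual system, the temporal logic formula is derived from the question by an LLM and the automaton is built over richer propositions; the point of the sketch is that satisfaction of the formula is decided by explicit state tracking, not by heuristics, so the returned segment provably encodes the event ordering the question asks about.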