🤖 AI Summary
Long-form video question answering (LVQA) demands multi-step temporal reasoning and causal understanding. Existing vision-language models (VLMs) rely on uniform frame sampling, which drops critical frames and temporal cues, and they lack explicit temporal modeling and logical verification mechanisms. This paper proposes NeuS-QA, a training-free neuro-symbolic pipeline: it formalizes a natural-language question as a temporal logic formula, extracts frame-level semantic propositions, constructs a video automaton, and applies model checking to enable verifiable temporal reasoning. The method requires no fine-tuning and works with off-the-shelf VLMs for zero-shot LVQA. Evaluated on LongVideoBench and CinePile, it improves overall performance by over 10%, with particularly large gains, in both accuracy and interpretability, on questions demanding event ordering, causal inference, and compositional temporal reasoning.
📝 Abstract
Long-Form Video Question Answering (LVQA) poses challenges beyond traditional visual question answering (VQA), which is often limited to static images or short video clips. While current vision-language models (VLMs) perform well in those settings, they struggle with complex queries in LVQA over long videos involving multi-step temporal reasoning and causality. Vanilla approaches, which sample frames uniformly and feed them to a VLM with the question, incur significant token overhead, forcing severe downsampling. As a result, the model often misses fine-grained visual structure, subtle event transitions, or key temporal cues, ultimately leading to incorrect answers. To address these limitations, recent works have explored query-adaptive frame sampling, hierarchical keyframe selection, and agent-based iterative querying. However, these methods remain fundamentally heuristic: they lack explicit temporal representations and cannot enforce or verify logical event relationships. As a result, there are no formal guarantees that the sampled context actually encodes the compositional or causal logic demanded by the question. To address these foundational gaps, we introduce NeuS-QA, a training-free, plug-and-play neuro-symbolic pipeline for LVQA. NeuS-QA translates a natural language question into a formal temporal logic expression, constructs a video automaton from frame-level semantic propositions, and applies model checking to rigorously identify video segments satisfying the question's logical requirements. Only these logic-verified segments are submitted to the VLM, thus improving interpretability, reducing hallucinations, and enabling compositional reasoning without modifying or fine-tuning the model. Experiments on LongVideoBench and CinePile show NeuS-QA improves performance by over 10%, especially on questions involving event ordering, causality, and multi-step compositional reasoning.
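The pipeline described in the abstract (temporal logic formula → frame-level propositions → model checking over a video trace) can be illustrated with a minimal sketch. This is not the authors' implementation; the proposition names, example question, and helper function are hypothetical, and a real system would use a proper LTL parser and automaton library rather than this hand-rolled check for a single "A eventually followed by B" pattern.

```python
# Minimal sketch (hypothetical, not the NeuS-QA implementation): model-check
# a simple sequencing property over frame-level semantic propositions.
# Each frame is represented by the set of propositions that hold in it,
# as might be extracted by a VLM or object/action detector.

def check_eventually_then(frames, prop_a, prop_b):
    """Check the LTL-style formula F(prop_a & F(prop_b)) over a finite
    frame trace. Return (start, end) indices of the first segment where
    prop_a holds and prop_b holds at a strictly later frame, else None.
    The returned segment is the 'logic-verified' context to pass to a VLM."""
    for i, props in enumerate(frames):
        if prop_a in props:
            for j in range(i + 1, len(frames)):
                if prop_b in frames[j]:
                    return (i, j)
    return None

# Hypothetical per-frame propositions for a short clip:
frames = [
    {"person_visible"},
    {"person_visible", "picks_up_cup"},
    {"person_visible"},
    {"person_visible", "drinks"},
]

# A question like "Does the person drink after picking up the cup?"
# would be formalized as F(picks_up_cup & F(drinks)) and checked:
segment = check_eventually_then(frames, "picks_up_cup", "drinks")
print(segment)  # → (1, 3): frames 1..3 satisfy the formula
```

Only the verified segment (frames 1 through 3 here) would then be submitted to the VLM, rather than a uniform downsample of the whole video; if the check returns `None`, the ordering demanded by the question provably never occurs in the extracted propositions.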