🤖 AI Summary
While existing neuro-symbolic methods achieve high accuracy in long-form video question answering, their inference latency can be up to 90 times that of baseline vision-language model (VLM) prompting, rendering them impractical for edge deployment. This work proposes an adaptive temporal verification framework that alleviates the frame-level dense reasoning bottleneck through CLIP-guided two-stage adaptive sampling and a batched propositional verification mechanism, while also establishing theoretical latency bounds. By integrating CLIP visual features, formal automaton-based verification, and batched VLM inference, the method reduces inference latency to approximately 10 times that of the baseline on LongVideoBench and Video-MME, while simultaneously achieving over a 10% absolute improvement in accuracy on complex temporal queries.
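The batched propositional verification mechanism can be illustrated with a small sketch. The paper's actual interface is not given, so `vlm_batch` below is a hypothetical stand-in for one batched VLM forward pass; the point is only that proposition checks are grouped per temporal window into a single batched call instead of one sequential call per (frame, proposition) pair.

```python
def vlm_batch(queries):
    # Hypothetical stub for a batched VLM call. A real system would run
    # one forward pass over all (frame, proposition) queries at once;
    # here we fake the answer by substring matching on the frame text.
    return [prop in frame for frame, prop in queries]

def detect_propositions(frames, propositions, window=8):
    """Return {proposition: [bool per frame]}, issuing one batched
    VLM call per temporal window rather than per frame."""
    results = {p: [] for p in propositions}
    for start in range(0, len(frames), window):
        chunk = frames[start:start + window]
        # Flatten all (frame, proposition) pairs in this window into one batch.
        queries = [(f, p) for p in propositions for f in chunk]
        answers = vlm_batch(queries)
        it = iter(answers)
        for p in propositions:
            results[p].extend(next(it) for _ in chunk)
    return results

frames = ["a door opens", "empty room", "a door closes"]
print(detect_propositions(frames, ["door"]))  # {'door': [True, False, True]}
```

With `W` frames per window and `P` propositions, this turns `W * P` sequential VLM invocations into a single batched invocation per window, which is where the latency reduction comes from.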
📝 Abstract
Neuro-symbolic approaches to long-form video question answering (LVQA) have demonstrated significant accuracy improvements by grounding temporal reasoning in formal verification. However, existing methods incur prohibitive latency overheads, up to 90x slower than base VLM prompting, rendering them impractical for latency-sensitive edge deployments. We present LE-NeuS, a latency-efficient neuro-symbolic framework that preserves the accuracy benefits of temporal logic-guided video understanding while drastically reducing inference latency. Our key insight is that the dominant computational bottleneck arises from sequential, dense proposition detection across video frames during automaton construction. We address this through two principled optimizations: (1) CLIP-guided two-stage adaptive sampling that exploits visual redundancy to skip semantically similar frames while preserving temporal boundaries, and (2) batched proposition detection that parallelizes VLM inference across temporal windows. Theoretically, we derive latency bounds as a function of video length, proposition complexity, and sampling density, establishing conditions under which latency efficiency is achievable. Empirically, on the LongVideoBench and Video-MME benchmarks deployed on NVIDIA H100 GPUs, LE-NeuS reduces the latency gap from 90x to approximately 10x while maintaining over 10% absolute accuracy gains on temporally complex queries.
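The CLIP-guided adaptive sampling idea can be sketched as follows. The abstract does not specify the exact criterion, so this is a minimal assumed interpretation: keep the first frame, then keep a frame only when the cosine similarity of its CLIP embedding to the last kept frame drops below a threshold, treating that drop as a proxy for a temporal boundary. The threshold value and the synthetic embeddings are illustrative, not from the paper.

```python
import numpy as np

def adaptive_sample(frame_embeddings, sim_threshold=0.95, stride=1):
    """Sketch of similarity-gated frame sampling. `frame_embeddings`
    is an (N, d) array of per-frame CLIP features; returns the indices
    of frames kept for downstream proposition detection."""
    kept = [0]
    last = frame_embeddings[0]
    for i in range(stride, len(frame_embeddings), stride):
        emb = frame_embeddings[i]
        sim = float(emb @ last) / (np.linalg.norm(emb) * np.linalg.norm(last))
        if sim < sim_threshold:  # semantic change -> likely scene boundary
            kept.append(i)
            last = emb
    return kept

# Synthetic "CLIP" embeddings: two visually distinct scenes, 10 frames each.
embs = np.vstack([np.tile([1.0, 0.0], (10, 1)), np.tile([0.0, 1.0], (10, 1))])
print(adaptive_sample(embs))  # keeps one frame per scene: [0, 10]
```

Redundant frames within a scene are skipped entirely, so the number of VLM proposition queries scales with the number of visually distinct segments rather than with raw video length.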