Process-of-Thought Reasoning for Videos

📅 2026-02-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited capacity of existing video understanding methods to perform traceable, multi-step temporal reasoning over long, noisy observations. The authors propose the Process-of-Thought (PoT) framework, which makes the reasoning process explicit by structuring video inference into a sequence of lightweight, verifiable steps. PoT interleaves temporal evidence selection, step-wise state updates, and constrained answer synthesis, so each intermediate decision stays traceable to video evidence. The framework is model-agnostic, can be plugged into existing vision-language backbones, and supports both closed-book and tool-augmented reasoning. Experiments on standard video reasoning tasks show that PoT consistently improves factual correctness and temporal grounding while producing interpretable reasoning traces for error diagnosis and downstream use.

📝 Abstract
Video understanding requires not only recognizing visual content but also performing temporally grounded, multi-step reasoning over long and noisy observations. We propose Process-of-Thought (PoT) Reasoning for Videos, a framework that makes the reasoning process explicit by structuring video inference into a sequence of lightweight, verifiable steps. PoT interleaves (i) temporal evidence selection, (ii) step-wise state updates, and (iii) constrained answer synthesis, enabling the model to progressively refine hypotheses while maintaining traceability to video evidence. The framework is designed to be model-agnostic and can be plugged into existing vision-language backbones, supporting both closed-book reasoning and evidence-augmented reasoning with external tools. We further introduce a unified representation for PoT traces that aligns intermediate decisions with temporal segments, which improves robustness to distractors and reduces hallucinated explanations. Extensive experiments on standard video reasoning tasks demonstrate that PoT consistently improves factual correctness and temporal grounding, while providing interpretable reasoning traces for diagnosis and downstream use.
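The three interleaved stages named in the abstract can be illustrated with a small, self-contained Python sketch. This is not the authors' implementation: every name here (`Step`, `Trace`, `run_pot`, `select`, `update`, `synthesize`) is hypothetical, and the toy captions with keyword-overlap scoring are only a stand-in for a real vision-language backbone. The sketch assumes a trace is a list of steps, each tied to one temporal segment, which is one plausible reading of a "unified representation" that aligns intermediate decisions with temporal segments.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec); hypothetical representation


@dataclass
class Step:
    segment: Segment   # temporal evidence this step relies on
    observation: str   # what was observed in that segment
    state: str         # running hypothesis after the update


@dataclass
class Trace:
    question: str
    steps: List[Step] = field(default_factory=list)
    answer: str = ""


def run_pot(question: str,
            candidates: List[Segment],
            select: Callable[[str, str, List[Segment]], Segment],
            describe: Callable[[Segment], str],
            update: Callable[[str, str], str],
            synthesize: Callable[[str, List[Step]], str],
            max_steps: int = 2) -> Trace:
    """Interleave evidence selection and state updates, then synthesize an answer."""
    trace = Trace(question)
    state = ""
    remaining = list(candidates)
    for _ in range(max_steps):
        if not remaining:
            break
        seg = select(question, state, remaining)   # (i) temporal evidence selection
        remaining.remove(seg)
        obs = describe(seg)
        state = update(state, obs)                 # (ii) step-wise state update
        trace.steps.append(Step(seg, obs, state))
    trace.answer = synthesize(question, trace.steps)  # (iii) constrained synthesis
    return trace


# Toy stand-ins for a backbone (assumptions for illustration only).
CAPTIONS = {
    (0.0, 5.0): "a man opens the fridge",
    (5.0, 10.0): "a dog sleeps on the sofa",
    (10.0, 15.0): "the man pours milk into a glass",
}
OPTIONS = ["milk", "juice", "water"]


def select(question, state, remaining):
    # Pick the segment whose caption shares the most words with the question.
    q = set(question.lower().split())
    return max(remaining, key=lambda s: len(q & set(CAPTIONS[s].split())))


def update(state, obs):
    # Append the new observation to the running state.
    return (state + "; " + obs).strip("; ")


def synthesize(question, steps):
    # Constrained: only return an option that some cited step supports.
    for opt in OPTIONS:
        if any(opt in step.observation.split() for step in steps):
            return opt
    return "unknown"


trace = run_pot("what does the man pour into a glass",
                list(CAPTIONS), select, CAPTIONS.get, update, synthesize)
print(trace.answer)
for step in trace.steps:
    print(step.segment, "->", step.observation)
```

Because every step records its segment, a wrong answer can be traced back to the evidence that produced it. The callables passed to `run_pot` could wrap a closed-book model or an external tool without changing the loop.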
Problem

Research questions and friction points this paper is trying to address.

video understanding
temporal reasoning
multi-step reasoning
noisy observations
temporal grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Process-of-Thought
video reasoning
temporal grounding
interpretable reasoning
model-agnostic framework