🤖 AI Summary
Existing video question answering methods often suffer from severe hallucination, poor interpretability, and weak alignment between visual cues and answers, largely because they lack explicit structured reasoning. To address these limitations, this work proposes ClueNet, a clue-aware video reasoning framework that emulates hierarchical human visual cognition. ClueNet decouples clue extraction from chain-of-thought reasoning via a two-stage supervised fine-tuning strategy and incorporates an adaptive clue filter to refine high-order reasoning. Notably, it achieves substantial improvements in faithfulness, interpretability, and generalization without modifying the large foundation model, relying instead on lightweight auxiliary modules. The method consistently outperforms state-of-the-art approaches by at least 1.1% on the NExT-QA, STAR, and MVBench benchmarks, effectively bridging the gap between perception and generation.
📝 Abstract
Multi-modal Large Language Models (MLLMs) have significantly advanced video reasoning, yet Video Question Answering (VideoQA) remains challenging due to its demand for temporal causal reasoning and evidence-grounded answer generation. Prevailing end-to-end MLLM frameworks lack explicit structured reasoning between visual perception and answer derivation, causing severe hallucinations and poor interpretability. Existing methods also leave three core gaps unaddressed: faithful visual clue extraction, utility-aware clue filtering, and end-to-end clue-answer alignment. Inspired by hierarchical human visual cognition, we propose ClueNet, a clue-aware video reasoning framework trained with a two-stage supervised fine-tuning paradigm that requires no extensive modification of the base model. In the first stage, decoupled supervision aligns clue extraction with chain-based reasoning; in the second, inference supervision with an adaptive clue filter refines high-order reasoning, supported by lightweight modules for efficient inference. Experiments on NExT-QA, STAR, and MVBench show that ClueNet outperforms state-of-the-art methods by $\ge$ 1.1%, with superior generalization, hallucination mitigation, inference efficiency, and cross-backbone compatibility. This work bridges the perception-to-generation gap in MLLM video understanding, providing an interpretable, faithful reasoning paradigm for high-stakes VideoQA applications.
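The pipeline the abstract describes (extract visual clues, filter them by utility, then condition the answer on the surviving clues) might be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's actual modules: the `Clue` fields, the `keep_ratio` heuristic standing in for the adaptive clue filter, and the toy `answer` function are all assumptions for exposition.

```python
from dataclasses import dataclass

@dataclass
class Clue:
    frame_idx: int   # frame the clue was extracted from
    text: str        # textual description of the visual evidence
    utility: float   # assumed relevance score to the question, in [0, 1]

def adaptive_filter(clues, keep_ratio=0.5):
    """Toy stand-in for a utility-aware clue filter: keep the top fraction
    of clues ranked by utility, always retaining at least one."""
    if not clues:
        return []
    ranked = sorted(clues, key=lambda c: c.utility, reverse=True)
    k = max(1, int(len(ranked) * keep_ratio))
    return ranked[:k]

def answer(question, clues):
    """Stand-in for the answer-generation step: here we just show how the
    response would be grounded in the filtered clues."""
    evidence = "; ".join(f"[frame {c.frame_idx}] {c.text}" for c in clues)
    return f"Q: {question} | grounded on: {evidence}"

# Hypothetical clues extracted from a short video.
clues = [
    Clue(3, "person picks up a cup", 0.9),
    Clue(7, "poster visible in background", 0.1),
    Clue(12, "person drinks from the cup", 0.8),
    Clue(20, "camera pans to the left", 0.2),
]
kept = adaptive_filter(clues, keep_ratio=0.5)
print(answer("Why does the person pick up the cup?", kept))
```

The point of the sketch is the decoupling: clue extraction and filtering happen before answer generation, so the final response can only draw on evidence that survived the utility check, which is the mechanism the abstract credits for reduced hallucination.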