Clue Matters: Leveraging Latent Visual Clues to Empower Video Reasoning

📅 2026-03-16
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing video question answering methods often suffer from severe hallucination, poor interpretability, and insufficient alignment between visual cues and answers due to the lack of explicit structured reasoning. To address these limitations, this work proposes ClueNet, a novel clue-aware video reasoning framework that emulates human hierarchical visual cognition. ClueNet decouples clue extraction and chain-of-thought reasoning through a two-stage supervised fine-tuning strategy and incorporates an adaptive clue filter to refine high-order reasoning. Notably, it achieves substantial improvements in faithfulness, interpretability, and generalization without modifying large foundation models, relying instead on lightweight auxiliary modules. The method consistently outperforms state-of-the-art approaches by at least 1.1% across NExT-QA, STAR, and MVBench benchmarks, effectively bridging the gap between perception and generation.

📝 Abstract
Multi-modal Large Language Models (MLLMs) have significantly advanced video reasoning, yet Video Question Answering (VideoQA) remains challenging due to its demand for temporal causal reasoning and evidence-grounded answer generation. Prevailing end-to-end MLLM frameworks lack explicit structured reasoning between visual perception and answer derivation, causing severe hallucinations and poor interpretability. Existing methods also fail to address three core gaps: faithful visual clue extraction, utility-aware clue filtering, and end-to-end clue-answer alignment. Inspired by hierarchical human visual cognition, we propose ClueNet, a clue-aware video reasoning framework trained with a two-stage supervised fine-tuning paradigm that requires no extensive base-model modifications. Decoupled supervision aligns clue extraction with chain-based reasoning, while inference supervision with an adaptive clue filter refines high-order reasoning, supported by lightweight modules for efficient inference. Experiments on NExT-QA, STAR, and MVBench show that ClueNet outperforms state-of-the-art methods by $\ge$ 1.1%, with superior generalization, hallucination mitigation, inference efficiency, and cross-backbone compatibility. This work bridges the perception-to-generation gap in MLLM video understanding, providing an interpretable, faithful reasoning paradigm for high-stakes VideoQA applications.
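The utility-aware clue filtering idea can be sketched in miniature. This is an illustrative toy only, not ClueNet's implementation: the paper uses a learned, MLLM-based filter, whereas here `relevance` is a crude token-overlap proxy and the "adaptive" threshold is just the mean score over candidates; all function names and the example data are assumptions.

```python
# Toy sketch of utility-aware clue filtering (NOT the paper's method):
# score each candidate visual clue against the question, then keep only
# clues above a threshold adapted to the score distribution, before
# passing the survivors to chain-of-thought answer generation.

def relevance(clue: str, question: str) -> float:
    """Fraction of question tokens appearing in the clue -- a crude
    stand-in for the learned utility score a trained filter would give."""
    q = {w.strip("?.,!").lower() for w in question.split()}
    c = {w.strip("?.,!").lower() for w in clue.split()}
    return len(q & c) / len(q) if q else 0.0

def adaptive_clue_filter(clues: list[str], question: str) -> list[str]:
    """Keep clues scoring above the mean relevance (adaptive threshold)."""
    scores = [relevance(c, question) for c in clues]
    threshold = sum(scores) / len(scores) if scores else 0.0
    return [c for c, s in zip(clues, scores) if s > threshold]

question = "Why does the man pick up the ball?"
clues = [
    "the man bends down and picks up a red ball",
    "a dog runs toward the man",
    "trees sway in the background",
]
print(adaptive_clue_filter(clues, question))
# → ['the man bends down and picks up a red ball']
```

The design point this mimics is that filtering happens between clue extraction and reasoning, so the downstream answer is conditioned only on clues judged useful for the question rather than on everything the perception stage produced.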
Problem

Research questions and friction points this paper is trying to address.

Video Question Answering
Visual Clues
Temporal Reasoning
Hallucination
Multi-modal Large Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Clue-aware Reasoning
Video Question Answering
Multi-modal Large Language Models
Structured Reasoning
Adaptive Clue Filtering
Kaixin Zhang
Aerospace Information Research Institute, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Xiaohe Li
Aerospace Information Research Institute, Chinese Academy of Sciences
Jiahao Li
Aerospace Information Research Institute, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Haohua Wu
Aerospace Information Research Institute, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Xinyu Zhao
The University of North Carolina at Chapel Hill
Zide Fan
Aerospace Information Research Institute, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Lei Wang
Aerospace Information Research Institute, Chinese Academy of Sciences; University of Chinese Academy of Sciences