Conan: Progressive Learning to Reason Like a Detective over Multi-Scale Visual Evidence

📅 2025-10-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Multimodal video reasoning faces two key challenges: existing reinforcement learning (RL) methods rely solely on textual reasoning chains, leading to visual hallucinations, while frame-retrieval approaches incorporate visual grounding but suffer from inaccurate evidence localization. To address these, we propose Conan, an evidence-grounded video reasoning framework trained with a multi-stage progressive cold-start strategy and an Identification-Reasoning-Action (AIR) RLVR framework, enabling visual-evidence-driven, cross-frame, multi-step causal reasoning. To support training, we introduce Conan-91K, a large-scale dataset of automatically generated multi-step reasoning trajectories. Conan achieves new state-of-the-art performance, averaging over 10% absolute improvement over Qwen2.5-VL-7B-Instruct across six benchmarks, and demonstrates strong generalization on long-video tasks. The core contribution is a unified RL framework that jointly optimizes dynamic visual evidence retrieval and interpretable, multi-step reasoning within a single coherent architecture.

📝 Abstract
Video reasoning, which requires multi-step deduction across frames, remains a major challenge for multimodal large language models (MLLMs). While reinforcement learning (RL)-based methods enhance reasoning capabilities, they often rely on text-only chains that yield ungrounded or hallucinated conclusions. Conversely, frame-retrieval approaches introduce visual grounding but still struggle with inaccurate evidence localization. To address these challenges, we present Conan, a framework for evidence-grounded multi-step video reasoning. Conan identifies contextual and evidence frames, reasons over cross-frame clues, and adaptively decides when to conclude or explore further. To achieve this, we (1) construct Conan-91K, a large-scale dataset of automatically generated reasoning traces that includes frame identification, evidence reasoning, and action decision, and (2) design a multi-stage progressive cold-start strategy combined with an Identification-Reasoning-Action (AIR) RLVR training framework to jointly enhance multi-step visual reasoning. Extensive experiments on six multi-step reasoning benchmarks demonstrate that Conan surpasses the baseline Qwen2.5-VL-7B-Instruct by an average of over 10% in accuracy, achieving state-of-the-art performance. Furthermore, Conan generalizes effectively to long-video understanding tasks, validating its strong scalability and robustness.
Problem

Research questions and friction points this paper is trying to address.

Enhancing multi-step video reasoning in multimodal language models
Addressing visual grounding and evidence localization inaccuracies
Improving progressive learning for cross-frame clue reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Progressive learning strategy for multi-step reasoning
Identification-Reasoning-Action RLVR training framework
Large-scale dataset with automated reasoning traces
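The Identification-Reasoning-Action loop described in the abstract can be sketched as an iterative evidence-gathering procedure. This is a minimal illustrative sketch only; the class and method names (`ToyModel`, `identify_frames`, `reason_over`, `decide_action`) and the stopping heuristic are assumptions, not the paper's actual API.

```python
class ToyModel:
    """Stand-in for an MLLM policy; returns canned outputs for illustration."""

    def identify_frames(self, frames, question, evidence):
        # Identification: pick the next unexamined frame as candidate evidence.
        remaining = [f for f in frames if f not in evidence]
        return remaining[:1]

    def reason_over(self, evidence, question):
        # Reasoning: summarize conclusions drawn from the evidence so far.
        return f"reasoned over {len(evidence)} evidence frame(s)"

    def decide_action(self, thought, evidence, frames):
        # Action: conclude once half the frames have been inspected (toy rule).
        if len(evidence) * 2 >= len(frames):
            return "answer", thought
        return "explore", None


def air_loop(frames, question, model, max_steps=8):
    """Iterate Identify -> Reason -> Act until the model decides to answer."""
    evidence = []
    for _ in range(max_steps):
        evidence.extend(model.identify_frames(frames, question, evidence))
        thought = model.reason_over(evidence, question)
        action, answer = model.decide_action(thought, evidence, frames)
        if action == "answer":
            return answer
    # Fall back to the latest reasoning if the step budget runs out.
    return model.reason_over(evidence, question)
```

In the actual framework this loop would be driven by a trained policy whose identification, reasoning, and action decisions are optimized jointly with RLVR rewards, rather than by fixed rules.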
Kun Ouyang
National University of Singapore
human mobility, machine learning
Yuanxin Liu
Peking University
Natural Language Processing
Linli Yao
Peking University
multi-modal semantic understanding
Yishuo Cai
State Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University
Hao Zhou
WeChat AI, Tencent Inc., China
Jie Zhou
WeChat AI, Tencent Inc., China
Fandong Meng
WeChat AI, Tencent
Machine Translation, Natural Language Processing
Xu Sun
State Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University