🤖 AI Summary
Multimodal video reasoning faces two key challenges: existing reinforcement learning (RL) methods rely solely on textual reasoning chains, leading to visual hallucinations, while frame-retrieval approaches incorporate visual grounding but suffer from inaccurate evidence localization. To address these, we propose Conan, an evidence-grounded video reasoning framework that combines multi-stage progressive cold-start training with an Identification-Reasoning-Action (AIR) RLVR framework, enabling visual-evidence-driven, cross-frame, multi-step causal reasoning. To support training, we introduce Conan-91K, a large-scale dataset of automatically generated multi-step reasoning trajectories. Conan achieves new state-of-the-art performance, averaging over 10% absolute accuracy improvement over Qwen2.5-VL-7B-Instruct across six benchmarks, and demonstrates strong generalization on long-video tasks. Our core contribution is a unified RL framework that jointly optimizes dynamic visual evidence retrieval and interpretable, multi-step reasoning within a single coherent architecture.
📝 Abstract
Video reasoning, which requires multi-step deduction across frames, remains a major challenge for multimodal large language models (MLLMs). While reinforcement learning (RL)-based methods enhance reasoning capabilities, they often rely on text-only chains that yield ungrounded or hallucinated conclusions. Conversely, frame-retrieval approaches introduce visual grounding but still struggle with inaccurate evidence localization. To address these challenges, we present Conan, a framework for evidence-grounded multi-step video reasoning. Conan identifies contextual and evidence frames, reasons over cross-frame clues, and adaptively decides when to conclude or explore further. To achieve this, we (1) construct Conan-91K, a large-scale dataset of automatically generated reasoning traces that include frame identification, evidence reasoning, and action decision, and (2) design a multi-stage progressive cold-start strategy combined with an Identification-Reasoning-Action (AIR) training framework based on reinforcement learning with verifiable rewards (RLVR) to jointly enhance multi-step visual reasoning. Extensive experiments on six multi-step reasoning benchmarks demonstrate that Conan surpasses the baseline Qwen2.5-VL-7B-Instruct by an average of over 10% in accuracy, achieving state-of-the-art performance. Furthermore, Conan generalizes effectively to long-video understanding tasks, validating its strong scalability and robustness.
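The identify-reason-act cycle described above can be sketched as a simple control loop: identify candidate evidence frames, reason over the accumulated evidence, then act by either concluding or exploring further. This is a minimal toy sketch, not the paper's implementation; the relevance scoring, confidence measure, frame budget, and all function names below are illustrative assumptions.

```python
def identify(frames, evidence):
    """Toy identification step: pick up to two not-yet-inspected frames.

    Assumption: a frame's numeric value stands in for its relevance score;
    the real model would use learned cross-frame attention instead.
    """
    unseen = [f for f in frames if f not in evidence]
    return sorted(unseen, reverse=True)[:2]

def reason(evidence):
    """Toy reasoning step: confidence grows with accumulated evidence."""
    total = sum(evidence)
    return total / (total + 10)

def air_loop(frames, threshold=0.6, max_rounds=4):
    """Iterate identify -> reason -> action until confident or out of budget."""
    evidence = []
    for _ in range(max_rounds):
        evidence += identify(frames, evidence)   # Identification
        confidence = reason(evidence)            # Reasoning
        if confidence >= threshold:              # Action: conclude
            return "answer", evidence
    return "answer", evidence                    # forced conclusion at budget

action, used_frames = air_loop([5, 4, 3, 2, 1])
```

Here the loop gathers evidence over three rounds before its toy confidence crosses the threshold, mirroring the adaptive "conclude or explore further" decision the abstract describes.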