CollabVR: Collaborative Video Reasoning with Vision-Language and Video Generation Models

📅 2026-05-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing video generation models often suffer from mid-clip simulation errors and long-horizon drift on multi-step tasks because they lack an explicit reasoning mechanism. This work proposes CollabVR, a framework that sets up closed-loop collaborative reasoning between vision-language models (VLMs) and video generation models (VGMs) at the step level: the VLM plans the next action, verifies the clip generated at each step, and injects its diagnostic feedback into the next action prompt to correct errors. By embedding planning and verification directly in the generation loop, CollabVR mitigates error accumulation. On Gen-ViRe and VBVR-Bench, CollabVR outperforms single-pass inference, Pass@$k$, and prior test-time scaling approaches at matched compute, with the largest gains on the hardest tasks, and it yields further improvements when applied on top of a reasoning-fine-tuned VGM.

📝 Abstract

Recent "Thinking with Video" approaches use Video Generation Models (VGMs) for visual reasoning by producing temporally coherent Chain-of-Frames as reasoning artifacts. Even strong VGMs, however, exhibit two recurring failure modes on goal-directed tasks: long-horizon drift on multi-step tasks and mid-clip simulation errors that compound. Both stem from the absence of explicit reasoning built upon the VGM's short-horizon visual prior, a role naturally filled by Vision-Language Models (VLMs), but where to place the VLM is non-trivial: upfront plans commit before any frame is generated and post-hoc critiques over whole videos intervene too late. We propose VLM-VGM Collaborative Video Reasoning (CollabVR), a closed-loop framework that couples the VLM with the VGM at step-level granularity: the VLM plans the immediate next action, inspects the clip the VGM generates, and folds the verifier's diagnosis directly into the next action prompt to repair detected failures. On Gen-ViRe and VBVR-Bench, CollabVR improves both open-source and closed-source VGMs over single-inference, Pass@$k$, and prior test-time scaling baselines at matched compute, with the largest gains on the hardest tasks. It also yields further improvements on top of a reasoning-fine-tuned VGM, indicating that step-level VLM supervision is orthogonal to and stackable with reasoning-oriented fine-tuning. We provide video samples and additional qualitative results at our project page: https://joow0n-kim.github.io/collabvr-project-page.
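
The plan, generate, verify, and repair cycle described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `Verdict`, `collab_loop`, `plan`, `generate`, `verify`, and `max_calls` are hypothetical, clips are stood in for by plain strings, and the choices to discard a rejected clip and to cap the loop by a fixed number of VGM calls are assumptions made only to keep the example small.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Verdict:
    """Outcome of the VLM's inspection of one generated clip (hypothetical type)."""
    ok: bool         # True if the clip carried out the planned action
    diagnosis: str   # free-text description of the failure, empty when ok


# Hypothetical callables standing in for the VLM and the VGM.
# plan(goal, accepted_clips, feedback) -> next action prompt, or None when done
Planner = Callable[[str, List[str], Optional[str]], Optional[str]]
# generate(action_prompt, accepted_clips) -> new clip
Generator = Callable[[str, List[str]], str]
# verify(action_prompt, clip) -> Verdict
Verifier = Callable[[str, str], Verdict]


def collab_loop(
    goal: str,
    plan: Planner,
    generate: Generator,
    verify: Verifier,
    max_calls: int = 6,
) -> List[str]:
    """Run a step-level closed loop and return the accepted clips in order."""
    accepted: List[str] = []
    feedback: Optional[str] = None  # diagnosis carried into the next prompt

    for _ in range(max_calls):  # assumed budget: one VGM call per iteration
        # The planner sees only the immediate next step, plus any diagnosis
        # from the previous failed attempt.
        action = plan(goal, accepted, feedback)
        if action is None:
            break  # planner considers the goal reached

        clip = generate(action, accepted)
        verdict = verify(action, clip)

        if verdict.ok:
            accepted.append(clip)
            feedback = None
        else:
            # Assumption: the rejected clip is dropped and the step is retried
            # with the diagnosis folded into the next action prompt.
            feedback = verdict.diagnosis

    return accepted
```

In this sketch the planner is called once per step rather than once up front, and the verifier runs after every clip rather than once over the whole video, which is the placement the abstract argues for.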

Problem

Research questions and friction points this paper is trying to address.

Video Generation Models

Visual Reasoning

Long-horizon Drift

Simulation Errors

Vision-Language Models

Innovation

Methods, ideas, or system contributions that make the work stand out.

Collaborative Video Reasoning

Vision-Language Models

Video Generation Models

Step-level Supervision

Closed-loop Framework

🔎 Similar Papers

Chrono: A Simple Blueprint for Representing Time in MLLMs

2024-06-26 · Citations: 4

Through the Theory of Mind's Eye: Reading Minds with Multimodal Video Large Language Models

2024-06-19 · arXiv.org · Citations: 5