A Reason-then-Describe Instruction Interpreter for Controllable Video Generation

📅 2025-11-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Diffusion-based video generation models suffer from a semantic gap between users’ vague, concise instructions and the fine-grained prompts used during training, leading to poor controllability and instruction-output misalignment. To address this, we propose ReaDe—a universal, model-agnostic instruction interpreter that introduces a novel “reason-then-describe” paradigm. ReaDe explicitly models complex instruction logic via stepwise reasoning trajectories and employs a multi-dimensional reward mechanism to refine natural language descriptions with fine-grained calibration. Its two-stage training—reasoning-enhanced supervised parsing followed by reward-driven language fine-tuning—significantly improves zero-shot generalization to unseen complex instructions. Experiments demonstrate that ReaDe substantially enhances instruction fidelity, description accuracy, and video quality across both single- and multi-condition generation settings. Moreover, it seamlessly integrates with diverse downstream video generation models, exhibiting strong universality and consistent performance gains.

📝 Abstract
Diffusion Transformers have significantly improved video fidelity and temporal coherence; however, practical controllability remains limited. Concise, ambiguous, and compositionally complex user inputs contrast with the detailed prompts used in training, yielding an intent-output mismatch. We propose ReaDe, a universal, model-agnostic interpreter that converts raw instructions into precise, actionable specifications for downstream video generators. ReaDe follows a reason-then-describe paradigm: it first analyzes the user request to identify core requirements and resolve ambiguities, then produces detailed guidance that enables faithful, controllable generation. We train ReaDe via a two-stage optimization: (i) reasoning-augmented supervision imparts analytic parsing with stepwise traces and dense captions, and (ii) a multi-dimensional reward assigner enables stable, feedback-driven refinement for natural-style captions. Experiments across single- and multi-condition scenarios show consistent gains in instruction fidelity, caption accuracy, and downstream video quality, with strong generalization to reasoning-intensive and unseen inputs. ReaDe offers a practical route to aligning controllable video generation with accurately interpreted user intent. Project Page: https://sqwu.top/ReaDe/.
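The reason-then-describe paradigm can be pictured as a two-step prompt-and-parse loop around an interpreter model. The sketch below is a minimal illustration under assumed conventions: the function names (`build_prompt`, `parse_response`) and the `<reason>`/`<describe>` tag format are hypothetical, not the paper's actual implementation.

```python
# Illustrative sketch of a reason-then-describe interpreter wrapper.
# The tag format and helper names are assumptions, not ReaDe's real interface.

REASON_TAG, DESCRIBE_TAG = "<reason>", "<describe>"

def build_prompt(user_instruction: str) -> str:
    """Ask the interpreter model to reason first, then describe."""
    return (
        "You are a video-generation instruction interpreter.\n"
        f"Step 1 ({REASON_TAG}): identify the core requirements and resolve "
        "ambiguities in the request.\n"
        f"Step 2 ({DESCRIBE_TAG}): produce a detailed, actionable prompt for a "
        "downstream video generator.\n\n"
        f"User request: {user_instruction}"
    )

def parse_response(text: str) -> tuple[str, str]:
    """Split a model response into its reasoning trace and final description."""
    reasoning = text.split(REASON_TAG, 1)[1].split(DESCRIBE_TAG, 1)[0].strip()
    description = text.split(DESCRIBE_TAG, 1)[1].strip()
    return reasoning, description
```

The parsed description, rather than the raw user instruction, is what gets passed to the downstream video generator; the reasoning trace is kept for supervision during the paper's first training stage.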
Problem

Research questions and friction points this paper is trying to address.

Limited practical controllability in video generation from user inputs
Intent-output mismatch between concise user instructions and training prompts
Difficulty converting ambiguous instructions into precise video specifications
Innovation

Methods, ideas, or system contributions that make the work stand out.

ReaDe interpreter converts instructions into precise specifications
Uses reason-then-describe paradigm to analyze requirements and resolve ambiguities
Employs two-stage optimization with reasoning supervision and reward refinement
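The reward-driven second stage can be sketched as scoring candidate descriptions along several axes and combining those scores into one scalar used for refinement. The dimension names and weights below are illustrative assumptions; the paper does not specify them here.

```python
# Hedged sketch of a multi-dimensional reward assigner: each candidate
# description gets per-dimension scores in [0, 1], combined by a weighted
# sum. Dimensions and weights are hypothetical, not ReaDe's actual reward.

DIMENSIONS = {
    "instruction_fidelity": 0.5,   # does the caption match user intent?
    "detail_completeness": 0.3,    # are all requested elements covered?
    "natural_style": 0.2,          # does it read like a natural caption?
}

def combine_reward(scores: dict[str, float]) -> float:
    """Weighted sum over reward dimensions."""
    return sum(DIMENSIONS[d] * scores[d] for d in DIMENSIONS)

def pick_best(candidates: list[dict[str, float]]) -> int:
    """Index of the candidate with the highest combined reward."""
    return max(range(len(candidates)),
               key=lambda i: combine_reward(candidates[i]))
```

In a feedback-driven fine-tuning loop, the combined scalar would serve as the training signal that steers the interpreter toward natural-style captions without degrading fidelity.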