🤖 AI Summary
Text-to-video generation suffers from fine-grained text–video misalignment, particularly in complex multi-object scenes. To address this, we propose VideoRepair, a training-free, model-agnostic, two-stage post-hoc refinement framework. First, a multimodal large language model (MLLM) performs fine-grained visual question answering to automatically localize mismatched regions and generate spatially grounded textual feedback. Second, Region-Preserving Segmentation (RPS) and frame-level spatial masks guide localized re-generation. To our knowledge, this is the first work to leverage MLLMs for automated alignment diagnosis and feedback generation, and the first model-agnostic, training-free video post-refinement paradigm. Our method achieves state-of-the-art performance on EvalCrafter and T2V-CompBench, with significant improvements in alignment metrics—including CLIPScore and T2V-Sim—demonstrating robust and precise correction in complex scenarios.
📝 Abstract
Recent text-to-video (T2V) diffusion models have demonstrated impressive generation capabilities across various domains. However, these models often generate videos that are misaligned with their text prompts, especially when the prompts describe complex scenes with multiple objects and attributes. To address this, we introduce VideoRepair, a novel model-agnostic, training-free video refinement framework that automatically identifies fine-grained text-video misalignments and generates explicit spatial and textual feedback, enabling a T2V diffusion model to perform targeted, localized refinements. VideoRepair consists of two stages. In (1) video refinement planning, we first detect misalignments by generating fine-grained evaluation questions and answering them with an MLLM. Based on the evaluation outputs, we identify accurately generated objects and construct localized prompts that precisely target the misaligned regions. In (2) localized refinement, we improve video alignment by 'repairing' the misaligned regions of the original video while preserving the correctly generated areas, using frame-wise region decomposition from our Region-Preserving Segmentation (RPS) module. On two popular video generation benchmarks (EvalCrafter and T2V-CompBench), VideoRepair substantially outperforms recent baselines across various text-video alignment metrics. We provide a comprehensive analysis of VideoRepair's components along with qualitative examples.
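The two-stage loop described above can be sketched in a few lines. This is a hypothetical illustration only: the function names, the hard-coded VQA verdicts, and the stubbed refinement step are stand-ins for the paper's actual MLLM evaluation and masked re-generation, not its implementation.

```python
# Illustrative sketch of the VideoRepair two-stage loop (assumptions, not the real code).
from dataclasses import dataclass

@dataclass
class EvalResult:
    obj: str        # object mentioned in the text prompt
    question: str   # fine-grained evaluation question about it
    correct: bool   # MLLM's VQA verdict on the generated video

def mock_mllm_vqa(prompt_objects):
    # Stand-in for the MLLM evaluator: here we hard-code that "dog"
    # was generated correctly but "red ball" was not.
    verdicts = {"dog": True, "red ball": False}
    return [EvalResult(o, f"Is there a {o} in the video?", verdicts.get(o, False))
            for o in prompt_objects]

def plan_refinement(prompt_objects):
    """Stage 1: detect misalignments, split objects into keep/repair sets,
    and build a localized prompt targeting only the misaligned content."""
    results = mock_mllm_vqa(prompt_objects)
    keep = [r.obj for r in results if r.correct]
    repair = [r.obj for r in results if not r.correct]
    return keep, repair, ", ".join(repair)

def localized_refine(video, keep, repair_prompt):
    """Stage 2 (stub): a real implementation would segment the `keep`
    objects per frame (RPS), build spatial masks, and re-run masked
    diffusion generation only inside the misaligned regions."""
    return {"video": video, "preserved": keep, "regenerated_from": repair_prompt}

keep, repair, loc_prompt = plan_refinement(["dog", "red ball"])
out = localized_refine("original_video", keep, loc_prompt)
```

In this toy run, the planner keeps the correctly generated "dog" region and re-generates only the "red ball", mirroring the repair-while-preserving behavior described in the abstract.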