InfLVG: Reinforce Inference-Time Consistent Long Video Generation with GRPO

πŸ“… 2025-05-23
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
To address semantic inconsistency across scenes, weak text-prompt alignment, and the prohibitive computational overhead of context growth in autoregressive long-video generation, this paper proposes InfLVG, an inference-time dynamic context selection framework that requires no additional long-form video data. The method introduces: (1) a learnable context selection policy optimized via Group Relative Policy Optimization (GRPO); (2) a multi-objective hybrid reward function jointly optimizing semantic alignment, cross-scene consistency, and artifact suppression; and (3) a dynamic top-K semantic filtering mechanism that keeps inference within a fixed computational budget. Evaluated on the Cross-scene Video Benchmark (CsVBench) and its Event Prompt Set (EPS), the approach extends video length by up to 9Γ— while significantly improving cross-scene consistency and semantic fidelity. The implementation is publicly available.
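As a rough illustration of the top-K semantic filtering step, the sketch below scores each accumulated context token against the current prompt and keeps only the K most relevant ones. The function name, the `score_head` module, and the tensor shapes are assumptions for illustration, not the paper's released implementation.

```python
import torch

def select_topk_context(context_tokens: torch.Tensor,
                        prompt_embedding: torch.Tensor,
                        score_head: torch.nn.Module,
                        k: int) -> torch.Tensor:
    """Rank historical context tokens by semantic relevance to the
    current scene prompt and keep only the top-K, so attention cost
    stays bounded as the generated history grows.

    context_tokens:   (N, d) embeddings of previously generated tokens
    prompt_embedding: (d,)   embedding of the current scene prompt
    score_head:       small learnable module mapping (N, 2d) -> (N, 1)
    """
    # Condition each context token on the current prompt.
    prompt = prompt_embedding.unsqueeze(0).expand_as(context_tokens)
    logits = score_head(torch.cat([context_tokens, prompt], dim=-1)).squeeze(-1)

    # Keep the K most relevant tokens (or all of them if N <= K),
    # restoring temporal order so the decoder sees a coherent history.
    k = min(k, context_tokens.shape[0])
    topk = torch.topk(logits, k=k).indices
    return context_tokens[topk.sort().values]
```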

πŸ“ Abstract
Recent advances in text-to-video generation, particularly with autoregressive models, have enabled the synthesis of high-quality videos depicting individual scenes. However, extending these models to generate long, cross-scene videos remains a significant challenge. As the context length grows during autoregressive decoding, computational costs rise sharply, and the model's ability to maintain consistency and adhere to evolving textual prompts deteriorates. We introduce InfLVG, an inference-time framework that enables coherent long video generation without requiring additional long-form video data. InfLVG leverages a learnable context selection policy, optimized via Group Relative Policy Optimization (GRPO), to dynamically identify and retain the most semantically relevant context throughout the generation process. Instead of accumulating the entire generation history, the policy ranks and selects the top-$K$ most contextually relevant tokens, allowing the model to maintain a fixed computational budget while preserving content consistency and prompt alignment. To optimize the policy, we design a hybrid reward function that jointly captures semantic alignment, cross-scene consistency, and artifact reduction. To benchmark performance, we introduce the Cross-scene Video Benchmark (CsVBench) along with an Event Prompt Set (EPS) that simulates complex multi-scene transitions involving shared subjects and varied actions/backgrounds. Experimental results show that InfLVG can extend video length by up to 9$\times$, achieving strong consistency and semantic fidelity across scenes. Our code is available at https://github.com/MAPLE-AIGC/InfLVG.
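For concreteness, here is a minimal sketch of the two reward-side ingredients the abstract describes: a hybrid reward combining the three objectives, and GRPO's group-relative advantage, which standardizes each sampled rollout's reward within its group instead of relying on a learned value baseline. The equal default weights and function names are assumptions, not the paper's exact formulation.

```python
import torch

def hybrid_reward(sem_align: torch.Tensor,
                  consistency: torch.Tensor,
                  artifact_score: torch.Tensor,
                  w=(1.0, 1.0, 1.0)) -> torch.Tensor:
    """Weighted combination of the three reward terms: semantic
    alignment, cross-scene consistency, and artifact reduction
    (artifact_score is a penalty, so it enters with a negative sign)."""
    return w[0] * sem_align + w[1] * consistency - w[2] * artifact_score

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Group-relative advantages: normalize the G rewards obtained from
    G context selections sampled for the same generation state."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```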
Problem

Research questions and friction points this paper is trying to address.

Extending single-scene text-to-video models to long, cross-scene videos
Computational cost that rises sharply as the autoregressive context grows
Consistency and adherence to evolving prompts that deteriorate over long generations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Optimizes a dynamic context selection policy with GRPO
Maintains a fixed computational budget via top-K token selection
Hybrid reward jointly optimizes alignment, consistency, and artifact suppression
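Putting these pieces together, the per-scene inference loop might look like the sketch below: before decoding each scene, the accumulated history is pruned to the top-K relevant tokens, so the attention context never exceeds a fixed budget. `policy.select` and `model.decode` are placeholders, not the actual InfLVG API.

```python
def generate_long_video(model, policy, prompts, k, tokens_per_scene):
    """Autoregressive multi-scene generation under a fixed context budget.

    model:  video generator exposing decode(context, prompt, n) -> tokens
    policy: trained selection policy exposing select(history, prompt, k)
    """
    history = []  # every context token generated so far
    scenes = []
    for prompt in prompts:
        # Prune the full history to at most k prompt-relevant tokens.
        context = policy.select(history, prompt, k=k)
        chunk = model.decode(context, prompt, n=tokens_per_scene)
        history.extend(chunk)
        scenes.append(chunk)
    return scenes
```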
πŸ‘₯ Authors

Xueji Fang
Zhejiang University
Diffusion Models, Multimodal Language Models, Computer Vision
Liyuan Ma
Zhejiang University
image synthesis, generative model, GAN, Diffusion Model
Zhiyang Chen
MAPLE Lab, Westlake University
Mingyuan Zhou
MAPLE Lab, Westlake University
Guo-jun Qi
MAPLE Lab, Westlake University