🤖 AI Summary
This work proposes a teacher-free, reward-guided autoregressive framework for video generation. Existing autoregressive methods rely on high-quality teacher models, a dependence that limits both performance and scalability; when no strong teacher is available, generation quality falls behind that of bidirectional models. The proposed approach instead uses reward signals from reinforcement learning to optimize the generation process, which simplifies the training pipeline considerably while maintaining visual fidelity and temporal consistency. On the VBench benchmark, the method achieves a total score of 84.92, outperforming comparable autoregressive approaches that depend on complex heterogeneous distillation (84.31) and approaching state-of-the-art performance, demonstrating its effectiveness and scalability.
📝 Abstract
While most prior work in video generation relies on bidirectional architectures, recent efforts have sought to adapt these models into autoregressive variants to support near real-time generation. However, such adaptations often depend heavily on teacher models, which can limit performance, particularly when no strong autoregressive teacher is available; as a result, output quality typically lags behind that of the bidirectional counterparts. In this paper, we explore an alternative approach that uses reward signals to guide the generation process. This enables more efficient and scalable autoregressive generation and simplifies training while preserving high visual fidelity and temporal consistency. Through extensive experiments on standard benchmarks, we find that our approach performs comparably to existing autoregressive models and, in some cases, surpasses similarly sized bidirectional models by avoiding constraints imposed by teacher architectures. For example, on VBench, our method achieves a total score of 84.92, slightly above state-of-the-art autoregressive methods that score 84.31 but require significant heterogeneous distillation.