Astrolabe: Steering Forward-Process Reinforcement Learning for Distilled Autoregressive Video Models

📅 2026-03-17
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the misalignment between the outputs of distilled autoregressive video generation models and human visual preferences, as well as the high computational and memory cost of existing reinforcement learning approaches. To this end, the authors propose an efficient online reinforcement learning framework built on a novel forward-process paradigm. The framework integrates negative-aware fine-tuning, streaming training with a rolling key-value (KV) cache, localized segment-wise policy updates, multi-objective reward optimization, and a dynamic reference mechanism. Together, these components substantially reduce resource consumption while improving temporal consistency and alignment with human preferences in long-form video generation. Experiments show that the method achieves efficient, stable, and scalable quality improvements across multiple distilled autoregressive video models.
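The streaming scheme summarized above (rolling KV cache plus localized segment-wise updates) can be illustrated with a minimal Python sketch. All names here (`RollingKVCache`, `stream_generate`, `gen_frame`, `update_clip`) are hypothetical stand-ins chosen for illustration, not the paper's actual API; the point is only the control flow: a bounded context buffer and an RL update that touches one clip window at a time.

```python
from collections import deque

class RollingKVCache:
    """Bounded buffer of per-frame key/value states (hypothetical sketch):
    old entries are evicted automatically, so memory stays constant
    no matter how long the generated video grows."""
    def __init__(self, max_frames):
        self.frames = deque(maxlen=max_frames)

    def append(self, kv):
        self.frames.append(kv)

    def context(self):
        # Prior-frame context the model conditions on.
        return list(self.frames)

def stream_generate(num_clips, clip_len, cache, gen_frame, update_clip):
    """Generate clips sequentially; the policy update touches only the
    current clip window, while earlier frames enter via the rolling cache."""
    video = []
    for _ in range(num_clips):
        clip = []
        for _ in range(clip_len):
            frame = gen_frame(cache.context())  # condition on rolling context
            cache.append(frame)                 # frame becomes future context
            clip.append(frame)
        update_clip(clip)  # localized segment-wise policy update
        video.extend(clip)
    return video
```

Because the cache has a fixed `maxlen`, peak memory is independent of total video length, which is what makes long-form RL training feasible in this setting.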

Technology Category

Application Category

πŸ“ Abstract
Distilled autoregressive (AR) video models enable efficient streaming generation but frequently misalign with human visual preferences. Existing reinforcement learning (RL) frameworks are not naturally suited to these architectures, typically requiring either expensive re-distillation or solver-coupled reverse-process optimization that introduces considerable memory and computational overhead. We present Astrolabe, an efficient online RL framework tailored for distilled AR models. To overcome existing bottlenecks, we introduce a forward-process RL formulation based on negative-aware fine-tuning. By contrasting positive and negative samples directly at inference endpoints, this approach establishes an implicit policy improvement direction without requiring reverse-process unrolling. To scale this alignment to long videos, we propose a streaming training scheme that generates sequences progressively via a rolling KV-cache, applying RL updates exclusively to local clip windows while conditioning on prior context to ensure long-range coherence. Finally, to mitigate reward hacking, we integrate a multi-reward objective stabilized by uncertainty-aware selective regularization and dynamic reference updates. Extensive experiments demonstrate that our method consistently enhances generation quality across multiple distilled AR video models, serving as a robust and scalable alignment solution.
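The abstract's "contrasting positive and negative samples directly at inference endpoints" is not accompanied by a reference implementation, but a DPO-style pairwise loss is one plausible scalar stand-in: push up the log-likelihood of the reward-preferred endpoint sample relative to the dispreferred one, with no reverse-process unrolling. The function name, the `beta` temperature, and the exact form are assumptions for illustration.

```python
import math

def endpoint_contrast_loss(logp_pos, logp_neg, beta=1.0):
    """Hypothetical negative-aware objective: -log(sigmoid(beta * margin)),
    where margin is how much more likely the model finds the preferred
    endpoint sample than the dispreferred one. The loss shrinks as the
    positive sample gains likelihood over the negative one, giving an
    implicit policy improvement direction from forward passes alone."""
    margin = logp_pos - logp_neg
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

At equal likelihoods the loss is log 2, and it decreases monotonically as the margin grows, so its gradient always points toward preferring the positive sample.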
Problem

Research questions and friction points this paper is trying to address.

distilled autoregressive video models
human visual preferences
reinforcement learning
forward-process
reward alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

forward-process reinforcement learning
distilled autoregressive video models
negative-aware fine-tuning
streaming training
multi-reward alignment
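The multi-reward alignment listed above is described in the abstract as a multi-reward objective stabilized by uncertainty-aware selective regularization and dynamic reference updates. The sketch below is one guess at what those three pieces could look like; every function name and hyperparameter (`tau`, `lam`, `rho`) is invented for illustration, not taken from the paper.

```python
import statistics

def aggregate_rewards(scores, weights):
    """Weighted mean of several reward-model scores (sketch of the
    multi-reward objective that discourages hacking any single reward)."""
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)

def selective_kl_penalty(policy_logp, ref_logp, scores, tau=0.25, lam=0.1):
    """A guess at uncertainty-aware selective regularization: anchor the
    policy to a reference only where the reward models disagree (high
    score spread), leaving confidently-scored samples unregularized."""
    spread = statistics.pstdev(scores)
    return lam * (policy_logp - ref_logp) if spread > tau else 0.0

def ema_reference_update(ref_params, policy_params, rho=0.99):
    """Dynamic reference: let the anchor slowly track the improving
    policy so regularization does not pin the model to its initialization."""
    return [rho * r + (1.0 - rho) * p
            for r, p in zip(ref_params, policy_params)]
```

The design intuition is that a fixed reference plus a uniform KL penalty either over-constrains the policy or, if weakened, invites reward hacking; gating the penalty by reward disagreement and moving the reference keeps both failure modes in check.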