Helios: Real Real-Time Long Video Generation Model

📅 2026-03-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses key challenges in long-form video generation—namely temporal drift, poor real-time performance, and high computational costs—by introducing Helios, a 14B-parameter autoregressive diffusion model capable of unified text-, image-, and video-to-video generation. Through a training strategy that simulates temporal drift, combined with historical and noise context compression, few-step sampling, and infrastructure-level optimizations, Helios achieves high-quality, minute-long video generation at 19.5 FPS on a single NVIDIA H100 GPU. Notably, it accomplishes this without relying on conventional anti-drift heuristics or acceleration mechanisms such as KV caching or sparse attention. The approach substantially reduces both memory footprint and computational overhead, outperforming strong baselines across both short- and long-form video generation tasks.
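The drift-simulation idea described above can be illustrated with a small sketch. This is not the paper's implementation; it is a hypothetical illustration of the general principle: instead of always conditioning on clean ground-truth history, the training pipeline corrupts the historical context so the model learns to stay stable when its own imperfect outputs are fed back during long autoregressive rollouts. The function name and noise schedule are assumptions.

```python
import numpy as np

def simulate_drift(context_frames, max_noise=0.1, rng=None):
    """Corrupt ground-truth history frames with progressively stronger noise.

    Later frames receive larger perturbations, mimicking the error that
    accumulates over an autoregressive rollout, so the model is trained
    on the kind of degraded context it will actually see at inference.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    corrupted = []
    n = len(context_frames)
    for i, frame in enumerate(context_frames):
        # Noise scale grows linearly with the frame's position in history.
        scale = max_noise * (i + 1) / n
        corrupted.append(frame + rng.normal(0.0, scale, frame.shape))
    return corrupted
```

In a real training loop, these corrupted frames would replace the clean conditioning context for a fraction of training steps, exposing the model to drift-like inputs without requiring self-forcing-style rollouts during training.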

📝 Abstract
We introduce Helios, the first 14B video generation model that runs at 19.5 FPS on a single NVIDIA H100 GPU and supports minute-scale generation while matching the quality of a strong baseline. We make breakthroughs along three key dimensions: (1) robustness to long-video drifting without commonly used anti-drifting heuristics such as self-forcing, error-banks, or keyframe sampling; (2) real-time generation without standard acceleration techniques such as KV-cache, sparse/linear attention, or quantization; and (3) training without parallelism or sharding frameworks, enabling image-diffusion-scale batch sizes while fitting up to four 14B models within 80 GB of GPU memory. Specifically, Helios is a 14B autoregressive diffusion model with a unified input representation that natively supports T2V, I2V, and V2V tasks. To mitigate drifting in long-video generation, we characterize typical failure modes and propose simple yet effective training strategies that explicitly simulate drifting during training, while eliminating repetitive motion at its source. For efficiency, we heavily compress the historical and noisy context and reduce the number of sampling steps, yielding computational costs comparable to -- or lower than -- those of 1.3B video generative models. Moreover, we introduce infrastructure-level optimizations that accelerate both inference and training while reducing memory consumption. Extensive experiments demonstrate that Helios consistently outperforms prior methods on both short- and long-video generation. We plan to release the code, base model, and distilled model to support further development by the community.
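The abstract's efficiency claim rests on heavily compressing the historical context before attention, since attention cost grows quadratically with sequence length. A minimal sketch of one such compression scheme, temporal subsampling of history latents, is shown below. This is an assumption for illustration, not the compression method Helios actually uses; the function name and `keep_ratio` parameter are hypothetical.

```python
import numpy as np

def compress_history(history, keep_ratio=0.25):
    """Subsample historical latent frames along the time axis.

    history: array of shape (T, ...) holding T latent frames.
    Keeps roughly keep_ratio * T evenly spaced frames, shrinking the
    context that attention must process on every generation step.
    """
    t = history.shape[0]
    keep = max(1, int(t * keep_ratio))
    # Evenly spaced indices always include the first and last frame.
    idx = np.linspace(0, t - 1, keep).round().astype(int)
    return history[idx]
```

With a quarter of the context retained, self-attention over the history costs roughly 1/16 of the original FLOPs, which is the kind of saving that lets a 14B model approach the per-step cost of much smaller models.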
Problem

Research questions and friction points this paper is trying to address.

long-video generation
video drifting
real-time generation
computational efficiency
training scalability
Innovation

Methods, ideas, or system contributions that make the work stand out.

long-video generation
real-time inference
drift mitigation
memory-efficient training
autoregressive diffusion model