LongCat-Video Technical Report

📅 2025-10-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address poor temporal coherence and low inference efficiency at high resolutions in long-video generation, this paper proposes a unified multi-task foundation model for long-video synthesis based on the Diffusion Transformer (DiT). Methodologically: (i) we design a unified DiT architecture supporting text-to-video, image-to-video, and video continuation; (ii) we enhance temporal modeling via pretraining and adopt a coarse-to-fine spatiotemporal generation strategy; (iii) we integrate block-sparse attention with multi-objective reward-based reinforcement learning from human feedback (RLHF) to improve fidelity and controllability. The 13.6B-parameter model generates minute-long videos at 720p resolution and 30 fps, achieving temporal coherence and visual quality competitive with leading closed- and open-source models. Code and pretrained weights are publicly released.

📝 Abstract
Video generation is a critical pathway toward world models, with efficient long-video inference as a key capability. Toward this end, we introduce LongCat-Video, a foundational video generation model with 13.6B parameters that delivers strong performance across multiple video generation tasks. It particularly excels in efficient, high-quality long video generation, representing our first step toward world models. Key features include:
Unified architecture for multiple tasks: built on the Diffusion Transformer (DiT) framework, LongCat-Video supports Text-to-Video, Image-to-Video, and Video-Continuation tasks with a single model.
Long video generation: pretraining on Video-Continuation tasks enables LongCat-Video to maintain high quality and temporal coherence when generating minutes-long videos.
Efficient inference: LongCat-Video generates 720p, 30 fps videos within minutes by employing a coarse-to-fine generation strategy along both the temporal and spatial axes; Block Sparse Attention further enhances efficiency, particularly at high resolutions.
Strong performance with multi-reward RLHF: multi-reward RLHF training enables LongCat-Video to achieve performance on par with the latest closed-source and leading open-source models.
Code and model weights are publicly available to accelerate progress in the field.
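The coarse-to-fine strategy mentioned in the abstract can be pictured as a staged schedule: draft the video at reduced spatial resolution and frame rate, then refine along both axes to the target. The function name, two-stage layout, and downscaling factors below are illustrative assumptions, not the paper's implementation.

```python
def coarse_to_fine_schedule(target_h=720, target_w=1280, target_fps=30,
                            spatial_factor=2, temporal_factor=2):
    """Hypothetical two-stage plan for coarse-to-fine video generation.

    Stage 1 ("draft") runs at reduced resolution and frame rate, which
    is cheap; stage 2 ("refine") upsamples along both the spatial and
    temporal axes to the final target. Factors are assumptions.
    """
    draft = {
        "h": target_h // spatial_factor,
        "w": target_w // spatial_factor,
        "fps": target_fps // temporal_factor,
    }
    refine = {"h": target_h, "w": target_w, "fps": target_fps}
    return [("draft", draft), ("refine", refine)]
```

With the default factors, a 720p/30 fps target is drafted at 360p/15 fps, so the expensive denoising passes touch far fewer tokens before the refinement stage.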
Problem

Research questions and friction points this paper is trying to address.

Developing efficient long video generation for world models
Creating unified architecture for multiple video generation tasks
Achieving high-quality temporal coherence in minutes-long videos
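The minutes-long generation problem above is commonly handled by autoregressive chunked continuation: each new chunk is generated conditioned on the frames already produced. A minimal sketch under that assumption follows; `denoise_chunk`, the context length, and the chunking scheme are hypothetical, not the paper's exact procedure.

```python
import numpy as np

def generate_long_video(init_frames, denoise_chunk, n_chunks, context=8):
    """Hypothetical autoregressive loop for long-video synthesis.

    Each iteration conditions on the last `context` frames and appends
    one newly generated chunk, so the video can be extended to arbitrary
    length with constant memory per step.
    """
    frames = list(init_frames)
    for _ in range(n_chunks):
        cond = np.stack(frames[-context:])   # conditioning window
        new_chunk = denoise_chunk(cond)      # shape: (chunk_len, H, W, C)
        frames.extend(new_chunk)
    return np.stack(frames)
```

Pretraining directly on this continuation task (rather than only on fixed-length clips) is what the paper credits for keeping quality and coherence stable as the rollout grows.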
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Diffusion Transformer for unified multi-task video generation
Employs coarse-to-fine strategy with Block Sparse Attention
Applies multi-reward RLHF training for enhanced performance
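Of the innovations above, Block Sparse Attention is the most mechanical, and a generic version can be sketched: score key/value blocks cheaply with pooled queries, keep only the top-scoring fraction of blocks per query block, and run dense attention within that subset. This is an illustrative sketch of the general technique, not the paper's kernel; the pooled scoring rule, `keep_ratio`, and all names are assumptions.

```python
import numpy as np

def block_sparse_attention(q, k, v, block_size, keep_ratio=0.5):
    """Sketch of block-sparse attention over (n, d) query/key/value arrays.

    1. Pool each query/key block to a single vector and score block pairs.
    2. For each query block, keep the top `keep_ratio` fraction of key blocks.
    3. Run ordinary softmax attention against only the kept blocks.
    """
    n, d = q.shape
    nb = n // block_size
    qb = q.reshape(nb, block_size, d)
    kb = k.reshape(nb, block_size, d)
    vb = v.reshape(nb, block_size, d)

    # Coarse block-level scores from mean-pooled representations.
    q_pool = qb.mean(axis=1)                     # (nb, d)
    k_pool = kb.mean(axis=1)                     # (nb, d)
    coarse = q_pool @ k_pool.T / np.sqrt(d)      # (nb, nb)

    keep = max(1, int(np.ceil(nb * keep_ratio)))
    out = np.zeros((nb, block_size, d))
    for i in range(nb):
        top = np.argsort(coarse[i])[-keep:]      # highest-scoring key blocks
        ks = kb[top].reshape(-1, d)
        vs = vb[top].reshape(-1, d)
        scores = qb[i] @ ks.T / np.sqrt(d)
        scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
        w = np.exp(scores)
        w /= w.sum(axis=-1, keepdims=True)
        out[i] = w @ vs
    return out.reshape(n, d)
```

The payoff is quadratic-to-near-linear scaling in the number of blocks: at high resolutions the token count is large, so skipping most key blocks dominates the cost, which matches the paper's claim that the gains are most pronounced at 720p.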
Meituan LongCat Team
Xunliang Cai
Qilong Huang
Zhuoliang Kang
Hongyu Li
Shijun Liang
Liya Ma
University of Malaya
RF-MEMS, Printable electronics, Microelectronics
Siyu Ren
Shanghai Jiao Tong University
NLP
Xiaoming Wei
Meituan
computer vision, machine learning
Rixu Xie
Tong Zhang