LongCat-Video Technical Report

📅 2025-10-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address poor temporal coherence and low inference efficiency at high resolutions in long-video generation, this paper proposes a unified multi-task foundation model for long-video synthesis based on the Diffusion Transformer (DiT). Methodologically: (i) we design a unified DiT architecture supporting text-to-video, image-to-video, and video continuation; (ii) we enhance temporal modeling via pretraining and adopt a coarse-to-fine spatiotemporal generation strategy; (iii) we integrate block-sparse attention with multi-objective reward-based reinforcement learning from human feedback (RLHF) to improve fidelity and controllability. The 13.6B-parameter model generates minute-long videos at 720p resolution and 30 fps, achieving temporal coherence and visual quality competitive with leading closed- and open-source models. Code and pretrained weights are publicly released.

📝 Abstract
Video generation is a critical pathway toward world models, with efficient long-video inference as a key capability. Toward this end, we introduce LongCat-Video, a foundational video generation model with 13.6B parameters that delivers strong performance across multiple video generation tasks. It particularly excels in efficient, high-quality long video generation, representing our first step toward world models. Key features include:
Unified architecture for multiple tasks: built on the Diffusion Transformer (DiT) framework, LongCat-Video supports Text-to-Video, Image-to-Video, and Video-Continuation tasks with a single model.
Long video generation: pretraining on Video-Continuation tasks enables LongCat-Video to maintain high quality and temporal coherence when generating minutes-long videos.
Efficient inference: LongCat-Video generates 720p, 30 fps videos within minutes by employing a coarse-to-fine generation strategy along both the temporal and spatial axes; Block Sparse Attention further enhances efficiency, particularly at high resolutions.
Strong performance with multi-reward RLHF: multi-reward RLHF training enables LongCat-Video to achieve performance on par with the latest closed-source and leading open-source models.
Code and model weights are publicly available to accelerate progress in the field.
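The coarse-to-fine strategy mentioned in the abstract can be pictured as a staged schedule: draft the video at reduced spatial resolution and frame rate, then refine along both axes to the target. The function name, two-stage layout, and downscaling factors below are illustrative assumptions, not the paper's implementation.

```python
def coarse_to_fine_schedule(target_h=720, target_w=1280, target_fps=30,
                            spatial_factor=2, temporal_factor=2):
    """Hypothetical two-stage plan for coarse-to-fine video generation.

    Stage 1 ("draft") runs at reduced resolution and frame rate, which
    is cheap; stage 2 ("refine") upsamples along both the spatial and
    temporal axes to the final target. Factors are assumptions.
    """
    draft = {
        "h": target_h // spatial_factor,
        "w": target_w // spatial_factor,
        "fps": target_fps // temporal_factor,
    }
    refine = {"h": target_h, "w": target_w, "fps": target_fps}
    return [("draft", draft), ("refine", refine)]
```

With the default factors, a 720p/30 fps target is drafted at 360p/15 fps, so the expensive denoising passes touch far fewer tokens before the refinement stage.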
Problem

Research questions and friction points this paper is trying to address.

Developing efficient long video generation for world models
Creating unified architecture for multiple video generation tasks
Achieving high-quality temporal coherence in minutes-long videos
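The minutes-long generation problem above is commonly handled by autoregressive chunked continuation: each new chunk is generated conditioned on the frames already produced. A minimal sketch under that assumption follows; `denoise_chunk`, the context length, and the chunking scheme are hypothetical, not the paper's exact procedure.

```python
import numpy as np

def generate_long_video(init_frames, denoise_chunk, n_chunks, context=8):
    """Hypothetical autoregressive loop for long-video synthesis.

    Each iteration conditions on the last `context` frames and appends
    one newly generated chunk, so the video can be extended to arbitrary
    length with constant memory per step.
    """
    frames = list(init_frames)
    for _ in range(n_chunks):
        cond = np.stack(frames[-context:])   # conditioning window
        new_chunk = denoise_chunk(cond)      # shape: (chunk_len, H, W, C)
        frames.extend(new_chunk)
    return np.stack(frames)
```

Pretraining directly on this continuation task (rather than only on fixed-length clips) is what the paper credits for keeping quality and coherence stable as the rollout grows.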
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Diffusion Transformer for unified multi-task video generation
Employs coarse-to-fine strategy with Block Sparse Attention
Applies multi-reward RLHF training for enhanced performance
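Of the innovations above, Block Sparse Attention is the most mechanical, and a generic version can be sketched: score key/value blocks cheaply with pooled queries, keep only the top-scoring fraction of blocks per query block, and run dense attention within that subset. This is an illustrative sketch of the general technique, not the paper's kernel; the pooled scoring rule, `keep_ratio`, and all names are assumptions.

```python
import numpy as np

def block_sparse_attention(q, k, v, block_size, keep_ratio=0.5):
    """Sketch of block-sparse attention over (n, d) query/key/value arrays.

    1. Pool each query/key block to a single vector and score block pairs.
    2. For each query block, keep the top `keep_ratio` fraction of key blocks.
    3. Run ordinary softmax attention against only the kept blocks.
    """
    n, d = q.shape
    nb = n // block_size
    qb = q.reshape(nb, block_size, d)
    kb = k.reshape(nb, block_size, d)
    vb = v.reshape(nb, block_size, d)

    # Coarse block-level scores from mean-pooled representations.
    q_pool = qb.mean(axis=1)                     # (nb, d)
    k_pool = kb.mean(axis=1)                     # (nb, d)
    coarse = q_pool @ k_pool.T / np.sqrt(d)      # (nb, nb)

    keep = max(1, int(np.ceil(nb * keep_ratio)))
    out = np.zeros((nb, block_size, d))
    for i in range(nb):
        top = np.argsort(coarse[i])[-keep:]      # highest-scoring key blocks
        ks = kb[top].reshape(-1, d)
        vs = vb[top].reshape(-1, d)
        scores = qb[i] @ ks.T / np.sqrt(d)
        scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
        w = np.exp(scores)
        w /= w.sum(axis=-1, keepdims=True)
        out[i] = w @ vs
    return out.reshape(n, d)
```

The payoff is quadratic-to-near-linear scaling in the number of blocks: at high resolutions the token count is large, so skipping most key blocks dominates the cost, which matches the paper's claim that the gains are most pronounced at 720p.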
Meituan LongCat Team
Xunliang Cai
Qilong Huang
Zhuoliang Kang
Hongyu Li
Shijun Liang
Liya Ma
University of Malaya
RF-MEMS, Printable electronics, Microelectronics
Siyu Ren
Shanghai Jiao Tong University
NLP
Xiaoming Wei
Meituan
computer vision, machine learning
Rixu Xie
Tong Zhang