🤖 AI Summary
Existing video generation models suffer from excessive parameter counts, high inference costs, and insufficient motion coherence. To address these challenges, this work introduces a lightweight open-source video generation model with 8.3B parameters—the first to enable high-quality, unified text-to-video and image-to-video generation across multiple durations and resolutions on consumer-grade GPUs. Methodologically, the authors propose Selective Sliding Tile Attention (SSTA), integrate glyph-aware text encoding, and adopt a progressive training strategy to enhance motion modeling and bilingual (Chinese–English) comprehension. Built upon an enhanced DiT architecture, the model incorporates rigorous data curation, an efficient video super-resolution network, and an end-to-end optimization pipeline. Experiments demonstrate state-of-the-art visual quality and motion coherence among open-source models. The code and pretrained weights are fully open-sourced, significantly lowering barriers to research and practical deployment in video generation.
📝 Abstract
We present HunyuanVideo 1.5, a lightweight yet powerful open-source video generation model that achieves state-of-the-art visual quality and motion coherence with only 8.3 billion parameters, enabling efficient inference on consumer-grade GPUs. This achievement is built upon several key components, including meticulous data curation, an advanced DiT architecture featuring selective and sliding tile attention (SSTA), enhanced bilingual understanding through glyph-aware text encoding, progressive pre-training and post-training, and an efficient video super-resolution network. Leveraging these designs, we develop a unified framework capable of high-quality text-to-video and image-to-video generation across multiple durations and resolutions. Extensive experiments demonstrate that this compact and proficient model establishes a new state-of-the-art among open-source video generation models. By releasing the code and model weights, we provide the community with a high-performance foundation that lowers the barrier to video creation and research, making advanced video generation accessible to a broader audience. All open-source assets are publicly available at https://github.com/Tencent-Hunyuan/HunyuanVideo-1.5.