AMD-Hummingbird: Towards an Efficient Text-to-Video Model

📅 2025-03-24
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
To address the challenge of balancing high fidelity and computational efficiency for text-to-video (T2V) generation on resource-constrained devices (e.g., integrated GPUs, smartphones), this paper proposes a lightweight and efficient T2V framework. Methodologically, it introduces: (1) a novel visual-feedback-learning-driven structured pruning strategy for U-Net architectures; (2) an LLM- and VQA-guided pipeline for improving the quality of text prompts and video data; and (3) support for user-customizable fine-tuning and 26-frame video generation. The resulting model achieves a 50% parameter reduction (from 1.4B to 0.7B) and a 31× inference speedup, with end-to-end training on only four GPUs. Evaluated on the VBench benchmark, it attains the highest overall score, significantly advancing the practical deployment of T2V models on edge devices.
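The structured pruning described above can be illustrated with a minimal sketch. The function below ranks the output channels of a convolution weight by L1 magnitude and keeps the top fraction, halving the layer as in the paper's 1.4B-to-0.7B reduction. This is a generic magnitude heuristic for illustration only; the paper's actual criterion is driven by visual feedback learning, and `prune_channels` is a hypothetical helper, not part of the released code.

```python
import numpy as np

def prune_channels(weight, keep_ratio=0.5):
    """Structured pruning sketch (hypothetical helper, not the paper's code).

    weight: conv kernel of shape (out_ch, in_ch, kH, kW).
    Ranks output channels by L1 norm and keeps the top `keep_ratio` fraction,
    returning the pruned weight and the kept channel indices.
    """
    # L1 norm of each output channel's weights
    norms = np.abs(weight).reshape(weight.shape[0], -1).sum(axis=1)
    n_keep = max(1, int(round(weight.shape[0] * keep_ratio)))
    # indices of the n_keep largest-norm channels, kept in original order
    keep = np.sort(np.argsort(norms)[::-1][:n_keep])
    return weight[keep], keep

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 4, 3, 3))        # toy conv layer: 8 output channels
pruned, kept = prune_channels(w, keep_ratio=0.5)
print(pruned.shape)  # (4, 4, 3, 3) -- half the output channels remain
```

In a real U-Net, the matching input channels of the next layer would also be sliced so the pruned tensors stay shape-compatible.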

๐Ÿ“ Abstract
Text-to-Video (T2V) generation has attracted significant attention for its ability to synthesize realistic videos from textual descriptions. However, existing models struggle to balance computational efficiency and high visual quality, particularly on resource-limited devices, e.g., iGPUs and mobile phones. Most prior work prioritizes visual fidelity while overlooking the need for smaller, more efficient models suitable for real-world deployment. To address this challenge, we propose a lightweight T2V framework, termed Hummingbird, which prunes existing models and enhances visual quality through visual feedback learning. Our approach reduces the size of the U-Net from 1.4 billion to 0.7 billion parameters, significantly improving efficiency while preserving high-quality video generation. Additionally, we introduce a novel data processing pipeline that leverages Large Language Models (LLMs) and Video Quality Assessment (VQA) models to enhance the quality of both text prompts and video data. To support user-driven training and style customization, we publicly release the full training code, including data processing and model training. Extensive experiments show that our method achieves a 31× speedup compared to state-of-the-art models such as VideoCrafter2, while also attaining the highest overall score on VBench. Moreover, our method supports the generation of videos with up to 26 frames, addressing the limitations of existing U-Net-based methods in long video generation. Notably, the entire training process requires only four GPUs, yet delivers performance competitive with existing leading methods. Hummingbird presents a practical and efficient solution for T2V generation, combining high performance, scalability, and flexibility for real-world applications.
Problem

Research questions and friction points this paper is trying to address.

Balancing computational efficiency and visual quality in T2V models
Reducing model size while maintaining high-quality video generation
Enhancing text prompts and video data quality for better outputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight T2V framework that prunes existing models and restores quality via visual feedback learning
Novel data processing pipeline using LLMs and VQA models to curate prompts and videos
Support for user-driven training and style customization via released training code
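The LLM- and VQA-based curation idea in the bullets above can be sketched as a simple filter-and-rewrite loop: score each clip with a quality model, drop low-scoring clips, and refine the surviving captions with a language model. All names here (`filter_clips`, `rewrite_prompt`, `score_quality`, the threshold value) are hypothetical stand-ins for illustration, not the paper's released pipeline.

```python
def filter_clips(clips, rewrite_prompt, score_quality, quality_threshold=0.7):
    """Data-curation sketch (assumed interface, not the paper's code).

    clips: list of {"video": path, "caption": text} records.
    rewrite_prompt: callable standing in for an LLM caption refiner.
    score_quality: callable standing in for a VQA quality scorer in [0, 1].
    Keeps clips whose quality score passes the threshold and rewrites
    their captions.
    """
    kept = []
    for clip in clips:
        if score_quality(clip["video"]) >= quality_threshold:
            kept.append({**clip, "caption": rewrite_prompt(clip["caption"])})
    return kept

# Toy stand-ins: a score lookup plays the VQA model, str.title plays the LLM.
clips = [
    {"video": "a.mp4", "caption": "dog runs"},
    {"video": "b.mp4", "caption": "blurry clip"},
]
scores = {"a.mp4": 0.9, "b.mp4": 0.3}
out = filter_clips(clips, rewrite_prompt=str.title, score_quality=scores.get)
print(out)  # [{'video': 'a.mp4', 'caption': 'Dog Runs'}]
```

In practice the two callables would wrap real LLM and VQA model inference; the control flow of the pipeline stays the same.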