AMD-Hummingbird: Towards an Efficient Text-to-Video Model

📅 2025-03-24

🤖 AI Summary

To balance visual fidelity against computational cost for text-to-video (T2V) generation on resource-constrained devices (e.g., integrated GPUs and smartphones), this paper proposes Hummingbird, a lightweight T2V framework. Its contributions are: (1) pruning of an existing U-Net, with visual feedback learning used to enhance visual quality; (2) a data processing pipeline that uses LLMs and Video Quality Assessment (VQA) models to improve both text prompts and video data; and (3) publicly released training code for user-driven training and style customization, along with generation of videos of up to 26 frames. The resulting model halves the U-Net's parameter count (from 1.4B to 0.7B), runs 31× faster than VideoCrafter2, and is trained on only four GPUs. It also attains the highest overall score on VBench, which makes deploying T2V models on edge devices more practical.

📝 Abstract

Text-to-Video (T2V) generation has attracted significant attention for its ability to synthesize realistic videos from textual descriptions. However, existing models struggle to balance computational efficiency and high visual quality, particularly on resource-limited devices, e.g., iGPUs and mobile phones. Most prior work prioritizes visual fidelity while overlooking the need for smaller, more efficient models suitable for real-world deployment. To address this challenge, we propose a lightweight T2V framework, termed Hummingbird, which prunes existing models and enhances visual quality through visual feedback learning. Our approach reduces the size of the U-Net from 1.4 billion to 0.7 billion parameters, significantly improving efficiency while preserving high-quality video generation. Additionally, we introduce a novel data processing pipeline that leverages Large Language Models (LLMs) and Video Quality Assessment (VQA) models to enhance the quality of both text prompts and video data. To support user-driven training and style customization, we publicly release the full training code, including data processing and model training. Extensive experiments show that our method achieves a 31× speedup compared to state-of-the-art models such as VideoCrafter2, while also attaining the highest overall score on VBench. Moreover, our method supports the generation of videos with up to 26 frames, addressing the limitations of existing U-Net-based methods in long video generation. Notably, the entire training process requires only four GPUs, yet delivers performance competitive with existing leading methods. Hummingbird presents a practical and efficient solution for T2V generation, combining high performance, scalability, and flexibility for real-world applications.

Problem

Research questions and friction points this paper is trying to address.

Balancing computational efficiency and visual quality in T2V models

Reducing model size while maintaining high-quality video generation

Enhancing text prompts and video data quality for better outputs
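
To make the size reduction concrete, the short Python sketch below works through the arithmetic behind the 1.4B to 0.7B figure quoted in the abstract. The parameter counts come from the abstract; the 2-bytes-per-parameter (fp16) storage figure and the helper names `reduction_ratio` and `weight_memory_gb` are assumptions added for illustration, and the estimate covers weights only, not activations or other model components.

```python
def reduction_ratio(before: float, after: float) -> float:
    """Fraction of parameters removed by pruning."""
    return (before - after) / before


def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight storage in GB (10^9 bytes), ignoring activations.

    bytes_per_param=2 assumes fp16 weights; this is an illustrative
    assumption, not a figure taken from the paper.
    """
    return num_params * bytes_per_param / 1e9


original_params = 1.4e9  # U-Net size before pruning (from the abstract)
pruned_params = 0.7e9    # U-Net size after pruning (from the abstract)

removed = reduction_ratio(original_params, pruned_params)
mem_before = weight_memory_gb(original_params)
mem_after = weight_memory_gb(pruned_params)

print(f"parameters removed:  {removed:.0%}")
print(f"fp16 weights before: {mem_before:.1f} GB")
print(f"fp16 weights after:  {mem_after:.1f} GB")
```

Under these assumptions the pruned U-Net needs roughly half the weight storage of the original, which is the kind of saving that matters on iGPUs and phones with limited memory.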

Innovation

Methods, ideas, or system contributions that make the work stand out.

Lightweight T2V framework prunes existing models

Novel data processing pipeline uses LLMs and VQA

Supports user-driven training and style customization
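
The data processing idea can be pictured as a filter-then-rewrite loop: a quality model scores each video, low-scoring clips are dropped, and a language model rewrites the captions of the remaining clips. The Python sketch below illustrates that general shape only. It is not the paper's released pipeline: `Clip`, `curate`, the threshold `min_quality`, and the toy scoring and rewriting functions are all hypothetical stand-ins, and the ordering of the two steps is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Clip:
    path: str      # location of the video file
    caption: str   # raw text prompt paired with the video


def curate(
    clips: List[Clip],
    quality_score: Callable[[Clip], float],
    rewrite_caption: Callable[[str], str],
    min_quality: float,
) -> List[Clip]:
    """Drop low-quality clips, then rewrite the captions of the survivors."""
    curated = []
    for clip in clips:
        if quality_score(clip) < min_quality:
            continue  # filtered out by the (hypothetical) quality model
        curated.append(Clip(clip.path, rewrite_caption(clip.caption)))
    return curated


# Toy stand-ins so the sketch runs without any model (hypothetical values).
toy_scores = {"a.mp4": 0.9, "b.mp4": 0.2, "c.mp4": 0.7}


def toy_quality(clip: Clip) -> float:
    # A real pipeline would call a Video Quality Assessment model here.
    return toy_scores[clip.path]


def toy_rewrite(caption: str) -> str:
    # A real pipeline would call an LLM here; this only tidies whitespace.
    return " ".join(caption.split()).capitalize()


clips = [
    Clip("a.mp4", "a  hummingbird hovering"),
    Clip("b.mp4", "blurry footage"),
    Clip("c.mp4", "city street at night"),
]
curated = curate(clips, toy_quality, toy_rewrite, min_quality=0.5)
print([c.path for c in curated])
```

Passing the scorer and rewriter in as callables keeps the loop independent of any particular VQA model or LLM, so either can be swapped without touching the filtering logic.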

🔎 Similar Papers

Chrono: A Simple Blueprint for Representing Time in MLLMs