ModelGrow: Continual Text-to-Video Pre-training with Model Expansion and Language Understanding Enhancement

📅 2024-12-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address key challenges in text-to-video (T2V) generation—including prohibitively high pre-training costs, suboptimal performance under resource constraints, and weak comprehension of complex instructions—this paper proposes ModelGrow, a continual general pre-training framework and the first systematic exploration of continual pre-training for T2V models. ModelGrow introduces a self-adaptive model-capacity expansion mechanism, comprising parameter-incremental injection and hierarchical architectural expansion, and integrates a large language model as a plug-and-play high-level text encoder to strengthen semantic understanding. Coupled with a multi-stage continual training strategy and cross-modal alignment objectives, it significantly improves fine-grained text responsiveness, long-term temporal coherence, and semantic fidelity. ModelGrow achieves state-of-the-art performance across multiple T2V benchmarks, surpassing all baselines. The code and models will be publicly released.
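The parameter-incremental expansion mentioned above can be illustrated with a function-preserving widening trick (a Net2Net-style sketch under assumed details, not the paper's exact scheme): newly added hidden units get zero-initialized weights, so the expanded layer initially reproduces the pretrained layer's output and the extra capacity is free to absorb new knowledge during continual training.

```python
import numpy as np

def widen_linear(W, b, new_width):
    """Expand a linear layer's output dimension to new_width.

    Pretrained rows are copied; new rows are zero-initialized, so the
    expanded layer's first outputs match the original layer exactly.
    (Illustrative sketch; ModelGrow's actual expansion may differ.)
    """
    old_width, in_dim = W.shape
    W_new = np.zeros((new_width, in_dim))
    b_new = np.zeros(new_width)
    W_new[:old_width] = W          # keep pretrained weights intact
    b_new[:old_width] = b
    return W_new, b_new

# A tiny pretrained 2-unit layer and a sample input
W = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([0.5, -0.5])
x = np.array([1.0, 1.0])

W_big, b_big = widen_linear(W, b, new_width=4)
y_old = W @ x + b                    # original output
y_new = W_big @ x + b_big            # expanded output
assert np.allclose(y_new[:2], y_old)  # behavior preserved on old units
assert np.allclose(y_new[2:], 0.0)    # new units start inert
```

Zero (or near-zero) initialization of the new parameters is what makes the expansion "continual": training can resume from the pretrained model's exact function rather than from scratch.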

📝 Abstract
Text-to-video (T2V) generation has gained significant attention recently. However, the costs of training a T2V model from scratch remain persistently high, and there is considerable room for improving the generation performance, especially under limited computation resources. This work explores the continual general pre-training of text-to-video models, enabling the model to "grow" its abilities based on a pre-trained foundation, analogous to how humans acquire new knowledge based on past experiences. There is a lack of extensive study of the continual pre-training techniques in T2V generation. In this work, we take the initial step toward exploring this task systematically and propose ModelGrow. Specifically, we break this task into two key aspects: increasing model capacity and improving semantic understanding. For model capacity, we introduce several novel techniques to expand the model size, enabling it to store new knowledge and improve generation performance. For semantic understanding, we propose a method that leverages large language models as advanced text encoders, integrating them into T2V models to enhance language comprehension and guide generation results according to detailed prompts. This approach enables the model to achieve better semantic alignment, particularly in response to complex user prompts. Extensive experiments demonstrate the effectiveness of our method across various metrics. The source code and the model of ModelGrow will be publicly available.
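The "LLM as advanced text encoder" idea in the abstract can be sketched as a plug-and-play adapter: frozen LLM hidden states for the prompt are mapped by a small trainable projection into the conditioning space the video model already consumes. The dimensions and the linear adapter below are assumptions for illustration; the paper's actual integration design is not detailed on this page.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: LLM hidden dim 4096, T2V conditioning dim 1024
llm_dim, cond_dim, seq_len = 4096, 1024, 8

# Stand-in for frozen LLM token states of a user prompt
llm_states = rng.standard_normal((seq_len, llm_dim))

# Small trainable projection adapting LLM features to the video backbone;
# scaled init keeps the projected activations at a reasonable magnitude
W_proj = rng.standard_normal((llm_dim, cond_dim)) * (llm_dim ** -0.5)

# Conditioning tokens fed to the T2V model's cross-attention layers
cond_tokens = llm_states @ W_proj
assert cond_tokens.shape == (seq_len, cond_dim)
```

Because only the projection is trained while the LLM stays frozen, the language understanding can be upgraded without re-running full T2V pre-training, which is the resource-constrained setting the abstract targets.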
Problem

Research questions and friction points this paper is trying to address.

Text-to-Video Generation
Resource Constraints
Complex Instruction Understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

ModelGrow
Text-to-Video Optimization
Large-scale Language Model Integration