🤖 AI Summary
Text-to-video generation models struggle to incorporate fine-grained, physically grounded camera parameters such as shutter speed and aperture, because high-fidelity real-world video datasets annotated with these parameters are prohibitively expensive to collect and therefore scarce.
Method: We propose a counterintuitive yet highly effective fine-tuning paradigm that relies only on sparse, low-fidelity synthetic data generated via lightweight procedural rendering, and achieves better controllability than methods trained on scarce high-quality real data. The approach introduces a diffusion-based controllable fine-tuning framework that integrates parameter-aware conditioning with efficient synthetic data generation, and it provides a theoretical analysis of why lower-quality synthetic data can yield better generalization for physical parameter control.
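To make the "lightweight procedural rendering" idea concrete, here is a minimal sketch of how such synthetic clips could be produced: simple moving primitives are rendered, motion blur is baked in as a function of shutter speed, and depth-of-field blur as a function of aperture, so every clip comes with exact parameter labels for free. This is an illustrative assumption, not the authors' actual renderer; `render_subframe`, `render_clip`, and the parameter ranges are hypothetical.

```python
# Illustrative sketch only (not the paper's renderer): procedurally generate short
# clips of a moving primitive with shutter speed and aperture baked in.
import numpy as np
from scipy.ndimage import gaussian_filter

def render_subframe(t, size=128):
    """Render one instantaneous frame: a bright square moving left to right."""
    img = np.zeros((size, size), dtype=np.float32)
    x = int((0.1 + 0.8 * t) * size)  # horizontal position at normalized time t
    img[size // 2 - 8 : size // 2 + 8, max(x - 8, 0) : min(x + 8, size)] = 1.0
    return img

def render_clip(shutter_speed, aperture, n_frames=16, subsamples=8):
    """shutter_speed in [0, 1]: fraction of the frame interval the shutter is open
    (longer -> stronger motion blur). aperture in [0, 1]: wider -> shallower depth
    of field, approximated here as a Gaussian defocus blur."""
    frames = []
    for f in range(n_frames):
        # Motion blur: average instantaneous renders across the exposure window.
        exposure = [render_subframe((f + shutter_speed * s / subsamples) / n_frames)
                    for s in range(subsamples)]
        frame = np.mean(exposure, axis=0)
        # Defocus blur scaled by aperture.
        frame = gaussian_filter(frame, sigma=3.0 * aperture)
        frames.append(frame)
    return np.stack(frames)  # (n_frames, H, W), paired with its (shutter, aperture) label

# Every synthetic clip carries exact ground-truth parameter annotations:
clip = render_clip(shutter_speed=0.7, aperture=0.3)
```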
Results: Extensive experiments demonstrate significant improvements over real-data baselines across multiple quantitative metrics (e.g., parameter fidelity and temporal consistency) and in human evaluations, establishing the first method to enable precise, robust control over physical camera parameters in text-to-video synthesis.
📝 Abstract
Fine-tuning large-scale text-to-video diffusion models to add new generative controls, such as those over physical camera parameters (e.g., shutter speed or aperture), typically requires vast, high-fidelity datasets that are difficult to acquire. In this work, we propose a data-efficient fine-tuning strategy that learns these controls from sparse, low-quality synthetic data. We show that fine-tuning on such simple data not only enables the desired controls, but actually yields results superior to those of models fine-tuned on photorealistic "real" data. Beyond demonstrating these results, we provide a framework that justifies this phenomenon both intuitively and quantitatively.
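One plausible way the new controls could be wired into the diffusion backbone during fine-tuning is to embed the scalar camera parameters with a small MLP and add the result to the conditioning signal the denoiser already consumes, alongside its timestep embedding. The sketch below is an assumption about the conditioning mechanism, not the paper's confirmed architecture; `CameraParamEmbedding` and the embedding dimension are hypothetical.

```python
# Hedged sketch of parameter-aware conditioning: scalar camera parameters are
# embedded and added to the timestep embedding fed to the video diffusion denoiser.
import torch
import torch.nn as nn

class CameraParamEmbedding(nn.Module):
    def __init__(self, n_params=2, dim=1280):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_params, dim), nn.SiLU(), nn.Linear(dim, dim)
        )

    def forward(self, params):  # params: (batch, 2) = (shutter_speed, aperture)
        return self.mlp(params)

# During fine-tuning on the labeled synthetic clips, every denoising step
# "sees" the requested camera settings through the summed conditioning vector:
cam_embed = CameraParamEmbedding()
params = torch.tensor([[0.7, 0.3]])       # shutter_speed, aperture for one clip
t_emb = torch.randn(1, 1280)              # placeholder timestep embedding
conditioning = t_emb + cam_embed(params)  # passed to the denoiser's residual blocks
```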