TransPixar: Advancing Text-to-Video Generation with Transparency

📅 2025-01-06

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing video generation methods struggle to produce high-fidelity RGBA video when training data is scarce, which limits their use in visual effects (VFX) and other compositing tasks that need precise alpha-channel control. TransPixar addresses this with a diffusion-based framework for text-to-RGBA video generation. It extends a pretrained diffusion transformer (DiT) with alpha-specific tokens and generates RGB and alpha jointly so that the two stay consistent. LoRA-based fine-tuning and an optimized attention mechanism let the model produce RGB and alpha sequences together from limited RGBA training data while preserving the RGB capabilities of the original model. The paper reports strong alignment between the RGB and alpha channels, which makes the generated videos better suited to professional compositing pipelines.
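To make the LoRA idea in the summary concrete, here is a minimal NumPy sketch of a generic low-rank adapter on a single linear layer. It is not TransPixar's actual configuration: the layer size, rank, scaling value, and all variable names are assumptions chosen for illustration. It shows why this style of fine-tuning can leave a pretrained model's behavior intact at the start of training: the up-projection starts at zero, so the adapted layer initially reproduces the frozen layer exactly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; real DiT layers are far larger.
d_model, rank = 8, 2
lora_alpha = 4.0  # LoRA scaling hyperparameter, unrelated to the alpha channel

W = rng.normal(size=(d_model, d_model))       # frozen pretrained weight
A = rng.normal(size=(rank, d_model)) * 0.01   # trainable down-projection
B = np.zeros((d_model, rank))                 # trainable up-projection, zero-initialized

def lora_linear(x, W, A, B, scale):
    """Frozen linear layer plus a scaled low-rank update: x W^T + scale * x A^T B^T."""
    return x @ W.T + scale * (x @ A.T) @ B.T

x = rng.normal(size=(3, d_model))             # three hypothetical tokens
y_frozen = x @ W.T
y_lora = lora_linear(x, W, A, B, lora_alpha / rank)

# Only A and B would be updated during fine-tuning.
trainable_params = A.size + B.size
frozen_params = W.size
```

Because `B` is all zeros before training, `y_lora` matches `y_frozen`; fine-tuning then moves only the small matrices `A` and `B`, which is what keeps this approach lightweight when annotated RGBA data is limited.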

📝 Abstract

Text-to-video generative models have made significant strides, enabling diverse applications in entertainment, advertising, and education. However, generating RGBA video, which includes alpha channels for transparency, remains a challenge due to limited datasets and the difficulty of adapting existing models. Alpha channels are crucial for visual effects (VFX), allowing transparent elements like smoke and reflections to blend seamlessly into scenes. We introduce TransPixar, a method to extend pretrained video models for RGBA generation while retaining the original RGB capabilities. TransPixar leverages a diffusion transformer (DiT) architecture, incorporating alpha-specific tokens and using LoRA-based fine-tuning to jointly generate RGB and alpha channels with high consistency. By optimizing attention mechanisms, TransPixar preserves the strengths of the original RGB model and achieves strong alignment between RGB and alpha channels despite limited training data. Our approach effectively generates diverse and consistent RGBA videos, advancing the possibilities for VFX and interactive content creation.
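The role of the alpha channel in compositing can be illustrated with the standard "over" operation, in which each output pixel is `alpha * foreground + (1 - alpha) * background`. The sketch below is a generic illustration and not code from the paper: it assumes straight (non-premultiplied) alpha, float values in [0, 1], and a hypothetical 2x2 frame.

```python
import numpy as np

def composite_over(rgba_frame, background):
    """Blend one straight-alpha RGBA frame over an RGB background.

    rgba_frame: float array of shape (H, W, 4), values in [0, 1]
    background: float array of shape (H, W, 3), values in [0, 1]
    """
    rgb = rgba_frame[..., :3]
    alpha = rgba_frame[..., 3:4]  # keep the last axis so alpha broadcasts over RGB
    return alpha * rgb + (1.0 - alpha) * background

# Hypothetical frame: half-transparent white "smoke" over a black background.
frame = np.ones((2, 2, 4))
frame[..., 3] = 0.5
background = np.zeros((2, 2, 3))

out = composite_over(frame, background)
```

For a generated RGBA video, the same function would be applied frame by frame. Any mismatch between the RGB content and its alpha matte shows up directly in the blended result, which is why consistency between the two matters for VFX use.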

Problem

Research questions and friction points this paper is trying to address.

Video Generation

High-Quality Transparency (RGBA)

Data Scarcity

Innovation

Methods, ideas, or system contributions that make the work stand out.

TransPixar

RGBA video generation

diffusion transformer (DiT)
