🤖 AI Summary
Existing image-to-video models struggle to jointly model the alpha channel with RGB, resulting in low-quality animated text with poor transparency rendering and entangled features. This work proposes TransText, a framework built on an Alpha-as-RGB paradigm that embeds the alpha channel as an RGB-compatible visual signal, without retraining the VAE or modifying the pretrained generative manifold. Through spatial concatenation in latent space, TransText preserves RGB semantic priors while enforcing strict alignment between the RGB and alpha channels, thereby avoiding feature entanglement. Experimental results show that TransText significantly outperforms baseline methods, generating high-fidelity, temporally coherent text animations with fine-grained transparency effects.
📝 Abstract
We introduce the first method, to the best of our knowledge, for adapting image-to-video models to layer-aware text (glyph) animation, a capability critical for practical dynamic visual design. Existing approaches predominantly treat transparency (the alpha channel) as an extra latent dimension appended to the RGB space, which requires retraining the underlying RGB-centric variational autoencoder (VAE). However, retraining the VAE is computationally expensive and, given the scarcity of high-quality transparent glyph data, may erode the robust semantic priors learned from massive RGB corpora, potentially leading to latent pattern mixing. To mitigate these limitations, we propose TransText, a framework based on a novel Alpha-as-RGB paradigm that jointly models appearance and transparency without modifying the pre-trained generative manifold. TransText embeds the alpha channel as an RGB-compatible visual signal through latent spatial concatenation, explicitly ensuring strict cross-modal (RGB-and-alpha) consistency while preventing feature entanglement. Our experiments demonstrate that TransText significantly outperforms baselines, generating coherent, high-fidelity transparent animations with diverse, fine-grained effects.
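The Alpha-as-RGB idea can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the average-pooling stand-in for the frozen VAE encoder, the downsampling factor, and the choice of the width axis for concatenation are all assumptions made for illustration. It only shows the data flow the abstract describes: the alpha matte is rendered as a three-channel image, passed through the same unmodified RGB encoder, and placed next to the RGB latent spatially rather than stacked as an extra channel.

```python
import numpy as np

def alpha_as_rgb(alpha):
    """Replicate a single-channel alpha video (T, H, W) into 3 channels (T, H, W, 3)."""
    return np.repeat(alpha[..., None], 3, axis=-1)

def toy_encode(video, factor=8):
    """Hypothetical stand-in for a frozen RGB VAE encoder: average-pool by `factor`."""
    t, h, w, c = video.shape
    blocks = video.reshape(t, h // factor, factor, w // factor, factor, c)
    return blocks.mean(axis=(2, 4))

def concat_latents(rgb_latent, alpha_latent, axis=2):
    """Place the two latents side by side along a spatial axis (width, by assumption)."""
    return np.concatenate([rgb_latent, alpha_latent], axis=axis)

def split_latents(joint, axis=2):
    """Recover the RGB and alpha halves of a spatially concatenated latent."""
    rgb_latent, alpha_latent = np.split(joint, 2, axis=axis)
    return rgb_latent, alpha_latent

rng = np.random.default_rng(0)
rgb = rng.random((4, 64, 64, 3))   # 4 RGB frames
alpha = rng.random((4, 64, 64))    # matching alpha mattes in [0, 1)

z_rgb = toy_encode(rgb)                    # same encoder for both inputs
z_alpha = toy_encode(alpha_as_rgb(alpha))
joint = concat_latents(z_rgb, z_alpha)
print(joint.shape)  # (4, 8, 16, 3)
```

Because the two halves occupy disjoint spatial regions of one latent, they can be split apart exactly after generation, and the channel count seen by the pretrained model is unchanged.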