🤖 AI Summary
Existing text-to-image (T2I) diffusion models generate temporally incoherent videos because each frame is modeled independently; existing coherence-enhancement methods often require fine-tuning large models or incur prohibitive computational overhead. This paper proposes GE-Adapter, a lightweight framework for cost-effective, high-fidelity text-driven video editing that requires no fine-tuning of pre-trained T2I models. The approach introduces three core modules — Frame-based Temporal Consistency (FTC) Blocks, Channel-dependent Spatial Consistency (SCD) Blocks, and a Token-based Semantic Consistency (TSC) Module — integrated with bilateral DDIM inversion, a temporally-aware loss, bilateral-filtering enhancement, and a hybrid prompt mechanism combining shared and frame-specific tokens. Evaluated on MSR-VTT, GE-Adapter significantly improves perceptual quality, text-video alignment, frame fidelity, and transition smoothness, while achieving markedly better temporal coherence than existing low-cost alternatives.
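The summary does not spell out the temporally-aware loss. A minimal sketch of one common formulation is a squared-difference penalty between consecutive frame features; the function name, feature shapes, and exact formulation below are illustrative assumptions, not the paper's definition:

```python
def temporal_consistency_loss(frames):
    """Mean squared difference between consecutive frame feature vectors.

    `frames` is a list of equal-length feature vectors, one per frame.
    Penalizing the change between frame t-1 and frame t discourages
    abrupt frame-to-frame transitions. This is a generic sketch of a
    temporal-smoothness term, not GE-Adapter's actual loss.
    """
    assert len(frames) >= 2, "need at least two frames"
    total = 0.0
    for prev, cur in zip(frames, frames[1:]):
        total += sum((c - p) ** 2 for p, c in zip(prev, cur)) / len(cur)
    return total / (len(frames) - 1)
```

In practice such a term would be computed on latent or feature maps and added, with a weighting coefficient, to the main editing objective.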
📝 Abstract
Recent advances in text-to-image (T2I) generation with diffusion models have enabled cost-effective video editing by leveraging pre-trained models, eliminating the need for resource-intensive training. However, the frame independence of T2I generation often results in poor temporal consistency. Existing methods address this through temporal-layer fine-tuning or inference-time temporal propagation, but these approaches suffer from high training costs or limited temporal coherence. To address these challenges, we propose a General and Efficient Adapter (GE-Adapter) that integrates temporal-spatial and semantic consistency with bilateral DDIM inversion. The framework introduces three key components: (1) Frame-based Temporal Consistency Blocks (FTC Blocks), which capture frame-specific features and enforce smooth inter-frame transitions via temporally-aware loss functions; (2) Channel-dependent Spatial Consistency Blocks (SCD Blocks), which employ bilateral filters to enhance spatial coherence by reducing noise and artifacts; and (3) a Token-based Semantic Consistency Module (TSC Module), which maintains semantic alignment using shared and frame-specific prompt tokens. Our method significantly improves perceptual quality, text-image alignment, and temporal coherence, as demonstrated on the MSR-VTT dataset. Additionally, it achieves enhanced fidelity and frame-to-frame coherence, offering a practical solution for text-to-video (T2V) editing.
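The bilateral filtering used by the SCD Blocks is a standard edge-preserving smoother: each sample is averaged with its neighbors, weighted both by spatial distance and by intensity difference, so noise is suppressed without blurring edges. A 1-D sketch follows; the parameter values are illustrative assumptions, and the paper applies the same idea to 2-D frame latents:

```python
import math

def bilateral_filter_1d(signal, sigma_s=1.0, sigma_r=0.5, radius=2):
    """Edge-preserving smoothing of a 1-D signal.

    Each output sample is a weighted average of nearby input samples.
    Weights fall off with spatial distance (controlled by sigma_s) and
    with intensity difference (controlled by sigma_r), so smoothing
    happens within flat regions but not across sharp edges.
    """
    out = []
    n = len(signal)
    for i in range(n):
        num, den = 0.0, 0.0
        for j in range(max(0, i - radius), min(n, i + radius + 1)):
            w = math.exp(-((i - j) ** 2) / (2 * sigma_s ** 2)
                         - ((signal[i] - signal[j]) ** 2) / (2 * sigma_r ** 2))
            num += w * signal[j]
            den += w
        out.append(num / den)
    return out
```

With a small `sigma_r`, a step edge (e.g. `[0, 0, 0, 10, 10, 10]`) passes through nearly unchanged, because cross-edge weights are vanishingly small, which is the spatial-coherence behavior the abstract attributes to the SCD Blocks.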