🤖 AI Summary
Existing video caption removal methods rely on explicit masks, which are difficult to deploy and compromise temporal consistency. This work proposes CLEAR, the first end-to-end, mask-free framework for caption removal. CLEAR employs a dual-encoder architecture with a self-supervised orthogonality constraint to disentangle caption and background representations, and integrates LoRA fine-tuning with a generative feedback mechanism to dynamically optimize context-aware reconstruction. By training only 0.77% of the base diffusion model’s parameters, CLEAR achieves a 6.77 dB PSNR improvement and a 74.7% reduction in VFID on a Chinese caption benchmark, while demonstrating strong zero-shot generalization across six additional languages—English, Korean, French, Japanese, Russian, and German.
📝 Abstract
Video subtitle removal aims to distinguish text overlays from background content while preserving temporal coherence. Existing diffusion-based methods require explicit mask sequences during both training and inference, which restricts their practical deployment. In this paper, we present CLEAR (Context-aware Learning for End-to-end Adaptive Video Subtitle Removal), a mask-free framework that achieves truly end-to-end inference through context-aware adaptive learning. Our two-stage design decouples prior extraction from generative refinement: Stage I learns disentangled subtitle representations via self-supervised orthogonality constraints on dual encoders, while Stage II employs LoRA-based adaptation with generation feedback for dynamic context adjustment. Notably, our method trains only 0.77% of the base diffusion model's parameters. On Chinese subtitle benchmarks, CLEAR outperforms mask-dependent baselines by +6.77 dB PSNR and -74.7% VFID, and demonstrates superior zero-shot generalization across six languages (English, Korean, French, Japanese, Russian, German), enabled by our generation-driven feedback mechanism, which ensures robust subtitle removal without ground-truth masks at inference.
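The orthogonality constraint in Stage I can be illustrated with a minimal sketch: penalize the cosine similarity between features from the subtitle encoder and the background encoder so that the two representations disentangle. The function name, feature shapes, and loss form here are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def orthogonality_loss(subtitle_feats, background_feats, eps=1e-8):
    """Hypothetical disentanglement loss: mean squared cosine similarity
    between per-sample features (batch, dim) from the two encoders.
    It is 0 when every feature pair is orthogonal, ~1 when parallel."""
    s = subtitle_feats / (np.linalg.norm(subtitle_feats, axis=1, keepdims=True) + eps)
    b = background_feats / (np.linalg.norm(background_feats, axis=1, keepdims=True) + eps)
    cos = np.sum(s * b, axis=1)          # cosine similarity per sample
    return float(np.mean(cos ** 2))      # squared, so sign does not matter

# Orthogonal feature pairs give ~0 loss; identical features give ~1.
f1 = np.array([[1.0, 0.0], [0.0, 1.0]])
f2 = np.array([[0.0, 1.0], [1.0, 0.0]])
print(orthogonality_loss(f1, f2))  # ≈ 0.0
print(orthogonality_loss(f1, f1))  # ≈ 1.0
```

Minimizing such a loss alongside the reconstruction objective pushes the dual encoders toward complementary, non-overlapping representations of text and background.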