CLEAR: Context-Aware Learning with End-to-End Mask-Free Inference for Adaptive Video Subtitle Removal

📅 2026-03-23
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing video subtitle removal methods rely on explicit masks, which are difficult to deploy and compromise temporal consistency. This work proposes CLEAR, the first end-to-end, mask-free framework for subtitle removal. CLEAR employs a dual-encoder architecture with a self-supervised orthogonality constraint to disentangle subtitle and background representations, and integrates LoRA fine-tuning with a generative feedback mechanism for dynamic, context-aware reconstruction. By training only 0.77% of the base diffusion model’s parameters, CLEAR achieves a 6.77 dB PSNR improvement and a 74.7% reduction in VFID on a Chinese subtitle benchmark, while demonstrating strong zero-shot generalization to six additional languages: English, Korean, French, Japanese, Russian, and German.
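The self-supervised orthogonality constraint mentioned above can be sketched as a loss that penalises overlap between the two encoders' per-frame features. This is a minimal illustration only: the function name, feature shapes, and the use of squared cosine similarity are assumptions for exposition, not the paper's exact formulation.

```python
import numpy as np

def disentanglement_loss(subtitle_feats: np.ndarray,
                         background_feats: np.ndarray) -> float:
    """Hypothetical orthogonality penalty between dual-encoder features.

    Both inputs are (num_frames, feat_dim) matrices. Each row is
    L2-normalised, and the loss is the mean squared cosine similarity
    between corresponding rows: 0 when the subtitle and background
    features are orthogonal, 1 when they are parallel.
    """
    s = subtitle_feats / np.linalg.norm(subtitle_feats, axis=1, keepdims=True)
    b = background_feats / np.linalg.norm(background_feats, axis=1, keepdims=True)
    # squared cosine similarity per frame, averaged over frames
    return float(np.mean(np.sum(s * b, axis=1) ** 2))
```

Minimising such a term during Stage I training would push the subtitle encoder and background encoder toward representing complementary content, which is the disentanglement the summary describes.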

📝 Abstract
Video subtitle removal aims to distinguish text overlays from background content while preserving temporal coherence. Existing diffusion-based methods require explicit mask sequences during both training and inference, which restricts their practical deployment. In this paper, we present CLEAR (Context-aware Learning for End-to-end Adaptive Video Subtitle Removal), a mask-free framework that achieves truly end-to-end inference through context-aware adaptive learning. Our two-stage design decouples prior extraction from generative refinement: Stage I learns disentangled subtitle representations via self-supervised orthogonality constraints on dual encoders, while Stage II employs LoRA-based adaptation with generation feedback for dynamic context adjustment. Notably, our method trains only 0.77% of the base diffusion model’s parameters. On Chinese subtitle benchmarks, CLEAR outperforms mask-dependent baselines by +6.77 dB PSNR and −74.7% VFID, and demonstrates superior zero-shot generalization across six languages (English, Korean, French, Japanese, Russian, German), enabled by our generation-driven feedback mechanism that ensures robust subtitle removal without ground-truth masks at inference time.
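As a back-of-envelope illustration of why LoRA keeps the trainable fraction so small: a frozen dense weight W of shape (d_out, d_in) is adapted by a rank-r update B @ A, so only r·(d_out + d_in) parameters are trained instead of d_out·d_in. The layer sizes, rank, and `scale` factor below are hypothetical; the paper's reported 0.77% depends on its actual adapter placement and rank, which the abstract does not specify.

```python
import numpy as np

def lora_param_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Fraction of parameters trained when a dense (d_out, d_in) weight
    is frozen and adapted with a rank-`rank` LoRA update W + B @ A,
    where B has shape (d_out, rank) and A has shape (rank, d_in)."""
    base = d_out * d_in
    lora = rank * (d_out + d_in)
    return lora / base

def lora_forward(x: np.ndarray, W: np.ndarray,
                 A: np.ndarray, B: np.ndarray,
                 scale: float = 1.0) -> np.ndarray:
    """Forward pass through a LoRA-adapted linear layer:
    the frozen base weight plus a scaled low-rank correction."""
    return x @ (W + scale * (B @ A)).T
```

For example, a 1024×1024 layer adapted at rank 4 trains 4·(1024+1024)/1024² ≈ 0.78% of that layer's parameters, the same order of magnitude as the 0.77% the paper reports for its full model.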
Problem

Research questions and friction points this paper is trying to address.

video subtitle removal
mask-free inference
temporal coherence
practical deployment
diffusion-based methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

mask-free inference
context-aware learning
LoRA-based adaptation
self-supervised disentanglement
zero-shot generalization
🔎 Similar Papers
2024-02-20 · International Conference on Machine Learning · Citations: 30