🤖 AI Summary
Existing video caption removal methods rely on explicit masks, which are difficult to deploy and compromise temporal consistency. This work proposes CLEAR, the first end-to-end, mask-free framework for caption removal. CLEAR employs a dual-encoder architecture with a self-supervised orthogonality constraint to disentangle caption and background representations, and integrates LoRA fine-tuning with a generative feedback mechanism to dynamically optimize context-aware reconstruction. By training only 0.77% of the base diffusion model’s parameters, CLEAR achieves a 6.77 dB PSNR improvement and a 74.7% reduction in VFID on a Chinese caption benchmark, while demonstrating strong zero-shot generalization across six additional languages—English, Korean, French, Japanese, Russian, and German.
📝 Abstract
Video subtitle removal aims to distinguish text overlays from background content while preserving temporal coherence. Existing diffusion-based methods require explicit mask sequences during both training and inference, which restricts their practical deployment. In this paper, we present CLEAR (Context-aware Learning for End-to-end Adaptive Video Subtitle Removal), a mask-free framework that achieves truly end-to-end inference through context-aware adaptive learning. Our two-stage design decouples prior extraction from generative refinement: Stage I learns disentangled subtitle representations via self-supervised orthogonality constraints on dual encoders, while Stage II employs LoRA-based adaptation with generation feedback for dynamic context adjustment. Notably, our method trains only 0.77% of the base diffusion model's parameters. On Chinese subtitle benchmarks, CLEAR outperforms mask-dependent baselines by +6.77 dB PSNR and -74.7% VFID, and demonstrates superior zero-shot generalization across six languages (English, Korean, French, Japanese, Russian, German), enabled by our generation-driven feedback mechanism, which ensures robust subtitle removal without ground-truth masks at inference.
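The orthogonality constraint in Stage I can be illustrated with a minimal sketch: penalize the cosine similarity between features from the subtitle encoder and the background encoder so that the two representations disentangle. The function name, feature shapes, and loss form here are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def orthogonality_loss(subtitle_feats, background_feats, eps=1e-8):
    """Hypothetical disentanglement loss: mean squared cosine similarity
    between per-sample features (batch, dim) from the two encoders.
    It is 0 when every feature pair is orthogonal, ~1 when parallel."""
    s = subtitle_feats / (np.linalg.norm(subtitle_feats, axis=1, keepdims=True) + eps)
    b = background_feats / (np.linalg.norm(background_feats, axis=1, keepdims=True) + eps)
    cos = np.sum(s * b, axis=1)          # cosine similarity per sample
    return float(np.mean(cos ** 2))      # squared, so sign does not matter

# Orthogonal feature pairs give ~0 loss; identical features give ~1.
f1 = np.array([[1.0, 0.0], [0.0, 1.0]])
f2 = np.array([[0.0, 1.0], [1.0, 0.0]])
print(orthogonality_loss(f1, f2))  # ≈ 0.0
print(orthogonality_loss(f1, f1))  # ≈ 1.0
```

Minimizing such a loss alongside the reconstruction objective pushes the dual encoders toward complementary, non-overlapping representations of text and background.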