🤖 AI Summary
Existing video subtitle removal methods typically rely on two-stage pipelines and explicit masks, where segmentation errors often degrade editing quality. This work proposes a mask-free, single-stage framework for subtitle removal that uses a one-step diffusion Transformer trained under Rectified Flow to perform local conditional editing directly on raw videos. Theoretical analysis shows that, under localized edits, the conditional optimal transport map and its induced velocity field are Lipschitz continuous, which supports one-step sampling. By combining a hybrid training strategy that occasionally conditions on a clean first-frame latent with a chunk-based streaming inference mechanism, the method efficiently processes 1440p videos of arbitrary length, maintains temporal consistency in dynamic scenes, and avoids stitching artifacts.
📝 Abstract
Recent breakthroughs in video diffusion models have significantly accelerated the development of video editing techniques. However, existing methods often inpaint video frames from masked input, which requires extracting the target mask in advance, so segmentation precision directly affects completion quality. In this paper, we present SEDiT, a novel one-stage video Subtitle Erasure approach via One-step Diffusion Transformer. We introduce a mask-free inference approach that erases the targeted subtitles directly. The proposed one-stage framework mitigates the sub-optimality inherent in the two-stage processing of prior models. Because subtitle removal is a localized editing task in which most pixels remain unchanged, the underlying distribution shift is minimal, making the task well suited to one-step generation under rectified flow. We empirically validate the reliability of one-step denoising and further provide a formal theoretical justification: under the localized-editing structure of subtitle removal, the conditional optimal transport (OT) map and its induced rectified flow velocity field are Lipschitz continuous with respect to the latent variable, which underpins the theoretical feasibility of one-step sampling. To address the challenge of long-term temporal consistency, we adopt a hybrid training strategy that occasionally conditions the model on a clean first-frame latent. This facilitates temporal continuity, allowing each segment during inference to leverage the output of its predecessor. To avoid visible seams caused by cropping and reinserting processed regions, particularly in scenes with substantial motion, we feed the original video directly into SEDiT. Thanks to one-step and chunk-wise streaming inference, our method can efficiently handle native 1440p videos of arbitrary length.
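As a rough illustration of how one-step rectified-flow sampling and chunk-wise streaming could fit together, the sketch below takes a single Euler step over the whole interval t ∈ [0, 1] for each chunk and passes the last output frame of one chunk to the next as its conditioning frame. It is not the SEDiT implementation: the function names, the toy velocity field standing in for the diffusion Transformer, the choice of the source latent as the starting point of the flow, and the use of the previous chunk's final frame as the condition are all assumptions made for illustration.

```python
import numpy as np

def one_step_sample(velocity_fn, z_src, cond):
    """One Euler step with step size 1: z_1 = z_0 + v(z_0, t=0, cond)."""
    return z_src + velocity_fn(z_src, 0.0, cond)

def stream_erase(latents, velocity_fn, chunk_len):
    """Process a (T, D) latent sequence chunk by chunk.

    Hypothetical scheme: each chunk after the first receives the last
    output frame of the previous chunk as its clean conditioning frame.
    """
    outputs = []
    prev_clean = None  # no conditioning frame for the first chunk
    for start in range(0, latents.shape[0], chunk_len):
        chunk = latents[start:start + chunk_len]
        out = one_step_sample(velocity_fn, chunk, prev_clean)
        outputs.append(out)
        prev_clean = out[-1]
    return np.concatenate(outputs, axis=0)

def toy_velocity(z, t, cond):
    """Stand-in for the learned velocity field (assumption, not a real model).

    Treats the last two latent dimensions as the "subtitle" region and
    pushes them to zero, leaving every other dimension untouched, which
    mimics a localized edit. It ignores t and cond.
    """
    mask = np.zeros(z.shape[-1])
    mask[-2:] = 1.0
    return -z * mask

rng = np.random.default_rng(0)
video_latents = rng.normal(size=(10, 6))   # 10 frames, 6-dim toy latents
erased = stream_erase(video_latents, toy_velocity, chunk_len=4)
```

Because each chunk needs only one forward pass and one carried-over frame, the memory footprint in this sketch depends on `chunk_len` rather than on the total number of frames, which is the property that makes arbitrary-length input practical.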