Learning Stochastic Bridges for Video Object Removal via Video-to-Video Translation

📅 2026-01-17
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing video object removal methods predominantly rely on diffusion models initialized from Gaussian noise, often neglecting the structural and contextual priors inherent in the original video, which can lead to incomplete removal or physically implausible generations. This work reframes the task as a video-to-video translation problem and introduces a stochastic bridge–based generative framework that leverages the input video as a strong structural prior. To dynamically balance background fidelity and generative flexibility, the method incorporates an adaptive mask modulation mechanism. The proposed approach significantly improves visual quality and temporal consistency, particularly for large-scale object removal, yielding more complete and semantically coherent results that adhere to both scene context and physical plausibility.

πŸ“ Abstract
Existing video object removal methods predominantly rely on diffusion models following a noise-to-data paradigm, where generation starts from uninformative Gaussian noise. This approach discards the rich structural and contextual priors present in the original input video. Consequently, such methods often lack sufficient guidance, leading to incomplete object erasure or the synthesis of implausible content that conflicts with the scene's physical logic. In this paper, we reformulate video object removal as a video-to-video translation task via a stochastic bridge model. Unlike noise-initialized methods, our framework establishes a direct stochastic path from the source video (with objects) to the target video (objects removed). This bridge formulation effectively leverages the input video as a strong structural prior, guiding the model to perform precise removal while ensuring that the filled regions are logically consistent with the surrounding environment. To address the trade-off where strong bridge priors hinder the removal of large objects, we propose a novel adaptive mask modulation strategy. This mechanism dynamically modulates input embeddings based on mask characteristics, balancing background fidelity with generative flexibility. Extensive experiments demonstrate that our approach significantly outperforms existing methods in both visual quality and temporal consistency. The project page is https://bridgeremoval.github.io/.
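The abstract describes a stochastic bridge that connects the source video (with objects) to the target video (objects removed), rather than starting from pure Gaussian noise. The listing gives no equations, but one common bridge formulation is a Brownian bridge, whose mean interpolates the two endpoints and whose variance vanishes at both ends. A minimal toy sketch under that assumption (the function name and `sigma` parameter are illustrative, not from the paper):

```python
import numpy as np

def brownian_bridge_sample(x_src, x_tgt, t, sigma=1.0, rng=None):
    """Sample x_t on a Brownian bridge between x_src (t=0) and x_tgt (t=1).

    The mean interpolates linearly between the endpoints and the variance
    sigma^2 * t * (1 - t) vanishes at both ends, so the source video stays
    a structural prior throughout the trajectory (toy illustration only).
    """
    rng = np.random.default_rng() if rng is None else rng
    mean = (1.0 - t) * x_src + t * x_tgt
    std = sigma * np.sqrt(t * (1.0 - t))
    return mean + std * rng.standard_normal(x_src.shape)

# Endpoint behaviour: the bridge pins the trajectory to both videos.
x_src = np.zeros((4, 8, 8))   # toy "video latent" with objects
x_tgt = np.ones((4, 8, 8))    # toy latent with objects removed
assert np.allclose(brownian_bridge_sample(x_src, x_tgt, 0.0), x_src)
assert np.allclose(brownian_bridge_sample(x_src, x_tgt, 1.0), x_tgt)
```

This endpoint-pinning is the contrast the abstract draws with noise-to-data diffusion, where the t=1 end is uninformative Gaussian noise instead of the input video.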
Problem

Research questions and friction points this paper is trying to address.

video object removal
diffusion models
structural priors
video-to-video translation
temporal consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

stochastic bridge
video-to-video translation
adaptive mask modulation
video object removal
structural prior
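Among the contributions above, adaptive mask modulation is described as dynamically modulating input embeddings based on mask characteristics, so that large masks get more generative freedom and small masks keep tighter background fidelity. The paper's actual mechanism is not specified here; a toy sketch of that trade-off, keyed only on masked-area fraction (the function name and scale parameters are hypothetical):

```python
import numpy as np

def adaptive_mask_modulation(embed, mask, scale_small=1.0, scale_large=0.2):
    """Toy sketch: attenuate the source-video embedding inside the mask
    more strongly as the masked region grows, trading background fidelity
    for generative flexibility (names and scales are illustrative)."""
    area = mask.mean()                                   # fraction of masked pixels
    scale = scale_small + (scale_large - scale_small) * area
    # Outside the mask the embedding passes through unchanged; inside it
    # is scaled down, weakening the bridge prior where content must change.
    return embed * (1.0 - mask) + scale * embed * mask
```

With a near-empty mask the prior passes through almost untouched; with a frame-filling mask the in-mask embedding is heavily attenuated, which is the large-object regime the abstract says plain bridge priors struggle with.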
🔎 Similar Papers
No similar papers found.
Zijie Lou
MT Lab, Meitu Inc., Beijing 100083, China
Xiangwei Feng
MT Lab, Meitu Inc., Beijing 100083, China
Jiaxin Wang
Anhui University of Science and Technology
deep learning, semi-supervised learning
Xiaochao Qu
MT Lab, Meitu Inc., Beijing 100083, China
Luoqi Liu
Director of MT Lab; Meitu
Computer Vision
Ting Liu
MT Lab, Meitu Inc., Beijing 100083, China