ROSE: Remove Objects with Side Effects in Videos

📅 2025-08-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing video object removal methods struggle to eliminate five types of environmental side effects—shadows, reflections, illumination changes, transparency, and mirroring—primarily due to the absence of large-scale paired training data with ground-truth reconstructions. To address this, we propose the first video inpainting framework for joint removal of these multi-class side effects. Our method leverages 3D rendering to automatically synthesize massive paired video data and introduces ROSE-Bench, the first dedicated benchmark for evaluating such restoration tasks. We design a diffusion-based Transformer architecture augmented with differential mask supervision to precisely localize regions affected by side effects. Crucially, our framework systematically models and suppresses all five effects in a unified manner. Extensive experiments on ROSE-Bench demonstrate significant improvements over state-of-the-art methods, with strong generalization to real-world complex scenes.

📝 Abstract
Video object removal has achieved advanced performance due to the recent success of video generative models. However, when addressing the side effects of objects, e.g., their shadows and reflections, existing works struggle to eliminate these effects due to the scarcity of paired video data as supervision. This paper presents ROSE, short for Remove Objects with Side Effects, a framework that systematically studies an object's effects on its environment, which can be categorized into five common cases: shadows, reflections, light, translucency, and mirror. Given the challenges of curating paired videos exhibiting the aforementioned effects, we leverage a 3D rendering engine for synthetic data generation. We carefully construct a fully automatic data-preparation pipeline that simulates a large-scale paired dataset with diverse scenes, objects, shooting angles, and camera trajectories. ROSE is implemented as a video inpainting model built on a diffusion transformer. To localize all object-correlated areas, the entire video is fed into the model for reference-based erasing. Moreover, additional supervision is introduced to explicitly predict the areas affected by side effects, which can be revealed through the differential mask between the paired videos. To fully investigate model performance on removing the various side effects, we present a new benchmark, dubbed ROSE-Bench, incorporating both common scenarios and the five special side effects for comprehensive evaluation. Experimental results demonstrate that ROSE achieves superior performance compared to existing video object erasing models and generalizes well to real-world video scenarios. The project page is https://rose2025-inpaint.github.io/.
Problem

Research questions and friction points this paper is trying to address.

Removing object shadows and reflections in videos
Addressing scarcity of paired video data for supervision
Handling five side effect categories: shadows, reflections, light, translucency, and mirror
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetic data generation using 3D rendering engine
Diffusion transformer model for video inpainting
Differential mask supervision for side effect localization
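The differential mask supervision above can be illustrated with a minimal sketch (the function name and threshold are illustrative, not from the paper): given a synthetic pair of frames rendered with and without the object, the pixels that change between the two reveal both the object and its side effects, such as its shadow or reflection.

```python
import numpy as np

def differential_mask(frame_with, frame_without, threshold=0.05):
    """Binary mask of pixels that differ between a paired frame set.

    The masked region covers the object itself plus its side effects
    (e.g., shadows, reflections, lighting changes), since those pixels
    change once the object is removed from the rendered scene.
    """
    # Per-pixel absolute difference, averaged over color channels
    # and normalized to [0, 1].
    diff = np.abs(frame_with.astype(np.float32) - frame_without.astype(np.float32))
    diff = diff.mean(axis=-1) / 255.0
    # Threshold to obtain a binary supervision mask.
    return (diff > threshold).astype(np.uint8)
```

In the paper's setting this mask is derived automatically from the rendered video pairs and used as an extra prediction target, so the model learns to localize side-effect regions rather than only the object's silhouette.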
Authors
Chenxuan Miao, Zhejiang University
Yutong Feng, Alibaba Tongyi Lab | Tsinghua University
Jianshu Zeng, Peking University
Zixiang Gao, Peking University
Hantang Liu, KunByte AI
Yunfeng Yan, Zhejiang University
Donglian Qi, Zhejiang University
Xi Chen, The University of Hong Kong
Bin Wang, KunByte AI
Hengshuang Zhao, The University of Hong Kong