🤖 AI Summary
Existing evaluation metrics for image and video object removal often misalign with human perception, frequently favoring results that rely on copy-pasting, exhibit blurriness, or overlook local artifacts. To address this limitation, this work proposes a perceptually aligned Removal Coherence (RC) evaluation framework, comprising a spatial coherence metric (RC-S) based on sliding-window feature comparisons between masked and background regions and a temporal consistency metric (RC-T) that tracks feature distributions within shared restored regions across adjacent frames. The authors also introduce PROVE-Bench, a two-tier real-world benchmark with a paired subset (PROVE-M) and a challenging subset without ground truth (PROVE-H). Experiments across image and video benchmarks show that RC aligns substantially better with human judgments than existing metrics. The code and dataset are publicly released.
📝 Abstract
Evaluating object removal in images and videos remains challenging: the task is inherently one-to-many, and existing metrics frequently disagree with human perception. Full-reference metrics reward copy-paste behaviors over genuine erasure; no-reference metrics suffer from systematic biases such as favoring blurry results; and global temporal metrics are insensitive to localized artifacts within edited regions. To address these limitations, we propose RC (Removal Coherence), a pair of perception-aligned metrics: RC-S, which measures spatial coherence via sliding-window feature comparison between masked and background regions, and RC-T, which measures temporal consistency via distribution tracking within shared restored regions across adjacent frames. To validate RC and support community benchmarking, we further introduce PROVE-Bench, a two-tier real-world benchmark comprising PROVE-M, an 80-video paired dataset with motion augmentation, and PROVE-H, a 100-video challenging subset without ground truth. Together, the RC metrics and PROVE-Bench form the PROVE (Perceptual RemOVal cohErence) evaluation framework for visual media. Experiments across diverse image and video benchmarks demonstrate that RC achieves substantially stronger alignment with human judgments than existing evaluation protocols. The code for the RC metrics and PROVE-Bench is publicly available at: https://github.com/xiaomi-research/prove/.
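To make the two ideas in the abstract concrete, here is a toy sketch in Python. It is not the authors' implementation: the function names, the window size and stride, the coverage thresholds, the use of cosine similarity, and the mean/standard-deviation comparison are all assumptions chosen for illustration, and the feature maps are assumed to come from some unspecified feature extractor. The released repository should be consulted for the actual definitions of RC-S and RC-T.

```python
import numpy as np


def window_means(features, mask, size, stride):
    """Mean feature vector and mask coverage for every sliding window."""
    h, w, c = features.shape
    means, coverage = [], []
    for y in range(0, h - size + 1, stride):
        for x in range(0, w - size + 1, stride):
            patch = features[y:y + size, x:x + size]
            means.append(patch.reshape(-1, c).mean(axis=0))
            coverage.append(mask[y:y + size, x:x + size].mean())
    return np.array(means), np.array(coverage)


def toy_spatial_coherence(features, mask, size=4, stride=2):
    """Hypothetical RC-S-like score: how well masked windows match background.

    features: (H, W, C) float array; mask: (H, W) bool array (True = removed).
    Each mostly-masked window is matched to its most similar pure-background
    window by cosine similarity; the score is the mean of those best matches.
    """
    means, coverage = window_means(features, mask, size, stride)
    inside = means[coverage > 0.5]    # assumption: "mostly masked" threshold
    outside = means[coverage == 0.0]  # assumption: windows with no masked pixel
    if len(inside) == 0 or len(outside) == 0:
        raise ValueError("need both masked and background windows")
    inside = inside / (np.linalg.norm(inside, axis=1, keepdims=True) + 1e-8)
    outside = outside / (np.linalg.norm(outside, axis=1, keepdims=True) + 1e-8)
    similarity = inside @ outside.T
    return float(similarity.max(axis=1).mean())


def toy_temporal_gap(feat_a, feat_b, mask_a, mask_b):
    """Hypothetical RC-T-like gap between two adjacent frames (lower = steadier).

    Compares per-channel mean and standard deviation of the features inside
    the region that is masked in both frames.
    """
    shared = mask_a & mask_b
    if not shared.any():
        raise ValueError("masks do not overlap")
    a, b = feat_a[shared], feat_b[shared]  # each has shape (N, C)
    mean_gap = np.abs(a.mean(axis=0) - b.mean(axis=0)).mean()
    std_gap = np.abs(a.std(axis=0) - b.std(axis=0)).mean()
    return float(mean_gap + std_gap)


# Toy data: a 16x16 feature map with a square removed region.
mask = np.zeros((16, 16), dtype=bool)
mask[4:12, 4:12] = True

seamless = np.ones((16, 16, 3))   # fill is indistinguishable from background
mismatched = np.zeros((16, 16, 3))
mismatched[..., 0] = 1.0          # background features point along channel 0
mismatched[mask] = [0.0, 1.0, 0.0]  # fill features point along channel 1

good_score = toy_spatial_coherence(seamless, mask)
bad_score = toy_spatial_coherence(mismatched, mask)
steady_gap = toy_temporal_gap(seamless, seamless, mask, mask)
shifted_gap = toy_temporal_gap(seamless, seamless + 1.0, mask, mask)
```

In this toy setup the seamless fill scores close to 1 and the mismatched fill close to 0, while the temporal gap is zero for identical frames and grows when the restored region's features shift between frames. Because both scores are computed only from windows or pixels tied to the mask, they respond to local artifacts that a global, whole-frame metric could average away, which is the motivation the abstract gives for RC.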