🤖 AI Summary
Existing video editing methods often suffer from inaccurate localization, temporal flickering, and inconsistent edits under occlusion, viewpoint changes, and rapid motion. This work proposes an occlusion-aware physics-semantic keyframe selection framework that automatically identifies an optimal anchor frame by combining three criteria: structural completeness, cycle-consistent tracking stability, and vision-language-based attribute visibility. The selected keyframe is propagated through bidirectional tracking to generate dense spatiotemporal masks, which serve as auxiliary supervision for a diffusion-based video editing backbone. By shifting occlusion handling from explicit reconstruction to reliable anchor frame selection, the approach targets precise, temporally consistent video editing on challenging benchmarks without requiring manual annotations.
📝 Abstract
Video editing has recently achieved remarkable progress with diffusion-based generative models, enabling diverse object-level manipulations from natural language instructions. However, existing methods often struggle under occlusion, viewpoint changes, and fast object motion, where unreliable visual observations lead to inaccurate localization, temporal flickering, and inconsistent edits. In this work, we identify the absence of reliable visual anchors as a fundamental bottleneck in occlusion-robust video editing. To address this issue, we propose an occlusion-aware physics-semantic keyframe selection framework that automatically identifies an optimal anchor frame for downstream editing. Specifically, our method evaluates candidate frames from three complementary perspectives: structural completeness, to avoid truncated observations; cycle-consistent tracking stability, to measure physical reliability; and vision-language-based attribute visibility, to ensure semantic clarity. The selected keyframe is then propagated through bidirectional tracking to generate dense spatiotemporal masks, which are used as auxiliary supervision for a diffusion-based video editing backbone. By transforming occlusion handling from explicit reconstruction into reliable anchor selection, our framework enables precise and temporally consistent editing without requiring manual annotations. Extensive experiments on challenging video editing benchmarks demonstrate the effectiveness of our method.
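To make the three-criterion selection concrete, the following is a minimal sketch rather than the authors' implementation. The abstract does not say how the criteria are computed or fused, so everything here is an assumption for illustration: the function names `cycle_stability` and `select_keyframe`, the exponential mapping from forward-backward error to a stability score, the `scale` parameter, and the weighted-sum fusion. The completeness and visibility scores are taken as precomputed inputs in [0, 1].

```python
import numpy as np


def cycle_stability(points: np.ndarray, round_trip: np.ndarray, scale: float = 5.0) -> float:
    """Map forward-backward tracking error to a stability score in (0, 1].

    `points` are (N, 2) locations in the candidate frame; `round_trip` are the
    same points after tracking forward and then back to the candidate frame.
    The exponential mapping and `scale` are assumptions for illustration.
    """
    error = np.linalg.norm(points - round_trip, axis=1).mean()
    return float(np.exp(-error / scale))


def select_keyframe(completeness, stability, visibility, weights=(1.0, 1.0, 1.0)):
    """Return the index of the best-scoring frame and all per-frame scores.

    Each argument is a length-T sequence of scores in [0, 1]. The weighted
    sum is an assumed fusion rule; the source does not specify one.
    """
    criteria = np.stack([completeness, stability, visibility], axis=1).astype(float)
    w = np.asarray(weights, dtype=float)
    scores = criteria @ (w / w.sum())  # normalized weighted sum per frame
    return int(np.argmax(scores)), scores


# Toy example with three candidate frames (all numbers are made up).
pts = np.array([[10.0, 10.0], [20.0, 20.0]])
stab = [
    cycle_stability(pts, pts + np.array([3.0, 4.0])),  # round trip drifts by 5 px
    cycle_stability(pts, pts),                         # perfect round trip
    cycle_stability(pts, pts),
]
best, scores = select_keyframe(
    completeness=[1.0, 0.5, 1.0],
    stability=stab,
    visibility=[0.5, 0.5, 0.8],
)
print(best, scores)
```

In this toy setup the first frame is penalized for tracking drift and the second for a truncated object, so the third frame is chosen as the anchor. A real pipeline would replace the hand-set scores with outputs from a segmentation mask check, a point tracker, and a vision-language model.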