🤖 AI Summary
To address unstable mattes and blurred boundaries in auxiliary-free human video matting under complex or ambiguous backgrounds, this paper proposes MatAnyone, a memory-based framework for target-assigned video matting. Its consistent memory propagation module performs region-adaptive memory fusion, adaptively integrating memory from the previous frame to keep core regions semantically stable while preserving fine detail along object boundaries. The key contributions are: (1) a consistent memory propagation module built on region-adaptive memory fusion; (2) a larger, high-quality, and diverse video matting dataset for robust training; and (3) a training strategy that efficiently leverages large-scale segmentation data to boost matting stability. Extensive experiments show that MatAnyone delivers robust, accurate, and temporally stable matting in challenging real-world scenarios, including dynamic backgrounds, rapid motion, and semi-transparent hair, outperforming existing methods.
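One way to picture the segmentation co-training idea is sketched below. This is a hypothetical illustration under assumed conventions, not the paper's actual objective: it supposes that frames carrying only binary segmentation masks supervise the predicted alpha in the confident core region (recovered here by eroding/dilating the mask), while true matting frames receive full alpha supervision. The helpers `core_region` and `co_training_loss` are made-up names.

```python
import torch
import torch.nn.functional as F


def core_region(mask: torch.Tensor, k: int = 15) -> torch.Tensor:
    """Return 1 inside the confident foreground/background "core" and 0 in the
    uncertain boundary band. `mask` is (B, 1, H, W) with values in {0, 1}."""
    pad = k // 2
    dilated = F.max_pool2d(mask, k, stride=1, padding=pad)               # grow foreground
    eroded = 1.0 - F.max_pool2d(1.0 - mask, k, stride=1, padding=pad)    # shrink foreground
    boundary = dilated - eroded                                          # uncertain band
    return 1.0 - boundary


def co_training_loss(alpha_pred, alpha_gt=None, seg_mask=None):
    """Hypothetical co-training objective (illustrative only):
    - matting frames (alpha_gt given): full L1 supervision on the alpha matte;
    - segmentation frames (seg_mask given): supervise only the core region,
      where the binary mask is a reliable proxy for alpha."""
    if alpha_gt is not None:
        return F.l1_loss(alpha_pred, alpha_gt)
    core = core_region(seg_mask)
    return (core * (alpha_pred - seg_mask).abs()).sum() / core.sum().clamp(min=1.0)


if __name__ == "__main__":
    alpha_pred = torch.rand(2, 1, 64, 64)
    seg_mask = (torch.rand(2, 1, 64, 64) > 0.5).float()
    print(co_training_loss(alpha_pred, seg_mask=seg_mask).item())
```

The point of the sketch is simply that segmentation data, which lacks soft alpha values along the boundary, can still constrain the stable interior of the matte, which is where temporal stability matters most.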
📝 Abstract
Auxiliary-free human video matting methods, which rely solely on input frames, often struggle with complex or ambiguous backgrounds. To address this, we propose MatAnyone, a robust framework tailored for target-assigned video matting. Specifically, building on a memory-based paradigm, we introduce a consistent memory propagation module via region-adaptive memory fusion, which adaptively integrates memory from the previous frame. This ensures semantic stability in core regions while preserving fine-grained details along object boundaries. For robust training, we present a larger, high-quality, and diverse dataset for video matting. Additionally, we incorporate a novel training strategy that efficiently leverages large-scale segmentation data, boosting matting stability. With this new network design, dataset, and training strategy, MatAnyone delivers robust and accurate video matting results in diverse real-world scenarios, outperforming existing methods.
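To make the region-adaptive memory fusion concrete, below is a minimal PyTorch sketch, not the released implementation: it assumes a small head that predicts a per-pixel change probability from the memory-queried value and the previous frame's value, then blends the two accordingly. All module and tensor names (`RegionAdaptiveMemoryFusion`, `value_query`, `value_prev`) are illustrative assumptions.

```python
import torch
import torch.nn as nn


class RegionAdaptiveMemoryFusion(nn.Module):
    """Illustrative sketch: blend the value queried from memory for the current
    frame with the value carried over from the previous frame, weighted by a
    predicted per-pixel "change probability" (names are assumptions)."""

    def __init__(self, channels: int):
        super().__init__()
        # Small head estimating how much each location changed since frame t-1.
        self.change_head = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, kernel_size=1),
        )

    def forward(self, value_query: torch.Tensor, value_prev: torch.Tensor) -> torch.Tensor:
        # value_query: value read from the memory bank for frame t  (B, C, H, W)
        # value_prev:  value propagated from frame t-1               (B, C, H, W)
        change_prob = torch.sigmoid(
            self.change_head(torch.cat([value_query, value_prev], dim=1))
        )
        # Low change probability reuses the previous-frame value (stable core
        # regions); high change probability takes the freshly queried value
        # (typically boundaries), preserving fine-grained detail.
        return change_prob * value_query + (1.0 - change_prob) * value_prev


if __name__ == "__main__":
    fusion = RegionAdaptiveMemoryFusion(channels=64)
    v_query, v_prev = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
    print(fusion(v_query, v_prev).shape)  # torch.Size([1, 64, 32, 32])
```

In this sketch, the blend weight is what makes the fusion "region-adaptive": interior pixels lean on memory for semantic stability, while boundary pixels follow the current query, mirroring the behavior described in the abstract.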