AI Summary
Deploying video matting models on edge devices faces dual challenges: quantization-induced accuracy degradation and temporal inconsistency. This paper introduces the first post-training quantization (PTQ) framework tailored for video matting, featuring three key innovations: (1) a two-stage PTQ strategy that decouples block-level and global quantization; (2) a statistics-driven Global Affine Calibration (GAC) to mitigate distribution shift under ultra-low-bit (e.g., ≤4-bit) quantization; and (3) an Optical Flow Assistance (OFA) mechanism, a local-global collaborative quantization scheme that explicitly enforces inter-frame consistency. The method supports 4-bit and lower precision quantization, achieving near-full-precision performance across multiple benchmarks (ΔF-score < 0.01), reducing FLOPs by 8×, and decreasing temporal jitter by 62%. It outperforms existing state-of-the-art quantization approaches in both accuracy and temporal stability.
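The GAC idea described above can be illustrated with a minimal sketch. The function names and the moment-matching formulation below are assumptions for illustration, not the paper's exact procedure: after quantization, a per-channel affine correction (scale `a`, shift `b`) is fitted so that the quantized activation statistics match the full-precision ones, compensating for a global distribution shift.

```python
import numpy as np

def global_affine_calibration(fp_acts, q_acts):
    """Hypothetical per-channel affine correction (sketch of the GAC idea).

    Fits scale/shift so corrected quantized statistics match the
    full-precision ones: a = std_fp / std_q, b = mean_fp - a * mean_q.
    Both inputs have shape (N, C), channels last.
    """
    mu_fp, sd_fp = fp_acts.mean(0), fp_acts.std(0)
    mu_q, sd_q = q_acts.mean(0), q_acts.std(0)
    a = sd_fp / (sd_q + 1e-8)   # rescale to full-precision spread
    b = mu_fp - a * mu_q        # re-center to full-precision mean
    return a, b

# Toy check: a purely affine distribution shift is fully undone.
rng = np.random.default_rng(0)
fp = rng.normal(0.0, 1.0, size=(1024, 4))
q = 0.5 * fp + 0.3            # simulate a global shift from quantization
a, b = global_affine_calibration(fp, q)
corrected = a * q + b
print(np.allclose(corrected, fp))
```

In practice the shift induced by quantization is only approximately affine, so such a correction reduces, rather than eliminates, the cumulative statistical distortion.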
Abstract
Video matting is crucial for applications such as film production and virtual reality, yet deploying its computationally intensive models on resource-constrained devices presents challenges. Quantization is a key technique for model compression and acceleration, but Post-Training Quantization (PTQ), despite its efficiency, is still in its nascent stage for video matting, facing significant hurdles in maintaining accuracy and temporal coherence. To address these challenges, this paper proposes a novel and general PTQ framework specifically designed for video matting models, marking, to the best of our knowledge, the first systematic attempt in this domain. Our contributions include: (1) A two-stage PTQ strategy that combines block-reconstruction-based optimization for fast, stable initial quantization and local dependency capture, followed by a global calibration of quantization parameters to minimize accuracy loss. (2) A Statistically-Driven Global Affine Calibration (GAC) method that enables the network to compensate for cumulative statistical distortions arising from factors such as neglected BN layer effects, reducing the error of existing PTQ methods on video matting tasks by up to 20%. (3) An Optical Flow Assistance (OFA) component that leverages temporal and semantic priors from frames to guide the PTQ process, enhancing the model's ability to distinguish moving foregrounds in complex scenes and ultimately achieving near full-precision performance even under ultra-low-bit quantization. Comprehensive quantitative and visual results show that our PTQ4VM achieves state-of-the-art accuracy across different bit-widths compared with existing quantization methods. We highlight that the 4-bit PTQ4VM even achieves performance close to its full-precision counterpart while enjoying 8× FLOP savings.
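The temporal prior that OFA exploits can be sketched as follows. The function names, the integer-pixel backward warp, and the MSE consistency term are illustrative assumptions, not the paper's exact formulation: the previous frame's alpha matte is warped to the current frame via optical flow, and deviation between the current matte and the warped one serves as an inter-frame consistency signal that a calibration objective could penalize.

```python
import numpy as np

def warp_with_flow(prev_alpha, flow):
    """Backward-warp an alpha matte with integer-pixel optical flow (sketch).

    flow[..., 0] / flow[..., 1] are per-pixel x / y displacements from
    frame t-1 to frame t; samples are nearest-pixel with clamped borders.
    """
    h, w = prev_alpha.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(ys - np.round(flow[..., 1]).astype(int), 0, h - 1)
    src_x = np.clip(xs - np.round(flow[..., 0]).astype(int), 0, w - 1)
    return prev_alpha[src_y, src_x]

def temporal_consistency_loss(alpha_t, alpha_prev, flow):
    """MSE between the current matte and the flow-warped previous matte --
    the kind of inter-frame term an OFA-style objective could minimize."""
    warped = warp_with_flow(alpha_prev, flow)
    return float(np.mean((alpha_t - warped) ** 2))

# Toy check: a foreground patch that moves one pixel right, with matching
# flow, yields zero temporal jitter under this measure.
prev = np.zeros((6, 6)); prev[2:4, 1:3] = 1.0
cur = np.zeros((6, 6));  cur[2:4, 2:4] = 1.0
flow = np.zeros((6, 6, 2)); flow[..., 0] = 1.0   # +1 px in x everywhere
print(temporal_consistency_loss(cur, prev, flow))  # -> 0.0
```

A real pipeline would use sub-pixel (bilinear) warping and occlusion masking, but the sketch captures why flow guidance lets the quantizer trade a little per-frame error for much lower inter-frame jitter.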