🤖 AI Summary
Evaluating learned video codecs by directly averaging rate-distortion (RD) curves across test sequences can bias the comparison: a single outlier sequence can disproportionately influence the mean curve and obscure a codec's performance on the majority of sequences.
Method: The authors study the issue through analysis of a simple case and through experiments with two recent learned video codecs on the UVG dataset, comparing the BD-rate computed from the averaged RD curve against the average of per-sequence BD-rates.
Contribution/Results: They demonstrate that conclusions drawn from the averaged RD curve can contradict those obtained by averaging per-sequence metrics such as the BD-rate, particularly when the videos in a dataset have varying characteristics and operating ranges. The paper advocates following the established practice in traditional video coding: report per-sequence RD curves and compute test-set performance as the average of per-sequence metrics. The UVG case study supports this approach as a fairer and more accurate basis for comparing codecs, and the paper offers both an analytical justification and practical guidance for evaluation in learned video compression.
📝 Abstract
This paper demonstrates how the prevalent practice in the learned video compression community of averaging rate-distortion (RD) curves across a test video set can lead to misleading conclusions when evaluating codec performance. Through analysis of a simple case and experimental results with two recent learned video codecs, we show how averaged RD curves can mislead comparative evaluation of different codecs, particularly when the videos in a dataset have varying characteristics and operating ranges. We illustrate how a single video whose RD characteristics differ from the rest of the test set can disproportionately influence the average RD curve, potentially overshadowing a codec's superior performance on most individual sequences. Using two recent learned video codecs on the UVG dataset as a case study, we demonstrate that computing performance metrics, such as the BD-rate, from the average RD curve suggests conclusions that contradict those reached by averaging the per-sequence metrics. Hence, we argue that the learned video compression community should also report per-sequence RD curves, and that performance metrics for a test set should be computed as the average of per-sequence metrics, as is the established practice in traditional video coding, to ensure fair and accurate codec comparisons.
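The effect described in the abstract can be reproduced with a small synthetic sketch. Everything below is an illustrative assumption rather than the paper's data or code: the RD points are made up, `bd_rate` is a common cubic-fit formulation of the Bjøntegaard delta rate, and "averaging RD curves" is taken to mean averaging rate and PSNR across sequences at each quality level. Hypothetical codec B saves 10% bitrate on three sequences and costs 20% more on one high-bitrate outlier.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjontegaard delta rate (%) of test vs. anchor; negative means test saves bitrate."""
    # Fit log-rate as a cubic polynomial of PSNR for each codec.
    fit_a = np.polyfit(psnr_anchor, np.log(rate_anchor), 3)
    fit_t = np.polyfit(psnr_test, np.log(rate_test), 3)
    # Integrate both fits over the overlapping PSNR interval.
    lo = max(np.min(psnr_anchor), np.min(psnr_test))
    hi = min(np.max(psnr_anchor), np.max(psnr_test))
    int_a, int_t = np.polyint(fit_a), np.polyint(fit_t)
    area_a = np.polyval(int_a, hi) - np.polyval(int_a, lo)
    area_t = np.polyval(int_t, hi) - np.polyval(int_t, lo)
    # Average log-rate difference, converted to a percentage.
    return float((np.exp((area_t - area_a) / (hi - lo)) - 1.0) * 100.0)

# Synthetic RD points (assumed): bits per pixel and PSNR at four quality levels.
base_rate = np.array([0.02, 0.04, 0.08, 0.16])
base_psnr = np.array([32.0, 34.5, 37.0, 39.0])

# name -> (rate scale, PSNR offset in dB, rate of codec B relative to codec A)
sequences = {
    "seq1": (1.0, 0.0, 0.90),
    "seq2": (1.5, 1.0, 0.90),
    "seq3": (2.0, -1.0, 0.90),
    "outlier": (20.0, -4.0, 1.20),  # high-bitrate sequence where B is worse
}

per_seq, rates_a, rates_b, psnrs = {}, [], [], []
for name, (scale, offset, factor) in sequences.items():
    rate_a = base_rate * scale
    rate_b = rate_a * factor          # same PSNR, scaled bitrate
    psnr = base_psnr + offset
    per_seq[name] = bd_rate(rate_a, psnr, rate_b, psnr)
    rates_a.append(rate_a)
    rates_b.append(rate_b)
    psnrs.append(psnr)

# Approach 1: average of per-sequence BD-rates.
mean_of_per_seq = float(np.mean(list(per_seq.values())))

# Approach 2: BD-rate of the RD curve averaged across sequences.
avg_psnr = np.mean(psnrs, axis=0)
bd_of_avg_curve = bd_rate(np.mean(rates_a, axis=0), avg_psnr,
                          np.mean(rates_b, axis=0), avg_psnr)

print(f"mean of per-sequence BD-rates: {mean_of_per_seq:+.2f}%")
print(f"BD-rate of averaged RD curve:  {bd_of_avg_curve:+.2f}%")
```

With these toy numbers, the average of per-sequence BD-rates is about -2.5% (B looks better), while the BD-rate of the averaged curve is about +14.5% (B looks worse), because the outlier's bitrates dominate the averaged rate axis. The sign flip depends on the chosen scales; it illustrates the mechanism only and does not reproduce the paper's UVG results.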