🤖 AI Summary
This work addresses systematic calibration bias in watch-time prediction for short-video recommendation, where the long-tailed distribution of watch durations leads regression models to overestimate short views and underestimate long ones, even when they appear calibrated in aggregate. Rather than replacing an existing predictor, the authors propose the Distribution-Aware Debiasing Framework (DADF), which applies a second-stage multiplicative residual correction on top of it. DADF combines three designs: a dynamic distribution-aware target transformation that stabilizes long-tailed correction targets, a debias-factor-aware module that conditions residual modeling on inference-time observable factors such as video duration, and a multi-label-aware module that uses auxiliary prediction signals from engagement heads. Experiments on public short-video benchmarks and a large-scale industrial ranking system show consistent gains in both pointwise accuracy and ranking quality. In the industrial setting, DADF achieves a 1.88 percentage-point WUAUC gain over the production baseline and a 12.57% reduction in MAE, and online A/B testing shows a statistically significant 0.347% lift in average time spent per device.
📝 Abstract
Watch-time prediction is a central regression task in short-video recommender systems, where labels are highly long-tailed and residual errors vary systematically across observed watch-time regions. In practice, a model may appear globally calibrated while still overestimating short views and underestimating long views, because opposite errors cancel out in aggregate. Existing methods mainly improve the first-stage watch-time predictor, but often leave such residual distributional bias insufficiently corrected. We propose DADF, a distribution-aware debiasing framework for watch-time regression. Instead of replacing a deployed predictor, DADF performs second-stage multiplicative residual correction on top of it. DADF combines three complementary designs: a dynamic distribution-aware transformation for stabilizing long-tailed correction targets, a debias-factor-aware module for modeling heterogeneous residual patterns using inference-time observable factors, especially video duration, and a multi-label-aware module that exploits auxiliary prediction signals from engagement heads. We evaluate DADF on public short-video benchmarks and a large-scale industrial ranking system. DADF consistently improves both pointwise accuracy and ranking quality across datasets and backbones. In the industrial setting, it achieves a 1.88 percentage-point WUAUC gain over the production baseline, reduces MAE by 12.57%, and yields a statistically significant 0.347% lift in average time spent per device in online A/B testing. These results demonstrate that DADF effectively mitigates local calibration bias and provides a practical plug-in solution for debiasing long-tailed continuous targets. The source code is available at https://github.com/liuzhao09/DADF.
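The two ideas the abstract relies on, opposite errors cancelling in aggregate and a second-stage multiplicative correction keyed on an inference-time observable factor, can be illustrated with a small sketch. This is a toy illustration under stated assumptions, not the DADF implementation: the data, the shrink-to-mean first-stage predictor, and the helper names `fit_bucket_factors` and `apply_correction` are all hypothetical, and a per-duration lookup table stands in for the learned correction modules the paper describes.

```python
from collections import defaultdict


def fit_bucket_factors(preds, labels, buckets):
    """Per-bucket multiplicative factor: sum(labels) / sum(preds)."""
    pred_sum = defaultdict(float)
    label_sum = defaultdict(float)
    for p, y, b in zip(preds, labels, buckets):
        pred_sum[b] += p
        label_sum[b] += y
    # Fall back to 1.0 (no correction) if a bucket's predictions sum to zero.
    return {
        b: (label_sum[b] / pred_sum[b]) if pred_sum[b] > 0 else 1.0
        for b in pred_sum
    }


def apply_correction(preds, buckets, factors):
    """Multiply each first-stage prediction by its bucket's factor."""
    return [p * factors.get(b, 1.0) for p, b in zip(preds, buckets)]


def mean_bias(preds, labels):
    """Average signed error; positive means overestimation."""
    return sum(p - y for p, y in zip(preds, labels)) / len(labels)


def mae(preds, labels):
    """Mean absolute error."""
    return sum(abs(p - y) for p, y in zip(preds, labels)) / len(labels)


# Hypothetical toy data: video duration (seconds) and observed watch time.
durations = [10, 10, 10, 60, 60, 60]
watch_time = [2.0, 4.0, 6.0, 20.0, 40.0, 60.0]

# Assumed stand-in for a deployed predictor: it shrinks every label halfway
# toward the global mean, so it is calibrated globally but not locally.
global_mean = sum(watch_time) / len(watch_time)
first_stage = [global_mean + 0.5 * (y - global_mean) for y in watch_time]

# Second stage: fit one factor per duration bucket, then rescale.
factors = fit_bucket_factors(first_stage, watch_time, durations)
corrected = apply_correction(first_stage, durations, factors)

global_bias_before = mean_bias(first_stage, watch_time)
short_bias_before = mean_bias(first_stage[:3], watch_time[:3])
long_bias_before = mean_bias(first_stage[3:], watch_time[3:])
short_bias_after = mean_bias(corrected[:3], watch_time[:3])
long_bias_after = mean_bias(corrected[3:], watch_time[3:])
mae_before = mae(first_stage, watch_time)
mae_after = mae(corrected, watch_time)

print(f"global bias before: {global_bias_before:+.2f}")
print(f"short/long bias before: {short_bias_before:+.2f} / {long_bias_before:+.2f}")
print(f"short/long bias after:  {short_bias_after:+.2f} / {long_bias_after:+.2f}")
print(f"MAE before/after: {mae_before:.2f} / {mae_after:.2f}")
```

In this toy setup the first-stage predictions have zero global bias while overestimating the short-video group by 9 seconds and underestimating the long-video group by 9 seconds, which is the cancellation effect the abstract describes. The factors are fitted and evaluated on the same six points, so the per-bucket bias vanishes by construction; a real second-stage model would be trained and evaluated on separate data. The example also makes duration buckets coincide exactly with short and long watch times, which real data would not do.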