AI Summary
This work addresses the performance degradation in large-scale user response prediction systems during model architecture transitions, caused by high retraining costs and data retention constraints. To this end, we propose CrossAdapt, a two-stage transfer framework. In the offline phase, it enables rapid embedding migration via dimension-adaptive projection and reduces computational overhead through progressive network distillation and strategic sampling. In the online phase, it employs asymmetric co-distillation and a distribution-aware adaptation mechanism to balance historical knowledge preservation with rapid adaptation to new data. Notably, CrossAdapt introduces the first non-iterative embedding transfer strategy, effectively tackling the challenges of heterogeneous architectures and massive embedding table migration. Experiments show AUC improvements of 0.27–0.43% and training time reductions of 43–71% across three public datasets, while significantly mitigating AUC drops, LogLoss increases, and prediction bias in WeChat Channels' tens-of-millions daily active user scenario.
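The summary describes a non-iterative embedding migration step: old embedding tables are mapped to the new architecture's dimensionality by a closed-form projection rather than retrained. The paper's exact projection is not specified here, so the sketch below uses one plausible choice (truncated SVD when shrinking, zero-padding when growing); the function name and the SVD/padding scheme are illustrative assumptions, not the authors' method.

```python
import numpy as np

def project_embeddings(old_emb: np.ndarray, new_dim: int) -> np.ndarray:
    """Map an (n, d_old) embedding table to (n, new_dim) with no iterative
    training -- an illustrative stand-in for dimension-adaptive projection.

    Shrinking: center the table and keep the top `new_dim` principal
    directions (truncated SVD). Growing: copy old dims, zero-init new ones.
    """
    n, d_old = old_emb.shape
    if new_dim <= d_old:
        mean = old_emb.mean(axis=0, keepdims=True)
        # Right singular vectors of the centered table give the principal axes.
        _, _, vt = np.linalg.svd(old_emb - mean, full_matrices=False)
        return (old_emb - mean) @ vt[:new_dim].T
    out = np.zeros((n, new_dim), dtype=old_emb.dtype)
    out[:, :d_old] = old_emb  # preserve learned coordinates as-is
    return out
```

Because the map is a single matrix factorization plus a matrix multiply, it scales to large tables without any gradient steps, which is the property the summary emphasizes.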
Abstract
Deploying new architectures in large-scale user response prediction systems incurs high model switching costs due to expensive retraining on massive historical data and performance degradation under data retention constraints. Existing knowledge distillation methods struggle with architectural heterogeneity and the prohibitive cost of transferring large embedding tables. We propose CrossAdapt, a two-stage framework for efficient cross-architecture knowledge transfer. The offline stage enables rapid embedding transfer via dimension-adaptive projections without iterative training, combined with progressive network distillation and strategic sampling to reduce computational cost. The online stage introduces asymmetric co-distillation, where students update frequently while teachers update infrequently, together with a distribution-aware adaptation mechanism that dynamically balances historical knowledge preservation and fast adaptation to evolving data. Experiments on three public datasets show that CrossAdapt achieves 0.27–0.43% AUC improvements while reducing training time by 43–71%. Large-scale deployment on Tencent WeChat Channels (~10M daily samples) further demonstrates its effectiveness, significantly mitigating AUC degradation, LogLoss increase, and prediction bias compared to standard distillation baselines.
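The abstract's defining online-stage idea is the update-frequency asymmetry: the student takes a gradient step on every batch, while the teacher is refreshed only occasionally. A minimal toy sketch of that schedule, using one-parameter logistic models and a simple label/teacher-blended target (the losses, blend weight `alpha`, and `teacher_interval` are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def asymmetric_co_distill(stream, steps, teacher_interval=10,
                          lr=0.1, alpha=0.5, w_s=0.0, w_t=0.0):
    """Toy asymmetric co-distillation on a stream of (x, y) pairs.

    Every step, the student fits a blend of the true label and the teacher's
    soft prediction; the teacher takes its own gradient step only every
    `teacher_interval` steps. Returns the final (student, teacher) weights.
    """
    for t in range(steps):
        x, y = stream(t)
        p_t = sigmoid(w_t * x)                  # teacher's soft prediction
        target = alpha * y + (1 - alpha) * p_t  # distillation-blended target
        p_s = sigmoid(w_s * x)
        w_s -= lr * (p_s - target) * x          # frequent student update
        if t % teacher_interval == 0:
            w_t -= lr * (p_t - y) * x           # infrequent teacher update
    return w_s, w_t
```

The student still tracks new data quickly (it sees every batch), while the slowly-updated teacher acts as a stable anchor on historical knowledge; the blend weight plays the role the abstract assigns to the distribution-aware balancing mechanism, though here it is a fixed constant rather than adaptive.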