🤖 AI Summary
This work addresses the limitations of two-tower recommendation models. Because the two towers are architecturally isolated, these models suffer from restricted representational capacity, insufficient embedding-space alignment, and no cross-tower feature interaction, which makes it hard to achieve high effectiveness under low-latency constraints. To overcome these issues, the authors propose CS3, a framework that strengthens two-tower models in online learning settings within millisecond-level latency budgets. CS3 combines three mechanisms: cycle-adaptive feature denoising within each tower, lightweight cross-tower synchronization based on mutual awareness, and cascade-model sharing that reuses knowledge from downstream stages. Designed as a plug-and-play module, CS3 is compatible with diverse two-tower backbones and with online learning. Experiments on three public datasets show consistent gains over strong baselines, and deployment in a large-scale advertising system yields up to 8.36% revenue improvement across three scenarios while maintaining millisecond-level latency.
📝 Abstract
To balance effectiveness and efficiency in recommender systems, multi-stage pipelines commonly use lightweight two-tower models for large-scale candidate retrieval. However, the isolated two-tower architecture restricts representation capacity, embedding-space alignment, and cross-feature interactions. Existing solutions such as late interaction and knowledge distillation can mitigate these issues, but often increase latency or are difficult to deploy in online learning settings. We propose Capability Synergy (CS3), an efficient online framework that strengthens two-tower retrievers while preserving real-time constraints. CS3 introduces three mechanisms: (1) Cycle-Adaptive Structure for self-revision via adaptive feature denoising within each tower; (2) Cross-Tower Synchronization to improve alignment through lightweight mutual awareness between towers; and (3) Cascade-Model Sharing to enhance cross-stage consistency by reusing knowledge from downstream models. CS3 is plug-and-play with diverse two-tower backbones and compatible with online learning. Experiments on three public datasets show consistent gains over strong baselines, and deployment in a large-scale advertising system yields up to 8.36% revenue improvement across three scenarios while maintaining ms-level latency.
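For readers unfamiliar with the baseline, the following is a minimal sketch of a generic two-tower retriever, not of CS3 itself. It is meant only to illustrate the isolation the abstract refers to: each tower encodes its own features independently, and the two sides meet only in a final dot product. All names, dimensions, and the toy one-layer towers (`make_tower`, `retrieve`, the random catalog) are assumptions made for illustration and do not come from the paper.

```python
import math
import random


def make_tower(in_dim, out_dim, seed):
    """Build a toy one-layer tower: a fixed random linear map followed by tanh."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]

    def tower(features):
        # Each tower sees only its own side's features.
        return [math.tanh(sum(w * x for w, x in zip(row, features))) for row in weights]

    return tower


# Hypothetical towers with different input sizes but a shared embedding size.
user_tower = make_tower(in_dim=4, out_dim=3, seed=0)
item_tower = make_tower(in_dim=5, out_dim=3, seed=1)


def score(user_vec, item_vec):
    """The only point where the two towers interact: a dot product."""
    return sum(u * v for u, v in zip(user_vec, item_vec))


def retrieve(user_features, item_index, k):
    """Encode the user once, then rank precomputed item embeddings by score."""
    user_vec = user_tower(user_features)
    ranked = sorted(
        item_index,
        key=lambda item_id: score(user_vec, item_index[item_id]),
        reverse=True,
    )
    return ranked[:k]


# Item embeddings can be computed ahead of time because the item tower
# never looks at user features; this is what keeps serving latency low.
rng = random.Random(42)
catalog = {f"item_{i}": [rng.uniform(-1, 1) for _ in range(5)] for i in range(10)}
item_index = {item_id: item_tower(feats) for item_id, feats in catalog.items()}

top3 = retrieve([0.2, -0.5, 0.9, 0.1], item_index, k=3)
```

In this sketch, no user feature can influence an item embedding or vice versa, which is the efficiency advantage and also the limitation that the three CS3 mechanisms are described as targeting.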