🤖 AI Summary
To address the scarcity of high-quality annotations, insufficient cross-modal fusion, and underutilization of audio information in short-video multimodal recommendation, this paper proposes two core techniques: (1) kNN-based Latent Space Broadening (LSB), which enhances the robustness of active learning under low-resource conditions; and (2) Vision-Language Modeling with Audio Enhancement (VLMAE), which incorporates audio features into a pre-trained vision-language backbone through a lightweight mid-fusion design, enabling efficient audio distillation without retraining the backbone from scratch. The approach integrates Transformer-based multimodal modeling, latent-space perturbation and alignment, and kNN retrieval. Evaluated in industrial-scale recommendation and advertising systems, it significantly improves the high-quality playback rate and advertising revenue. The results demonstrate the effectiveness, scalability, and engineering practicality of both the audio-enhancement and data-expansion strategies.
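The paper's exact LSB procedure is not spelled out in this summary, but the core idea, broadening an uncertainty-based active-learning selection with nearest neighbors retrieved in the model's latent space, can be sketched as follows. All names (`embeddings`, `uncertainty`, `n_seeds`, `k`) are illustrative assumptions, not the paper's API.

```python
# A minimal sketch of kNN-based Latent Space Broadening (LSB) for active
# learning, under the assumptions stated above.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def lsb_select(embeddings, uncertainty, n_seeds=100, k=10):
    """Broaden an uncertainty-based AL seed set with latent-space neighbors.

    embeddings : (N, D) array of latent representations from the model.
    uncertainty: (N,) array of per-sample scores, e.g. entropy or margin.
    Returns indices of the seeds plus their k nearest neighbors, deduplicated.
    """
    # 1) Classic AL step: pick the most uncertain unlabeled samples.
    seeds = np.argsort(-uncertainty)[:n_seeds]

    # 2) LSB step: retrieve k nearest neighbors of each seed in latent space,
    #    surfacing semantically similar items (including overconfident
    #    misclassifications) that the uncertainty score alone may have missed.
    knn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(embeddings)
    _, nbrs = knn.kneighbors(embeddings[seeds])

    # Union of seeds and neighbors; each seed's first neighbor is itself.
    return np.unique(np.concatenate([seeds, nbrs[:, 1:].ravel()]))
```

The kNN retrieval step is what targets the stated weakness of purely statistical AL: items that sit close to an uncertain seed in latent space are candidates for labeling even when the model scores them confidently.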
📝 Abstract
Transformer-based multimodal models are widely used in industrial-scale recommendation, search, and advertising systems for content understanding and relevance ranking. Enhancing labeled training data quality and cross-modal fusion significantly improves model performance, influencing key metrics such as quality view rates and ad revenue. High-quality annotations are crucial for advancing content modeling, yet traditional statistics-based active learning (AL) methods face limitations: they struggle to detect overconfident misclassifications and are less effective at distinguishing semantically similar items in deep neural networks. Additionally, audio information plays an increasing role, especially on short-video platforms, yet most pre-trained multimodal architectures focus primarily on text and images. While training from scratch across all three modalities is possible, it sacrifices the benefits of leveraging existing pre-trained vision-language (VL) and audio models. To address these challenges, we propose kNN-based Latent Space Broadening (LSB) to enhance AL efficiency and Vision-Language Modeling with Audio Enhancement (VLMAE), a mid-fusion approach integrating audio into VL models. This system has been deployed in production, leading to significant business gains.
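As a rough illustration of the mid-fusion idea, the sketch below injects tokens from a frozen audio encoder into a frozen pre-trained VL backbone through a small trainable cross-attention adapter, so the backbone's pre-training is preserved. The module name, shapes, and hyperparameters are assumptions for illustration; the paper's actual VLMAE architecture may differ.

```python
# A minimal sketch of VLMAE-style mid-fusion: a frozen VL backbone and a
# frozen audio encoder, bridged by a lightweight trainable adapter.
import torch
import torch.nn as nn

class AudioMidFusionAdapter(nn.Module):
    def __init__(self, d_model=768, n_heads=8, d_audio=512):
        super().__init__()
        self.audio_proj = nn.Linear(d_audio, d_model)  # map audio dim -> VL dim
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, vl_tokens, audio_tokens):
        # vl_tokens:    (B, T_vl, d_model), taken from a mid layer of the VL backbone.
        # audio_tokens: (B, T_a, d_audio), outputs of a frozen audio encoder.
        a = self.audio_proj(audio_tokens)
        fused, _ = self.cross_attn(query=vl_tokens, key=a, value=a)
        # Residual connection keeps the frozen backbone's features intact
        # early in training, before the adapter has learned useful attention.
        return self.norm(vl_tokens + fused)

# Usage: freeze the VL and audio encoders; train only the adapter (and the
# task head), so the benefits of the pre-trained models are retained.
adapter = AudioMidFusionAdapter()
vl = torch.randn(2, 50, 768)   # mid-layer VL tokens (hypothetical shapes)
aud = torch.randn(2, 30, 512)  # audio encoder outputs
out = adapter(vl, aud)         # (2, 50, 768)
```

Training only the adapter is one plausible reading of "lightweight" mid-fusion: it keeps the parameter count and compute of the update small relative to retraining a tri-modal model from scratch.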