🤖 AI Summary
Traditional machine learning struggles with fine-grained semantic understanding of short videos—e.g., mood and emotion—due to lengthy development cycles and limited representational capacity. To address this, we propose an "LLM-as-annotator" paradigm that leverages large language models as scalable, high-accuracy automatic annotation engines. Our approach fuses multimodal features, incorporates inference optimization and knowledge distillation, and enables high-quality offline batch annotation of fine-grained video attributes. Through iterative definition–evaluation cycles, it surpasses human annotation quality in offline evaluation. When integrated into a personalized retrieval system, online A/B testing demonstrates significant improvements in user engagement and satisfied-consumption behaviors. This work presents the first systematic validation of LLMs' annotation efficacy and deployment feasibility for industrial-scale short-video understanding, offering both methodological innovation and engineering scalability.
📝 Abstract
This paper presents a case study on deploying Large Language Models (LLMs) as an advanced "annotation" mechanism to achieve nuanced content understanding (e.g., discerning content "vibe") at scale within a large-scale industrial short-form video recommendation system. Traditional machine learning classifiers for content understanding face protracted development cycles and lack deep, nuanced comprehension. The "LLM-as-annotator" approach addresses these challenges by significantly shortening development time and enabling the annotation of subtle attributes. This work details an end-to-end workflow encompassing: (1) iterative definition and robust evaluation of target attributes, refined via offline metrics and online A/B testing; (2) scalable offline bulk annotation of video corpora using LLMs with multimodal features, optimized inference, and knowledge distillation for broad application; and (3) integration of these rich annotations into the online recommendation serving system, for example, through personalized restricted retrieval. Experimental results demonstrate the efficacy of this approach: LLMs outperform human raters in offline annotation quality for nuanced attributes and yield significant improvements in user engagement and satisfied consumption in online A/B tests. The study provides insights into designing and scaling production-level LLM pipelines for rich content evaluation, highlighting the adaptability and benefits of LLM-generated nuanced understanding for enhancing content discovery, user satisfaction, and the overall effectiveness of modern recommendation systems.
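To make step (2) of the workflow concrete, the following is a minimal sketch of an offline bulk-annotation loop: multimodal text features (title, ASR transcript, frame captions) are fused into a prompt, an LLM classifies each video against a fixed attribute taxonomy, and invalid outputs fall back to a default label. The taxonomy, the `Video` fields, and the `fake_llm` stub are all illustrative assumptions; the paper's actual attribute definitions, prompts, and model endpoints are not specified here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical "vibe" taxonomy; the paper's real attribute definitions are not public.
VIBE_LABELS = ["cozy", "energetic", "melancholic", "humorous", "neutral"]

@dataclass
class Video:
    video_id: str
    title: str
    asr_transcript: str        # speech-to-text from the audio track
    frame_captions: List[str]  # captions of sampled frames (visual modality)

def build_prompt(video: Video) -> str:
    """Fuse multimodal text features into a single annotation prompt."""
    return (
        "Classify the overall vibe of this short video.\n"
        f"Allowed labels: {', '.join(VIBE_LABELS)}\n"
        f"Title: {video.title}\n"
        f"Transcript: {video.asr_transcript}\n"
        f"Frame captions: {' | '.join(video.frame_captions)}\n"
        "Answer with exactly one label."
    )

def annotate_corpus(videos: List[Video], llm: Callable[[str], str]) -> Dict[str, str]:
    """Offline bulk annotation: one LLM call per video, validated against the taxonomy."""
    annotations = {}
    for v in videos:
        raw = llm(build_prompt(v)).strip().lower()
        # Fall back to a default label when the model output is not in the taxonomy.
        annotations[v.video_id] = raw if raw in VIBE_LABELS else "neutral"
    return annotations

# Stub standing in for a real (possibly distilled) model endpoint.
def fake_llm(prompt: str) -> str:
    return "cozy" if "fireplace" in prompt else "energetic"

corpus = [
    Video("v1", "Rainy night by the fireplace", "just relaxing tonight",
          ["a lit fireplace", "rain on a window"]),
    Video("v2", "Morning HIIT workout", "let's go, ten burpees!",
          ["person jumping", "gym interior"]),
]
labels = annotate_corpus(corpus, fake_llm)
# labels == {"v1": "cozy", "v2": "energetic"}
```

In production the stub would be replaced by batched calls to an optimized inference service, with a distilled smaller model handling the long tail of the corpus, as the abstract describes.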