🤖 AI Summary
The surge in multilingual news content and the high cost of manual annotation hinder scalable media-topic classification for low-resource languages. Method: We propose an annotation-free, large language model (LLM)-driven teacher-student framework for IPTC Media Topic classification in Slovenian, Croatian, Greek, and Catalan. Using a GPT-series teacher with automated prompt engineering and consistency filtering, we construct high-quality multilingual training data in a zero-shot manner; a multilingual BERT-based student model is then fine-tuned on these labels via knowledge distillation for efficient deployment. Contribution/Results: The teacher's zero-shot agreement with human annotators is comparable to the agreement between the annotators themselves; the distilled student attains near-teacher performance with only a few thousand training examples and demonstrates strong zero-shot cross-lingual generalization. To our knowledge, this is the first open-source multilingual classifier covering all top-level categories of the IPTC Media Topic taxonomy, offering an extensible recipe for news understanding in low-resource languages.
📝 Abstract
With the ever-increasing number of news stories available online, classifying them by topic, regardless of the language they are written in, has become crucial for enhancing readers’ access to relevant content. To address this challenge, we propose a teacher-student framework based on large language models (LLMs) for developing multilingual news topic classification models of reasonable size with no need for manual data annotation. The framework employs a Generative Pretrained Transformer (GPT) model as the teacher to build a news topic training dataset through automatic annotation of 20,000 news articles in Slovenian, Croatian, Greek, and Catalan. Articles are classified into the 17 main categories of the Media Topic schema, developed by the International Press Telecommunications Council (IPTC). The teacher model exhibits high zero-shot performance in all four languages, and its agreement with human annotators is comparable to the agreement between the human annotators themselves. To meet the computational demands of processing millions of texts daily, smaller BERT-like student models are fine-tuned on the GPT-annotated dataset. These student models achieve performance comparable to the teacher model. Furthermore, we explore the impact of training data size on student performance and investigate the models' monolingual, multilingual, and zero-shot cross-lingual capabilities. The findings indicate that student models can achieve high performance with a relatively small number of training instances and demonstrate strong zero-shot cross-lingual abilities. Finally, we publish the best-performing news topic classifier, enabling multilingual classification with the top-level categories of the IPTC Media Topic schema.
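The teacher-side annotation step described above can be sketched in a few lines. This is a minimal, hypothetical illustration, not the paper's implementation: `toy_teacher` stands in for a real GPT API call, the category list is an illustrative subset of the 17 IPTC Media Topic top-level labels, and the consistency-filtering rule (keep an article only when repeated teacher runs agree on its label) follows the summary's description.

```python
# Illustrative subset of IPTC Media Topic top-level categories (hypothetical).
CATEGORIES = ["politics", "sport", "economy, business and finance", "health"]

def build_prompt(article: str) -> str:
    """Compose a zero-shot classification prompt for the teacher LLM."""
    labels = "; ".join(CATEGORIES)
    return (
        "Classify the news article into exactly one IPTC Media Topic "
        f"category from: {labels}.\nArticle: {article}\nCategory:"
    )

def consistency_filter(articles, teacher, n_runs=2):
    """Label each article n_runs times; keep it only when all runs agree,
    discarding unstable teacher annotations."""
    kept = []
    for article in articles:
        labels = {teacher(build_prompt(article)) for _ in range(n_runs)}
        if len(labels) == 1:  # all runs produced the same label
            kept.append((article, labels.pop()))
    return kept

def toy_teacher(prompt: str) -> str:
    """Deterministic stand-in for a GPT call, for demonstration only."""
    return "sport" if "match" in prompt else "politics"

data = [
    "The match ended 2-1 after extra time.",
    "Parliament passed the new budget bill.",
]
train_set = consistency_filter(data, toy_teacher)
print(train_set)
```

The filtered `(article, label)` pairs would then serve as training data for fine-tuning the smaller BERT-like student model.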