Evaluating Cross-Lingual Classification Approaches Enabling Topic Discovery for Multilingual Social Media Data

📅 2026-02-18

🤖 AI Summary

This study addresses cross-lingual topic discovery in noisy multilingual social media data, using hydrogen energy discourse as a case study. It systematically evaluates four strategies (language-specific classifiers trained on translated English annotations, translation of all data into English for a single English classifier, English fine-tuned multilingual pretrained models, and a hybrid approach) on their ability to filter irrelevant content so that coherent topics can be extracted. Experiments on over nine million real-world tweets in English, Japanese, Hindi, and Korean, collected from 2013 to 2022, reveal trade-offs between translation-based and multilingual methods. The findings highlight the strengths and limitations of each approach under realistic conditions and offer guidance on optimising cross-lingual pipelines for large-scale multilingual social media analysis.

📝 Abstract

Analysing multilingual social media discourse remains a major challenge in natural language processing, particularly when large-scale public debates span diverse languages. This study investigates how different approaches to cross-lingual text classification can support reliable analysis of global conversations. Using hydrogen energy as a case study, we analyse a dataset of over nine million tweets in English, Japanese, Hindi, and Korean, covering a decade (2013–2022), for topic discovery. Because the data were collected online by keyword, a significant share of the content is irrelevant. We explore four approaches to filtering relevant content: (1) translating English annotated data into the target languages to build a language-specific model for each target language, (2) translating unlabelled data from all languages into English to build a single model based on English annotations, (3) applying English fine-tuned multilingual transformers directly to the data in each target language, and (4) a hybrid strategy that combines translated annotations with multilingual training. Each approach is evaluated for its ability to filter hydrogen-related tweets from noisy keyword-based collections. Subsequently, topic modelling is performed to extract dominant themes within the relevant subsets. The results highlight key trade-offs between translation and multilingual approaches, offering actionable insights into optimising cross-lingual pipelines for large-scale social media analysis.
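
To make approach (2) concrete, the following is a minimal sketch of a translate-then-classify relevance filter. It is not the authors' implementation: the word-level lexicon stands in for a real machine translation system, a small Naive Bayes model stands in for whatever classifier the study trains, and the training examples, `TOY_LEXICON`, and all function names are invented for illustration.

```python
import math
from collections import Counter

# Toy English annotations (invented): 1 = hydrogen energy, 0 = irrelevant.
TRAIN = [
    ("hydrogen fuel cell cars cut emissions", 1),
    ("new green hydrogen plant announced", 1),
    ("hydrogen peroxide for hair bleaching", 0),
    ("hydrogen water bottle discount today", 0),
]

# Hypothetical stand-in for machine translation: a word-level lookup table.
TOY_LEXICON = {
    "水素": "hydrogen", "燃料": "fuel", "電池": "cell", "車": "cars",
    "수소": "hydrogen", "물": "water", "할인": "discount",
}


def translate_to_english(text):
    """Map known tokens to English and pass unknown tokens through."""
    return " ".join(TOY_LEXICON.get(tok, tok) for tok in text.split())


def train_naive_bayes(examples):
    """Collect per-class word counts for a multinomial Naive Bayes model."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter()
    for text, label in examples:
        class_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocab


def predict(model, english_text):
    """Return the most probable label, using add-one smoothing."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log(class_counts[label] / total)
        for tok in english_text.lower().split():
            if tok in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][tok] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


def filter_relevant(model, tweets):
    """Translate each tweet to English, then keep those predicted relevant."""
    return [t for t in tweets if predict(model, translate_to_english(t)) == 1]


MODEL = train_naive_bayes(TRAIN)
```

Calling `filter_relevant(MODEL, ["水素 燃料 電池 車", "수소 물 할인"])` keeps only the first tweet, since the second translates to a phrase closer to the irrelevant training examples. Approach (1) would invert the translation step: the annotated English examples are translated into each target language and one model is trained per language.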

Problem

Research questions and friction points this paper is trying to address.

cross-lingual classification

topic discovery

multilingual social media

noisy data filtering

natural language processing

Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-lingual classification

multilingual transformers

topic discovery