Evaluating Cross-Lingual Classification Approaches Enabling Topic Discovery for Multilingual Social Media Data

📅 2026-02-18
📈 Citations: 0 (influential: 0)
🤖 AI Summary
This study addresses cross-lingual topic discovery in noisy multilingual social media data, using hydrogen energy discourse as a case study. It systematically evaluates four strategies (English classification after translation, language-specific classifiers, multilingual pretrained models, and a hybrid approach) for their ability to filter out irrelevant content so that coherent topics can be extracted. The experiments use over nine million real-world tweets in English, Japanese, Hindi, and Korean, posted between 2013 and 2022. The results reveal trade-offs between translation-based and multilingual methods, highlight the strengths and limitations of each approach under realistic conditions, and offer guidance on optimising pipelines for large-scale multilingual social media analysis.

📝 Abstract
Analysing multilingual social media discourse remains a major challenge in natural language processing, particularly when large-scale public debates span diverse languages. This study investigates how different approaches to cross-lingual text classification can support reliable analysis of global conversations. Using hydrogen energy as a case study, we analyse a decade-long dataset of over nine million tweets in English, Japanese, Hindi, and Korean (2013–2022) for topic discovery. Because the data were collected online by keyword, a significant share of the content is irrelevant. We explore four approaches to filtering relevant content: (1) translating English annotated data into the target languages to build a language-specific model for each, (2) translating unlabelled data from all languages into English to build a single model based on English annotations, (3) applying multilingual transformers fine-tuned on English directly to the data in each target language, and (4) a hybrid strategy that combines translated annotations with multilingual training. Each approach is evaluated for its ability to filter hydrogen-related tweets from noisy keyword-based collections. Subsequently, topic modelling is performed to extract dominant themes within the relevant subsets. The results highlight key trade-offs between translation-based and multilingual approaches, offering actionable insights into optimising cross-lingual pipelines for large-scale social media analysis.
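
As a rough illustration of approach (2), the sketch below translates each collected tweet into English and applies a single relevance classifier trained only on English annotations. Everything in it is an assumption made for illustration: `TOY_LEXICON` and `toy_translate` are hypothetical stand-ins for a real machine translation system, the bag-of-words naive Bayes classifier is a simple substitute for whatever model the paper actually uses, and the example tweets and labels are invented, not taken from the paper's dataset.

```python
import math
from collections import Counter

# Hypothetical stand-in for machine translation: a tiny Korean-to-English lookup.
TOY_LEXICON = {
    "수소": "hydrogen", "연료": "fuel", "에너지": "energy",
    "과산화수소": "hydrogen peroxide", "머리": "hair", "염색": "dye",
}

def toy_translate(text):
    """Translate token by token; unknown tokens (e.g. English) pass through."""
    return " ".join(TOY_LEXICON.get(tok, tok) for tok in text.split())

def train(examples):
    """Fit a bag-of-words naive Bayes model on (english_text, is_relevant) pairs."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocab

def is_relevant(model, english_text):
    """Compare log-probabilities of the two classes with add-one smoothing."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in (True, False):
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for word in english_text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocab))
                )
        scores[label] = score
    return scores[True] > scores[False]

def filter_relevant(model, tweets):
    """Translate each tweet to English, then keep those classified as relevant."""
    return [t for t in tweets if is_relevant(model, toy_translate(t))]

# Invented English annotations: energy discourse versus keyword noise.
english_annotations = [
    ("hydrogen fuel cell cars cut emissions", True),
    ("green hydrogen energy plant opens", True),
    ("hydrogen peroxide hair bleach tips", False),
    ("hydrogen peroxide cleans wounds", False),
]
model = train(english_annotations)

# A keyword-based collection matches "hydrogen" in all three tweets,
# but only two of them concern hydrogen energy.
collected = ["수소 연료 에너지", "과산화수소 머리 염색", "hydrogen energy plant"]
kept = filter_relevant(model, collected)
print(kept)
```

The other approaches rearrange the same pieces: approach (1) would translate `english_annotations` into each target language and train one model per language, while approaches (3) and (4) would replace the toy classifier with a multilingual model and skip translation of the tweets themselves.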
Problem

Research questions and friction points this paper is trying to address.

cross-lingual classification
topic discovery
multilingual social media
noisy data filtering
natural language processing
Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-lingual classification
multilingual transformers
topic discovery
social media analysis
translation-based filtering
Deepak Uniyal
Centre for Data Science, School of Computer Science, Queensland University of Technology, Brisbane, QLD 4000, Australia
Md Abul Bashar
Postdoctoral Research Fellow
AI and Machine Learning
Richi Nayak
Professor, Queensland University of Technology
Data Mining, Pattern Mining, Personalisation, Text Mining, XML