🤖 AI Summary
This study investigates whether lightweight fine-tuning of large language models (LLMs) on low-resource languages enables cross-lingual transfer of immigration-related topic identification to languages that were encountered during pretraining but not seen during fine-tuning.
Method: We propose a low-resource cross-lingual adaptation framework based on LLaMA 3.2-3B for culturally sensitive, polarized immigration-related tweets, employing LoRA and 4-bit quantization to achieve robust topic classification with minimal monolingual or multilingual annotations.
Contribution/Results: Lightweight fine-tuning effectively mitigates pretrained language bias, yielding robust topic classification across 13 languages. Multilingual fine-tuning substantially improves stance-detection performance, and the released 4-bit-quantized, LoRA fine-tuned models run 35× faster than GPT-4o at just 0.00000989% of its dollar cost. Our findings challenge the prevailing assumption that cross-lingual capability necessitates large-scale multilingual training. All models and code are publicly released.
📝 Abstract
Large language models (LLMs) are transforming social-science research by enabling scalable, precise analysis. Their adaptability raises the question of whether knowledge acquired through fine-tuning in a few languages can transfer to unseen languages that only appeared during pre-training. To examine this, we fine-tune lightweight LLaMA 3.2-3B models on monolingual, bilingual, or multilingual data sets to classify immigration-related tweets from X/Twitter across 13 languages, a domain characterised by polarised, culturally specific discourse. We evaluate whether minimal language-specific fine-tuning enables cross-lingual topic detection and whether adding targeted languages corrects pre-training biases. Results show that LLMs fine-tuned in one or two languages can reliably classify immigration-related content in unseen languages. However, identifying whether a tweet expresses a pro- or anti-immigration stance benefits from multilingual fine-tuning. Pre-training bias favours dominant languages, but even minimal exposure to under-represented languages during fine-tuning (as little as $9.62 \times 10^{-11}$ of the original pre-training token volume) yields significant gains. These findings challenge the assumption that cross-lingual mastery requires extensive multilingual training: limited language coverage suffices for topic-level generalisation, and structural biases can be corrected with lightweight interventions. By releasing 4-bit-quantised, LoRA fine-tuned models, we provide an open-source, reproducible alternative to proprietary LLMs that delivers 35 times faster inference at just 0.00000989% of the dollar cost of the OpenAI GPT-4o model, enabling scalable, inclusive research.