LlamaLens: Specialized Multilingual LLM for Analyzing News and Social Media Content

📅 2024-10-20
🏛️ arXiv.org
🤖 AI Summary
General-purpose large language models exhibit limited performance on multilingual news and social media content analysis, particularly for low-resource languages and domain-specific tasks. To address this, we introduce LlamaLens, a multilingual, domain-specialized large language model supporting Arabic, English, and Hindi, and the first to integrate both domain specificity (news and social media) and multilingual capability. Methodologically, we build on the Llama architecture and propose a systematic instruction-tuning framework spanning 18 NLP tasks across 52 datasets, incorporating multilingual mixed sampling, domain-adapted prompt engineering, and unified task formatting. The model achieves state-of-the-art results on 23 of 31 evaluation benchmarks and matches the SOTA on the remaining eight. All model weights, training code, and curated datasets are publicly released on Hugging Face.
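The "unified task formatting" and "multilingual mixed sampling" steps can be sketched as follows. This is an illustrative reconstruction, not the paper's released code: the record fields, prompt template, and bucket-uniform sampling scheme are all assumptions about how such a pipeline is typically built.

```python
import random

def to_instruction_record(task, lang, text, label):
    """Map one labeled example to a single chat-style instruction record,
    so heterogeneous tasks (sentiment, factuality, ...) share one format."""
    prompt = (
        f"Task: {task}\n"
        f"Language: {lang}\n"
        f"Input: {text}\n"
        f"Answer with the label only."
    )
    return {"instruction": prompt, "output": label}

def mixed_sample(datasets, k, seed=0):
    """Draw k training records, first picking a (task, language) bucket
    uniformly, then an example within it, so low-resource languages are
    not drowned out by larger datasets."""
    rng = random.Random(seed)
    buckets = list(datasets.values())  # each value: list of example dicts
    return [
        to_instruction_record(**rng.choice(rng.choice(buckets)))
        for _ in range(k)
    ]
```

A usage note: `datasets` here is assumed to be a dict keyed by `(task, language)` pairs, each mapping to a list of `{"task", "lang", "text", "label"}` dicts; the actual LlamaLens data layout may differ.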

📝 Abstract
Large Language Models (LLMs) have demonstrated remarkable success as general-purpose task solvers across various fields. However, their capabilities remain limited when addressing domain-specific problems, particularly in downstream NLP tasks. Research has shown that models fine-tuned on instruction-based downstream NLP datasets outperform those that are not fine-tuned. While most efforts in this area have primarily focused on resource-rich languages like English and broad domains, little attention has been given to multilingual settings and specific domains. To address this gap, this study focuses on developing a specialized LLM, LlamaLens, for analyzing news and social media content in a multilingual context. To the best of our knowledge, this is the first attempt to tackle both domain specificity and multilinguality, with a particular focus on news and social media. Our experimental setup includes 18 tasks, represented by 52 datasets covering Arabic, English, and Hindi. We demonstrate that LlamaLens outperforms the current state-of-the-art (SOTA) on 23 testing sets, and achieves comparable performance on 8 sets. We make the models and resources publicly available for the research community (https://huggingface.co/collections/QCRI/llamalens-672f7e0604a0498c6a2f0fe9).
Problem

Research questions and friction points this paper is trying to address.

Develop LlamaLens for multilingual content analysis
Address domain-specific NLP tasks in news
Improve multilingual social media analysis performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Specialized LLM for multilingual analysis
Focus on news and social media
Outperforms SOTA on multiple datasets
Mohamed Bayan Kmainasi
Qatar University, Qatar
Ali Ezzat Shahroor
Qatar Computing Research Institute, Qatar
Maram Hasanain
Postdoc, Qatar Computing Research Institute, HBKU
Information Retrieval · Social Media
Sahinur Rahman Laskar
UPES, India
Naeemul Hassan
University of Maryland, USA
Firoj Alam
Qatar Computing Research Institute, Qatar