AraReasoner: Evaluating Reasoning-Based LLMs for Arabic NLP

📅 2025-06-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenges posed by Arabic's morphological complexity, dialectal diversity, and script-specific characteristics through AraReasoner, a comprehensive benchmarking study of reasoning-oriented large language models (LLMs) on Arabic, covering fifteen NLP tasks. Methodologically, it combines zero-shot and few-shot prompting with parameter-efficient LoRA fine-tuning in a multi-task evaluation framework, systematically comparing reasoning-focused architectures, notably the DeepSeek models, against a GPT o4-mini baseline. Key findings are: (1) a reasoning-focused LLM benchmark study for Arabic across fifteen tasks; (2) carefully selected three-shot examples deliver an average uplift of over 13 F1 points on classification tasks, raising sentiment analysis from 35.3% to 87.5% F1 and paraphrase detection from 56.1% to 87.0%; (3) DeepSeek architectures outperform GPT o4-mini by an average of 12 F1 points on zero-shot complex inference; and (4) LoRA-based fine-tuning yields up to an additional 8 points in F1 and BLEU compared to equivalent increases in model scale.
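The few-shot result above rests on assembling a small number of carefully chosen in-context examples into the prompt. A minimal sketch of that assembly step, assuming a plain text-completion interface; the example Arabic texts, labels, and helper name are illustrative, not taken from the paper's datasets:

```python
# Hypothetical sketch of three-shot prompt construction for Arabic sentiment
# classification. Shots and labels below are made-up illustrations.

def build_few_shot_prompt(examples, query):
    """Format (text, label) pairs plus an unlabeled query into one prompt."""
    lines = ["Classify the sentiment of the Arabic text as positive or negative."]
    for text, label in examples:
        lines.append(f"Text: {text}\nSentiment: {label}")
    # The query ends with an open "Sentiment:" slot for the model to complete.
    lines.append(f"Text: {query}\nSentiment:")
    return "\n\n".join(lines)

shots = [
    ("هذا المنتج رائع", "positive"),
    ("تجربة سيئة للغاية", "negative"),
    ("الخدمة ممتازة وسريعة", "positive"),
]
prompt = build_few_shot_prompt(shots, "لم يعجبني الفيلم")
print(prompt.count("Sentiment:"))  # 4: three labeled shots plus the query slot
```

The paper's finding is that which three examples are chosen matters; this sketch only covers the formatting, not the selection strategy.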

📝 Abstract
Large language models (LLMs) have shown remarkable progress in reasoning abilities and general natural language processing (NLP) tasks, yet their performance on Arabic data, characterized by rich morphology, diverse dialects, and complex script, remains underexplored. This paper presents a comprehensive benchmarking study of multiple reasoning-focused LLMs, with a special emphasis on the newly introduced DeepSeek models, across a suite of fifteen Arabic NLP tasks. We experiment with various strategies, including zero-shot, few-shot, and fine-tuning. This allows us to systematically evaluate performance on datasets covering a range of applications and to examine the models' capacity for linguistic reasoning under different levels of complexity. Our experiments reveal several key findings. First, carefully selecting just three in-context examples delivers an average uplift of over 13 F1 points on classification tasks, boosting sentiment analysis from 35.3% to 87.5% and paraphrase detection from 56.1% to 87.0%. Second, reasoning-focused DeepSeek architectures outperform a strong GPT o4-mini baseline by an average of 12 F1 points on complex inference tasks in the zero-shot setting. Third, LoRA-based fine-tuning yields up to an additional 8 points in F1 and BLEU compared to equivalent increases in model scale. The code is available at https://anonymous.4open.science/r/AraReasoner41299
Problem

Research questions and friction points this paper is trying to address.

Evaluating reasoning-based LLMs for Arabic NLP tasks
Assessing performance on Arabic data with complex linguistic features
Comparing strategies like zero-shot, few-shot, and fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmarking reasoning-focused LLMs on Arabic NLP tasks
Using zero-shot, few-shot, and fine-tuning strategies
DeepSeek models outperform a GPT o4-mini baseline in the zero-shot setting
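The LoRA fine-tuning contribution reduces to a low-rank additive update: the frozen weight W is augmented by a product B·A scaled by alpha/r, and only A and B are trained. A minimal sketch of that forward pass; dimensions, initialisation scales, and rank are illustrative assumptions, not the paper's settings:

```python
import numpy as np

# Hypothetical LoRA layer sketch: y = W x + (alpha/r) * B A x.
# W is frozen; only the low-rank factors A and B would be updated.
rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 8, 8, 2, 16  # illustrative dimensions and rank

W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection, zero-init

def lora_forward(x):
    # Base layer output plus the scaled low-rank correction.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B initialised to zero, the adapted layer matches the base layer exactly,
# so fine-tuning starts from the pretrained behaviour.
assert np.allclose(lora_forward(x), W @ x)
```

Zero-initialising B is the standard choice because it makes the adapter a no-op at the start of training; the paper's reported gains come from training such adapters rather than scaling the base model.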
Ahmed Hasanaath
Department of Information and Computer Science, King Fahd University of Petroleum and Minerals, Saudi Arabia
Aisha Alansari
Graduate Assistant, Information and Computer Science Department, KFUPM
Machine Learning · Natural Language Processing · Deep Learning · LLMs
Ahmed Ashraf
Department of Information and Computer Science, King Fahd University of Petroleum and Minerals, Saudi Arabia
Chafik Salmane
Mohammed VI Polytechnic University, Ben Guerir, Morocco
H. Luqman
SDAIA-KFUPM Joint Research Center for Artificial Intelligence, King Fahd University of Petroleum and Minerals, Saudi Arabia
Saad Ezzini
Department of Information and Computer Science, King Fahd University of Petroleum and Minerals, Saudi Arabia