AraReasoner: Evaluating Reasoning-Based LLMs for Arabic NLP

📅 2025-06-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the challenges posed by Arabic's morphological complexity, dialectal diversity, and script-specific characteristics through AraReasoner, a comprehensive benchmarking study of reasoning-oriented large language models (LLMs) on Arabic, covering fifteen NLP tasks. Methodologically, it combines zero-shot and few-shot prompting with parameter-efficient LoRA fine-tuning in a multi-task evaluation framework, systematically comparing reasoning-focused architectures, notably the DeepSeek models, against a GPT o4-mini baseline. Key findings are: (1) a reasoning-focused LLM benchmark study for Arabic across fifteen tasks; (2) carefully selected three-shot examples deliver an average uplift of over 13 F1 points on classification tasks, raising sentiment analysis from 35.3% to 87.5% F1 and paraphrase detection from 56.1% to 87.0%; (3) DeepSeek architectures outperform GPT o4-mini by an average of 12 F1 points on zero-shot complex inference; and (4) LoRA-based fine-tuning yields up to an additional 8 points in F1 and BLEU compared to equivalent increases in model scale.
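The few-shot result above rests on assembling a small number of carefully chosen in-context examples into the prompt. A minimal sketch of that assembly step, assuming a plain text-completion interface; the example Arabic texts, labels, and helper name are illustrative, not taken from the paper's datasets:

```python
# Hypothetical sketch of three-shot prompt construction for Arabic sentiment
# classification. Shots and labels below are made-up illustrations.

def build_few_shot_prompt(examples, query):
    """Format (text, label) pairs plus an unlabeled query into one prompt."""
    lines = ["Classify the sentiment of the Arabic text as positive or negative."]
    for text, label in examples:
        lines.append(f"Text: {text}\nSentiment: {label}")
    # The query ends with an open "Sentiment:" slot for the model to complete.
    lines.append(f"Text: {query}\nSentiment:")
    return "\n\n".join(lines)

shots = [
    ("هذا المنتج رائع", "positive"),
    ("تجربة سيئة للغاية", "negative"),
    ("الخدمة ممتازة وسريعة", "positive"),
]
prompt = build_few_shot_prompt(shots, "لم يعجبني الفيلم")
print(prompt.count("Sentiment:"))  # 4: three labeled shots plus the query slot
```

The paper's finding is that which three examples are chosen matters; this sketch only covers the formatting, not the selection strategy.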

📝 Abstract
Large language models (LLMs) have shown remarkable progress in reasoning abilities and general natural language processing (NLP) tasks, yet their performance on Arabic data, characterized by rich morphology, diverse dialects, and complex script, remains underexplored. This paper presents a comprehensive benchmarking study of multiple reasoning-focused LLMs, with a special emphasis on the newly introduced DeepSeek models, across a suite of fifteen Arabic NLP tasks. We experiment with various strategies, including zero-shot, few-shot, and fine-tuning. This allows us to systematically evaluate performance on datasets covering a range of applications and to examine the models' capacity for linguistic reasoning under different levels of complexity. Our experiments reveal several key findings. First, carefully selecting just three in-context examples delivers an average uplift of over 13 F1 points on classification tasks, boosting sentiment analysis from 35.3% to 87.5% and paraphrase detection from 56.1% to 87.0%. Second, reasoning-focused DeepSeek architectures outperform a strong GPT o4-mini baseline by an average of 12 F1 points on complex inference tasks in the zero-shot setting. Third, LoRA-based fine-tuning yields up to an additional 8 points in F1 and BLEU compared to equivalent increases in model scale. The code is available at https://anonymous.4open.science/r/AraReasoner41299
Problem

Research questions and friction points this paper is trying to address.

Evaluating reasoning-based LLMs for Arabic NLP tasks
Assessing performance on Arabic data with complex linguistic features
Comparing strategies like zero-shot, few-shot, and fine-tuning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmarking reasoning-focused LLMs on Arabic NLP tasks
Using zero-shot, few-shot, and fine-tuning strategies
DeepSeek models outperform a GPT o4-mini baseline in the zero-shot setting
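The LoRA fine-tuning contribution reduces to a low-rank additive update: the frozen weight W is augmented by a product B·A scaled by alpha/r, and only A and B are trained. A minimal sketch of that forward pass; dimensions, initialisation scales, and rank are illustrative assumptions, not the paper's settings:

```python
import numpy as np

# Hypothetical LoRA layer sketch: y = W x + (alpha/r) * B A x.
# W is frozen; only the low-rank factors A and B would be updated.
rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 8, 8, 2, 16  # illustrative dimensions and rank

W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection, zero-init

def lora_forward(x):
    # Base layer output plus the scaled low-rank correction.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B initialised to zero, the adapted layer matches the base layer exactly,
# so fine-tuning starts from the pretrained behaviour.
assert np.allclose(lora_forward(x), W @ x)
```

Zero-initialising B is the standard choice because it makes the adapter a no-op at the start of training; the paper's reported gains come from training such adapters rather than scaling the base model.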
Ahmed Hasanaath
Department of Information and Computer Science, King Fahd University of Petroleum and Minerals, Saudi Arabia
Aisha Alansari
Graduate Assistant, Information and Computer Science Department, KFUPM
Machine Learning · Natural Language Processing · Deep Learning · LLMs
Ahmed Ashraf
Department of Information and Computer Science, King Fahd University of Petroleum and Minerals, Saudi Arabia
Chafik Salmane
Mohammed VI Polytechnic University, Ben Guerir, Morocco
H. Luqman
SDAIA-KFUPM Joint Research Center for Artificial Intelligence, King Fahd University of Petroleum and Minerals, Saudi Arabia
Saad Ezzini
Department of Information and Computer Science, King Fahd University of Petroleum and Minerals, Saudi Arabia