🤖 AI Summary
This study addresses the underperformance of large language models (LLMs) on morphosyntactic tagging and dependency parsing in Modern Standard Arabic, a language characterized by rich morphology and orthographic ambiguity. It presents the first systematic evaluation of instruction-tuned LLMs on these tasks under zero-shot prompting and retrieval-augmented in-context learning (ICL). Through carefully designed prompts and example selection strategies, experiments are conducted on the Arabic Treebank. Results show that proprietary models approach supervised performance on morphological feature tagging and are competitive with specialized dependency parsers. Notably, retrieval-based ICL substantially improves tokenization, tagging, and parsing accuracy on raw text, highlighting both the potential and the limitations of LLMs in handling intricate morphosyntactic interactions.
📝 Abstract
Large language models (LLMs) perform strongly on many NLP tasks, but their ability to produce explicit linguistic structure remains unclear. We evaluate instruction-tuned LLMs on two structured prediction tasks for Standard Arabic: morphosyntactic tagging and labeled dependency parsing. Arabic provides a challenging testbed due to its rich morphology and orthographic ambiguity, which create strong morphology-syntax interactions. We compare zero-shot prompting with retrieval-based in-context learning (ICL) using examples from Arabic treebanks. Results show that prompt design and demonstration selection strongly affect performance: proprietary models approach supervised baselines for feature-level tagging and become competitive with specialized dependency parsers. In raw-text settings, tokenization remains challenging, though retrieval-based ICL improves both parsing and tokenization. Our analysis highlights which aspects of Arabic morphosyntax and syntax LLMs capture reliably and which remain difficult.
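The retrieval-based ICL setup described above can be sketched in miniature: retrieve the treebank examples most similar to the input sentence and prepend them as demonstrations. This is an illustrative sketch only, not the paper's implementation; the function names and the token-overlap (Jaccard) similarity are assumptions, and a real system might instead use embedding-based retrieval over Arabic treebank data.

```python
# Hypothetical sketch of retrieval-based in-context learning for tagging.
# Similarity here is token-level Jaccard overlap, an assumption for
# illustration; the paper's actual retrieval method may differ.

def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two whitespace-tokenized sentences."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def retrieve_examples(query, pool, k=2):
    """Return the k (sentence, analysis) pairs most similar to the query."""
    return sorted(pool, key=lambda ex: jaccard(query, ex[0]), reverse=True)[:k]

def build_prompt(query, pool, k=2):
    """Assemble a prompt: instruction, retrieved demonstrations, then query."""
    lines = ["Tag each word with its morphosyntactic features."]
    for sent, analysis in retrieve_examples(query, pool, k):
        lines.append(f"Sentence: {sent}\nAnalysis: {analysis}")
    lines.append(f"Sentence: {query}\nAnalysis:")
    return "\n\n".join(lines)

if __name__ == "__main__":
    # Toy English pool for readability; real demonstrations would come
    # from Arabic treebank sentences with their gold analyses.
    pool = [
        ("the cat sleeps", "the/DET cat/NOUN sleeps/VERB"),
        ("a dog runs fast", "a/DET dog/NOUN runs/VERB fast/ADV"),
        ("the cat runs", "the/DET cat/NOUN runs/VERB"),
    ]
    print(build_prompt("the cat runs fast", pool, k=2))
```

The zero-shot condition corresponds to `k=0` (no demonstrations), which isolates how much the retrieved examples themselves contribute to tagging and parsing accuracy.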