🤖 AI Summary
Problem: Arabic dialects form a linguistic continuum, yet most NLP models treat them as discrete categories; existing continuous measures such as ALDi capture dialectness along a single dimension and do not reflect how widely a lexical item is shared across dialects. Method: We propose the Arabic Generality Score (AGS), a complementary measure that quantifies how widely a word is used across dialects. Using a parallel corpus, we build a word-level AGS annotation pipeline that combines word alignment, etymology-aware edit distance, and smoothing, then train a regression model to predict AGS in context. Contribution/Results: On a multi-dialect benchmark, the approach outperforms strong baselines, including state-of-the-art dialect identification systems. AGS offers a scalable, linguistically grounded measure of lexical generality that enriches representations of Arabic dialectness.
📝 Abstract
Arabic dialects form a diverse continuum, yet NLP models often treat them as discrete categories. Recent work addresses this issue by modeling dialectness as a continuous variable, notably through the Arabic Level of Dialectness (ALDi). However, ALDi reduces complex variation to a single dimension. We propose a complementary measure: the Arabic Generality Score (AGS), which quantifies how widely a word is used across dialects. We introduce a pipeline that combines word alignment, etymology-aware edit distance, and smoothing to annotate a parallel corpus with word-level AGS. A regression model is then trained to predict AGS in context. Our approach outperforms strong baselines, including state-of-the-art dialect ID systems, on a multi-dialect benchmark. AGS offers a scalable, linguistically grounded way to model lexical generality, enriching representations of Arabic dialectness.
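To make the idea of a word-level generality score concrete, here is a minimal Python sketch. It is not the authors' pipeline: the abstract does not give the scoring formula, so everything below is an assumption made for illustration. Plain Levenshtein distance stands in for the etymology-aware edit distance, the word alignment is assumed to be already done (the `aligned` dictionary), additive smoothing with a hypothetical `alpha` parameter stands in for the unspecified smoothing step, and the `max_norm_dist` threshold and the transliterated toy words are invented.

```python
from typing import Dict


def levenshtein(a: str, b: str) -> int:
    """Plain edit distance; an assumed stand-in for an etymology-aware variant."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def is_same_item(a: str, b: str, max_norm_dist: float = 0.34) -> bool:
    """Treat two forms as the same lexical item if their length-normalized
    edit distance is small. The threshold is a hypothetical choice."""
    longest = max(len(a), len(b))
    if longest == 0:
        return True
    return levenshtein(a, b) / longest <= max_norm_dist


def generality_score(word: str, aligned: Dict[str, str], alpha: float = 0.0) -> float:
    """Share of dialects whose aligned word matches `word`.

    `aligned` maps a dialect label to the word aligned with `word` in that
    dialect's side of a parallel sentence (alignment assumed done upstream).
    `alpha` applies additive smoothing that pulls the score toward 0.5.
    """
    if not aligned:
        raise ValueError("need at least one aligned dialect")
    matches = sum(is_same_item(word, other) for other in aligned.values())
    return (matches + alpha) / (len(aligned) + 2 * alpha)


# Toy, made-up transliterations; dialect labels A-D are placeholders.
widely_shared = {"A": "kitab", "B": "ktab", "C": "kitaab", "D": "ktab"}
regional = {"A": "zwin", "B": "hilw", "C": "hilu", "D": "jamil"}

print(generality_score("kitab", widely_shared))
print(generality_score("zwin", regional))
```

In this sketch a score near 1 means the form, or a close variant of it, appears in every dialect's translation, while a score near 0 means it is confined to one dialect. The regression model described in the abstract would then be trained to predict such scores from a word in context, which this sketch does not attempt.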