AI Summary
Existing multilingual dialogue benchmarks exhibit strong bias toward high-resource and Western-centric languages, neglecting cultural appropriateness and linguistic diversity in low-resource African languages.
Method: We introduce Injongo -- the first open-source, fully localized intent classification and slot filling benchmark covering 16 African languages, with utterances authored by native speakers and grounded in authentic scenarios (e.g., banking, travel). Unlike translation-based approaches, it employs a hybrid annotation pipeline combining multi-round human verification with GPT-4o-assisted labeling.
Contribution/Results: We conduct the first systematic evaluation of mainstream LLMs and fine-tuned models on African languages, revealing substantial performance gaps: GPT-4o achieves only a 26.0 F1-score for slot filling and 70.6% intent accuracy, whereas a culturally adapted multilingual Transformer model attains an 81.2 F1-score and 85.7% accuracy. These results demonstrate that natively collected, culturally grounded data yields critical gains for cross-lingual NLU transfer, establishing a new paradigm for low-resource language evaluation and modeling.
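The two metrics reported above can be made concrete with a minimal sketch: intent-detection accuracy is exact-match over one label per utterance, while slot-filling F1 is typically micro-averaged over predicted slot spans. The toy utterance labels below are hypothetical illustrations, not data from the Injongo benchmark.

```python
# Sketch of intent accuracy and span-level slot F1.
# All labels/spans here are illustrative toy data, not Injongo data.

def intent_accuracy(gold, pred):
    """Fraction of utterances whose predicted intent matches the gold intent."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def slot_f1(gold_spans, pred_spans):
    """Micro-averaged F1 over (slot_type, start, end) spans."""
    tp = fp = fn = 0
    for g, p in zip(gold_spans, pred_spans):
        g, p = set(g), set(p)
        tp += len(g & p)   # spans predicted with correct type and boundaries
        fp += len(p - g)   # spurious predictions
        fn += len(g - p)   # missed gold spans
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

gold_intents = ["book_flight", "check_balance", "book_flight"]
pred_intents = ["book_flight", "transfer_money", "book_flight"]

gold_slots = [[("destination", 3, 4)], [("account", 2, 3)], [("date", 5, 6)]]
pred_slots = [[("destination", 3, 4)], [], [("date", 5, 6), ("time", 7, 8)]]

acc = intent_accuracy(gold_intents, pred_intents)
f1 = slot_f1(gold_slots, pred_slots)
print(round(acc, 3), round(f1, 3))  # 0.667 0.667
```

Note that span-level F1 is stricter than token-level accuracy: a slot only counts as correct when its type and both boundaries match, which is one reason slot-filling scores sit far below intent accuracy in the results above.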
Abstract
Slot filling and intent detection are well-established tasks in Conversational AI. However, current large-scale benchmarks for these tasks often exclude evaluations of low-resource languages and rely on translations from English benchmarks, thereby predominantly reflecting Western-centric concepts. In this paper, we introduce Injongo -- a multicultural, open-source benchmark dataset for 16 African languages with utterances generated by native speakers across diverse domains, including banking, travel, home, and dining. Through extensive experiments, we benchmark fine-tuned multilingual transformer models and prompted large language models (LLMs), and show the advantage of leveraging African-cultural utterances over Western-centric utterances for improving cross-lingual transfer from English. Experimental results reveal that current LLMs struggle with the slot-filling task, with GPT-4o achieving an average performance of 26 F1-score. In contrast, intent detection performance is notably better, with an average accuracy of 70.6%, though it still falls behind the fine-tuned baselines. On English, GPT-4o and the fine-tuned baselines perform similarly on intent detection, both achieving an accuracy of approximately 81%. Our findings suggest that LLM performance still lags for many low-resource African languages, and more work is needed to further improve their downstream performance.
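To make the two tasks concrete, the sketch below shows the annotation format commonly used in slot-filling/intent-detection benchmarks: one intent label per utterance and one BIO tag per token, from which typed slot spans are recovered. The English banking utterance and label inventory are hypothetical illustrations, not actual Injongo annotations.

```python
# Hypothetical intent + slot annotation in the common BIO scheme
# (illustrative example only, not taken from the Injongo dataset).

utterance = ["transfer", "500", "naira", "to", "my", "savings", "account"]
intent = "transfer_money"                         # one label per utterance
slots = ["O", "B-amount", "I-amount", "O", "O", "B-account_type", "O"]

def bio_to_spans(tags):
    """Convert BIO tags into (slot_type, start, end) spans, end-exclusive."""
    spans, start, typ = [], None, None
    for i, t in enumerate(tags):
        if t.startswith("B-"):                    # a new span begins
            if start is not None:
                spans.append((typ, start, i))
            start, typ = i, t[2:]
        elif t.startswith("I-") and start is not None and t[2:] == typ:
            continue                              # span continues
        else:                                     # "O" or inconsistent tag
            if start is not None:
                spans.append((typ, start, i))
            start, typ = None, None
    if start is not None:                         # close a span at the end
        spans.append((typ, start, len(tags)))
    return spans

print(bio_to_spans(slots))  # [('amount', 1, 3), ('account_type', 5, 6)]
```

An intent classifier is scored by accuracy over the utterance-level labels, while a slot filler is scored by F1 over the recovered spans, which matches the two numbers reported in the abstract.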