Mining Large Language Models for Low-Resource Language Data: Comparing Elicitation Strategies for Hausa and Fongbe

📅 2026-04-14

🤖 AI Summary
This study asks whether usable text data for low-resource languages can be extracted from large language models, which are trained on data from these language communities yet expose that knowledge only through commercial APIs. It systematically compares six elicitation task types, including functional text, dialogue, and constrained generation, for Hausa and Fongbe across two commercial models: GPT-4o Mini and Gemini 2.5 Flash. GPT-4o Mini yields 6–41 times more usable target-language words per API call than Gemini 2.5 Flash. The best strategy depends on the language: Hausa responds best to functional-text and dialogue prompts, whereas Fongbe requires constrained generation prompts. The authors release all generated corpora and code.

📝 Abstract
Large language models (LLMs) are trained on data contributed by low-resource language communities, yet the linguistic knowledge encoded in these models remains accessible only through commercial APIs. This paper investigates whether strategic prompting can extract usable text data from LLMs for two West African languages: Hausa (Afroasiatic, approximately 80 million speakers) and Fongbe (Niger-Congo, approximately 2 million speakers). We systematically compare six elicitation task types across two commercial LLMs (GPT-4o Mini and Gemini 2.5 Flash). GPT-4o Mini extracts 6-41 times more usable target-language words per API call than Gemini. Optimal strategies differ by language: Hausa benefits from functional text and dialogue, while Fongbe requires constrained generation prompts. We release all generated corpora and code.
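
The "usable target-language words per API call" comparison can be illustrated with a minimal sketch. The abstract does not say how usability is judged, so the lexicon-membership filter, the function names, and the toy responses below are all assumptions made for illustration; they are not the authors' pipeline or data. The tokenizer is also deliberately simple and would split words at combining tone marks, which matters for Fongbe orthography.

```python
import re
from statistics import mean


def usable_words(text, lexicon):
    """Return the tokens in `text` that appear in a target-language lexicon.

    Assumption: "usable" means "found in a known word list". The paper may
    use a different criterion (e.g. language identification).
    """
    # Runs of Unicode letters, lowercased; digits and underscores are excluded.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [t for t in tokens if t in lexicon]


def words_per_call(responses, lexicon):
    """Mean number of usable words per response (one response per API call)."""
    return mean(len(usable_words(r, lexicon)) for r in responses)


# Toy, hand-written inputs for illustration only (not results from the paper).
toy_lexicon = {"sannu", "ina", "kwana", "gida", "ruwa"}
model_a_responses = ["Sannu! Ina kwana?", "Gida da ruwa."]
model_b_responses = ["Sannu, hello there.", "I cannot help."]

yield_a = words_per_call(model_a_responses, toy_lexicon)
yield_b = words_per_call(model_b_responses, toy_lexicon)
ratio = yield_a / yield_b  # how many times more usable words per call
print(f"model A: {yield_a}, model B: {yield_b}, ratio: {ratio}")
```

Computing this ratio separately for each task type and language would give one ratio per condition, which is one way a range such as the reported 6-41 times could arise.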

Problem

Research questions and friction points this paper is trying to address.

low-resource languages
language data extraction
large language models
Hausa
Fongbe

Innovation

Methods, ideas, or system contributions that make the work stand out.

prompting strategies
low-resource languages
language model elicitation
Hausa
Fongbe