AFRICAPTION: Establishing a New Paradigm for Image Captioning in African Languages

📅 2025-10-20
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
African languages—representing a significant portion of the world’s linguistic diversity—are severely underrepresented in multimodal AI, particularly in image captioning, due to scarce annotated data and limited model support. Method: This work introduces the first large-scale vision-to-language framework for 20 African languages. It constructs a semantically aligned, high-quality multilingual image-caption dataset; designs a dynamic quality assurance pipeline integrating context-aware translation, model ensembling (SigLIP + NLLB-200), and adaptive token replacement; and develops a unified, 0.5B-parameter vision-to-text architecture optimized for low-resource settings. Contribution/Results: We release the first open-source, African-language–focused image captioning dataset and corresponding pre-trained models. Our framework establishes a new multilingual generation paradigm that balances accuracy and scalability, achieving substantial performance gains on cross-modal tasks for low-resource languages. This advances inclusive, equitable multimodal AI development and sets a foundation for future research in under-resourced language modalities.
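To make the ensembling step concrete: one way to combine the two models is to let NLLB-200 produce candidate captions while SigLIP scores image-text alignment, gating what enters the dataset. The sketch below is a minimal illustration assuming the public google/siglip-base-patch16-224 checkpoint on Hugging Face; the cosine-similarity scoring, the threshold value, and the gate's back-off behavior are assumptions for illustration, not the paper's reported settings.

```python
import torch
from PIL import Image
from transformers import SiglipModel, SiglipProcessor

CKPT = "google/siglip-base-patch16-224"  # assumed checkpoint
model = SiglipModel.from_pretrained(CKPT)
processor = SiglipProcessor.from_pretrained(CKPT)

@torch.no_grad()
def alignment_score(image: Image.Image, caption: str) -> float:
    """Cosine similarity between SigLIP image and text embeddings."""
    inputs = processor(text=[caption], images=[image],
                       padding="max_length", return_tensors="pt")
    out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())

THRESHOLD = 0.25  # hypothetical cut-off; would be tuned on held-out data

def gate(image: Image.Image, candidates: list[str]) -> str | None:
    """Keep the best-aligned candidate caption, or return None to route
    the image-caption pair to the adaptive-substitution stage."""
    score, best = max((alignment_score(image, c), c) for c in candidates)
    return best if score >= THRESHOLD else None
```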

📝 Abstract
Multimodal AI research has overwhelmingly focused on high-resource languages, hindering the democratization of advancements in the field. To address this, we present AfriCaption, a comprehensive framework for multilingual image captioning in 20 African languages. Our contributions are threefold: (i) a curated dataset built on Flickr8k, featuring semantically aligned captions generated via a context-aware selection and translation process; (ii) a dynamic, context-preserving pipeline that ensures ongoing quality through model ensembling and adaptive substitution; and (iii) the AfriCaption model, a 0.5B-parameter vision-to-text architecture that integrates SigLIP and NLLB-200 for caption generation across under-represented languages. This unified framework ensures ongoing data quality and establishes the first scalable image-captioning resource for under-represented African languages, laying the groundwork for truly inclusive multimodal AI.
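As a rough picture of contribution (iii), the sketch below wires a SigLIP vision encoder to an NLLB-200 decoder through a learned projection, so that image features stand in for the translator's usual source-text encoding. This is a minimal sketch under assumed Hugging Face checkpoints; the projection layer, the encoder-output injection, and the Yoruba language code yor_Latn are illustrative choices, not the paper's confirmed wiring or parameter budget.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForSeq2SeqLM, SiglipVisionModel
from transformers.modeling_outputs import BaseModelOutput

class VisionToText(nn.Module):
    """One plausible SigLIP-to-NLLB coupling (assumed, for illustration)."""
    def __init__(self):
        super().__init__()
        self.vision = SiglipVisionModel.from_pretrained(
            "google/siglip-base-patch16-224")
        self.nllb = AutoModelForSeq2SeqLM.from_pretrained(
            "facebook/nllb-200-distilled-600M")
        # Map SigLIP patch features into the NLLB decoder's model width.
        self.proj = nn.Linear(self.vision.config.hidden_size,
                              self.nllb.config.d_model)

    @torch.no_grad()
    def caption(self, pixel_values, tokenizer, lang="yor_Latn"):
        # Image patches play the role of the "source sentence" encoding.
        feats = self.vision(pixel_values=pixel_values).last_hidden_state
        enc = BaseModelOutput(last_hidden_state=self.proj(feats))
        # NLLB selects the output language via a forced BOS token.
        bos = tokenizer.convert_tokens_to_ids(lang)
        ids = self.nllb.generate(encoder_outputs=enc,
                                 forced_bos_token_id=bos, max_new_tokens=48)
        return tokenizer.batch_decode(ids, skip_special_tokens=True)
```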
Problem

Research questions and friction points this paper is trying to address.

Addressing the image-captioning gap for under-represented African languages
Creating a scalable multilingual framework spanning 20 African languages
Establishing the first comprehensive dataset for inclusive multimodal AI
Innovation

Methods, ideas, or system contributions that make the work stand out.

Curated Flickr8k-based dataset with semantically aligned multilingual captions (translation step sketched below)
Dynamic pipeline using model ensembling for quality preservation
Vision-to-text model integrating the SigLIP and NLLB-200 architectures
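The translation step behind the first item could, under standard NLLB-200 usage, look like the sketch below. The distilled checkpoint, the Flickr8k-style example caption, and the Hausa target code hau_Latn are illustrative choices; the paper's context-aware selection logic is not reproduced here.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

NAME = "facebook/nllb-200-distilled-600M"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(NAME, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(NAME)

# A Flickr8k-style English caption to carry into a target language.
caption = "A child in a pink dress is climbing up a set of stairs."

inputs = tokenizer(caption, return_tensors="pt")
out = model.generate(
    **inputs,
    # Force decoding to start in Hausa; any of the 20 target languages
    # would be selected the same way via its NLLB language code.
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("hau_Latn"),
    max_new_tokens=64,
)
print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
```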
Mardiyyah Oduwole
ML Collective
ML Efficiency · NLP for Low-Resource Languages & Social Good
Prince Mireku
ML Collective, Ashesi University
Fatimo Adebanjo
ML Collective
Oluwatosin Olajide
ML Collective
Mahi Aminu Aliyu
ML Collective, Abubakar Tafawa Balewa University
Jekaterina Novikova
Vanguard Group
Natural Language Processing · Trustworthy AI · Machine Learning for Health