AFRICAPTION: Establishing a New Paradigm for Image Captioning in African Languages

📅 2025-10-20
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
African languages—representing a significant portion of the world’s linguistic diversity—are severely underrepresented in multimodal AI, particularly in image captioning, due to scarce annotated data and limited model support. Method: This work introduces the first large-scale vision-to-language framework for 20 African languages. It constructs a semantically aligned, high-quality multilingual image-caption dataset; designs a dynamic quality assurance pipeline integrating context-aware translation, model ensembling (SigLIP + NLLB-200), and adaptive token replacement; and develops a unified, 0.5B-parameter vision-to-text architecture optimized for low-resource settings. Contribution/Results: We release the first open-source, African-language–focused image captioning dataset and corresponding pre-trained models. Our framework establishes a new multilingual generation paradigm that balances accuracy and scalability, achieving substantial performance gains on cross-modal tasks for low-resource languages. This advances inclusive, equitable multimodal AI development and sets a foundation for future research in under-resourced language modalities.
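To make the ensembling step concrete: one way to combine the two models is to let NLLB-200 produce candidate captions while SigLIP scores image-text alignment, gating what enters the dataset. The sketch below is a minimal illustration assuming the public google/siglip-base-patch16-224 checkpoint on Hugging Face; the cosine-similarity scoring, the threshold value, and the gate's back-off behavior are assumptions for illustration, not the paper's reported settings.

```python
import torch
from PIL import Image
from transformers import SiglipModel, SiglipProcessor

CKPT = "google/siglip-base-patch16-224"  # assumed checkpoint
model = SiglipModel.from_pretrained(CKPT)
processor = SiglipProcessor.from_pretrained(CKPT)

@torch.no_grad()
def alignment_score(image: Image.Image, caption: str) -> float:
    """Cosine similarity between SigLIP image and text embeddings."""
    inputs = processor(text=[caption], images=[image],
                       padding="max_length", return_tensors="pt")
    out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())

THRESHOLD = 0.25  # hypothetical cut-off; would be tuned on held-out data

def gate(image: Image.Image, candidates: list[str]) -> str | None:
    """Keep the best-aligned candidate caption, or return None to route
    the image-caption pair to the adaptive-substitution stage."""
    score, best = max((alignment_score(image, c), c) for c in candidates)
    return best if score >= THRESHOLD else None
```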

📝 Abstract
Multimodal AI research has overwhelmingly focused on high-resource languages, hindering the democratization of advancements in the field. To address this, we present AfriCaption, a comprehensive framework for multilingual image captioning in 20 African languages. Our contributions are threefold: (i) a curated dataset built on Flickr8k, featuring semantically aligned captions generated via a context-aware selection and translation process; (ii) a dynamic, context-preserving pipeline that ensures ongoing quality through model ensembling and adaptive substitution; and (iii) the AfriCaption model, a 0.5B-parameter vision-to-text architecture that integrates SigLIP and NLLB-200 for caption generation across under-represented languages. This unified framework ensures ongoing data quality and establishes the first scalable image-captioning resource for under-represented African languages, laying the groundwork for truly inclusive multimodal AI.
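As a rough picture of contribution (iii), the sketch below wires a SigLIP vision encoder to an NLLB-200 decoder through a learned projection, so that image features stand in for the translator's usual source-text encoding. This is a minimal sketch under assumed Hugging Face checkpoints; the projection layer, the encoder-output injection, and the Yoruba language code yor_Latn are illustrative choices, not the paper's confirmed wiring or parameter budget.

```python
import torch
import torch.nn as nn
from transformers import AutoModelForSeq2SeqLM, SiglipVisionModel
from transformers.modeling_outputs import BaseModelOutput

class VisionToText(nn.Module):
    """One plausible SigLIP-to-NLLB coupling (assumed, for illustration)."""
    def __init__(self):
        super().__init__()
        self.vision = SiglipVisionModel.from_pretrained(
            "google/siglip-base-patch16-224")
        self.nllb = AutoModelForSeq2SeqLM.from_pretrained(
            "facebook/nllb-200-distilled-600M")
        # Map SigLIP patch features into the NLLB decoder's model width.
        self.proj = nn.Linear(self.vision.config.hidden_size,
                              self.nllb.config.d_model)

    @torch.no_grad()
    def caption(self, pixel_values, tokenizer, lang="yor_Latn"):
        # Image patches play the role of the "source sentence" encoding.
        feats = self.vision(pixel_values=pixel_values).last_hidden_state
        enc = BaseModelOutput(last_hidden_state=self.proj(feats))
        # NLLB selects the output language via a forced BOS token.
        bos = tokenizer.convert_tokens_to_ids(lang)
        ids = self.nllb.generate(encoder_outputs=enc,
                                 forced_bos_token_id=bos, max_new_tokens=48)
        return tokenizer.batch_decode(ids, skip_special_tokens=True)
```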
Problem

Research questions and friction points this paper is trying to address.

Addressing the image-captioning gap for under-represented African languages
Creating a scalable multilingual framework spanning 20 African languages
Establishing the first comprehensive dataset for inclusive multimodal AI
Innovation

Methods, ideas, or system contributions that make the work stand out.

Curated Flickr8k-based dataset with semantically aligned multilingual captions (translation step sketched below)
Dynamic pipeline using model ensembling for quality preservation
Vision-to-text model integrating the SigLIP and NLLB-200 architectures
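The translation step behind the first item could, under standard NLLB-200 usage, look like the sketch below. The distilled checkpoint, the Flickr8k-style example caption, and the Hausa target code hau_Latn are illustrative choices; the paper's context-aware selection logic is not reproduced here.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

NAME = "facebook/nllb-200-distilled-600M"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(NAME, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(NAME)

# A Flickr8k-style English caption to carry into a target language.
caption = "A child in a pink dress is climbing up a set of stairs."

inputs = tokenizer(caption, return_tensors="pt")
out = model.generate(
    **inputs,
    # Force decoding to start in Hausa; any of the 20 target languages
    # would be selected the same way via its NLLB language code.
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("hau_Latn"),
    max_new_tokens=64,
)
print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
```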
Mardiyyah Oduwole
ML Collective
ML Efficiency · NLP for Low-Resource Languages & Social Good
Prince Mireku
ML Collective, Ashesi University
Fatimo Adebanjo
ML Collective
Oluwatosin Olajide
ML Collective
Mahi Aminu Aliyu
ML Collective, Abubakar Tafawa Balewa University
Jekaterina Novikova
Vanguard Group
Natural Language Processing · Trustworthy AI · Machine Learning for Health