🤖 AI Summary
To address the challenge of deploying image captioning models for low-resource languages—such as Assamese—on resource-constrained edge devices, this work introduces the first end-to-end lightweight Assamese image description model. Methodologically, it replaces the computationally expensive Faster R-CNN with ShuffleNetV2×1.5 as the visual backbone and employs a GRU-based decoder augmented with bilinear attention to preserve semantic fidelity while drastically reducing computational overhead. The model is fine-tuned and evaluated on the newly constructed COCO-AC Assamese image captioning benchmark, achieving a CIDEr score of 82.3. It requires only 1.098 GFLOPs and 25.65M parameters, enabling real-time inference on edge devices. This work bridges a critical gap in on-device Assamese visual understanding and establishes a reusable, lightweight paradigm for advancing AI accessibility in low-resource languages.
📝 Abstract
Neural networks have significantly advanced AI applications, yet their real-world adoption remains constrained by high computational demands, hardware limitations, and accessibility challenges. In image captioning, many state-of-the-art models achieve impressive performance while relying on resource-intensive architectures, making them impractical for deployment on resource-constrained devices. This limitation is particularly noticeable for applications involving low-resource languages. We demonstrate the case of image captioning in the Assamese language, where the lack of effective, scalable systems can restrict the accessibility of AI-based solutions for native Assamese speakers. This work presents AC-Lite, a computationally efficient model for image captioning in low-resource Assamese. AC-Lite reduces computational requirements by replacing computation-heavy visual feature extractors like Faster R-CNN with the lightweight ShuffleNetV2×1.5. Additionally, Gated Recurrent Units (GRUs) are used as the caption decoder to further reduce computational demands and model parameters. Furthermore, the integration of bilinear attention enhances the model's overall performance. AC-Lite can operate on edge devices, thereby eliminating the need for computation on remote servers. The proposed AC-Lite model achieves an 82.3 CIDEr score on the COCO-AC dataset with 1.098 GFLOPs and 25.65M parameters.
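The bilinear attention mentioned above scores each visual feature against the decoder's hidden state through a learned bilinear form (roughly, score = hᵀ·W·v), then forms an attention-weighted context vector for the GRU. The sketch below illustrates that general mechanism in plain Python; the function name, dimensions, and the choice of a softmax-weighted sum are illustrative assumptions, not the paper's exact formulation.

```python
import math

def bilinear_attention(hidden, feats, W):
    """Generic bilinear attention sketch (not AC-Lite's exact layer).

    hidden : decoder hidden state, length-d list
    feats  : list of visual feature vectors, each length-k
    W      : d x k bilinear weight matrix (learned in a real model)
    Returns (attention weights, weighted context vector).
    """
    def score(h, v):
        # bilinear form h^T W v
        return sum(h[i] * sum(W[i][j] * v[j] for j in range(len(v)))
                   for i in range(len(h)))

    scores = [score(hidden, v) for v in feats]
    # numerically stable softmax over the region scores
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # context vector: attention-weighted sum of the visual features
    context = [sum(w * v[j] for w, v in zip(weights, feats))
               for j in range(len(feats[0]))]
    return weights, context
```

In a full captioning model, `context` would be concatenated with the word embedding at each decoding step and fed to the GRU, so the decoder attends to different image regions as it generates each Assamese token.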