🤖 AI Summary
EPUB e-books commonly lack accurate alt text for images, severely impeding accessibility for visually impaired users. To address this, we propose an end-to-end multimodal AI pipeline—the first to jointly leverage CLIP/ViT visual features and surrounding textual context for context-aware, linguistically coherent alt text generation. Our method comprises multi-stage feature alignment, Transformer-based text generation, and a hybrid evaluation framework co-designed with visually impaired users—integrating objective metrics (e.g., BLEU, cosine similarity) with iterative usability feedback. Experiments demonstrate a 97.5% reduction in accessibility errors, alongside significant improvements in BLEU (+24.3%) and visual–semantic cosine similarity (+31.7%). User studies confirm substantial gains in document comprehension and interactive usability, with our approach outperforming state-of-the-art baselines across all dimensions.
📝 Abstract
Digital accessibility is a cornerstone of inclusive content delivery, yet many EPUB files fail to meet fundamental accessibility standards, particularly in providing descriptive alt text for images. Alt text plays a critical role in enabling visually impaired users to understand visual content through assistive technologies. However, generating high-quality alt text at scale is a resource-intensive process, creating significant challenges for organizations aiming to ensure accessibility compliance. This paper introduces AltGen, a novel AI-driven pipeline designed to automate the generation of alt text for images in EPUB files. By integrating state-of-the-art generative models, including advanced transformer-based architectures, AltGen achieves contextually relevant and linguistically coherent alt text descriptions. The pipeline encompasses multiple stages, starting with data preprocessing to extract and prepare relevant content, followed by visual analysis using computer vision models such as CLIP and ViT. The extracted visual features are enriched with contextual information from surrounding text, enabling the fine-tuned language models to generate descriptive and accurate alt text. Validation of the generated output employs both quantitative metrics, such as cosine similarity and BLEU scores, and qualitative feedback from visually impaired users. Experimental results demonstrate the efficacy of AltGen across diverse datasets, achieving a 97.5% reduction in accessibility errors and high scores in similarity and linguistic fidelity metrics. User studies highlight the practical impact of AltGen, with participants reporting significant improvements in document usability and comprehension. Furthermore, comparative analyses reveal that AltGen outperforms existing approaches in terms of accuracy, relevance, and scalability.