🤖 AI Summary
This study addresses the text-comprehension challenges faced by individuals with intellectual disabilities by proposing an accessibility-oriented vision-language co-optimization method. To this end, we design five structured prompt templates aligned with the Web Content Accessibility Guidelines (WCAG), which map simplified textual inputs to highly comprehensible images. We systematically investigate the interplay among visual style, data source, and semantic alignment. Experiments are conducted on a sentence-level dataset of 400 samples, evaluated via both CLIPScore-based automatic metrics and expert human annotation. Results indicate that the *Basic Object Focus* template achieves the best semantic alignment, that the *Retro* visual style significantly enhances image comprehensibility, and that Wikipedia proves the most suitable data source for accessibility objectives. This work delivers the first reproducible prompt-engineering framework and empirically grounded design guidelines for AI-driven accessible content generation.
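To make the template idea concrete, here is a minimal sketch of how a structured prompt template with accessibility constraints might look. The wording, function name, and parameters are illustrative assumptions, not the templates used in the study.

```python
def basic_object_focus_prompt(simplified_text: str,
                              style: str = "Retro",
                              max_objects: int = 3) -> str:
    """Hypothetical 'Basic Object Focus'-style template: wraps a
    simplified sentence in accessibility constraints (object count
    limit, spatial separation, content restrictions)."""
    return (
        f"A {style.lower()}-style illustration of: {simplified_text} "
        f"Show at most {max_objects} clearly separated objects on a plain background. "
        "No embedded text, no clutter, no abstract symbols."
    )

prompt = basic_object_focus_prompt("A dog runs in the park.")
```

The point of templating is reproducibility: the same constraints are applied uniformly across all input sentences, so differences in output quality can be attributed to the template, style, or data source rather than to ad hoc prompt wording.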
📝 Abstract
Individuals with intellectual disabilities often have difficulty comprehending complex texts. While many text-to-image models prioritize aesthetics over accessibility, it remains unclear how visual illustrations relate to the text simplifications (TS) from which they are generated. This paper presents a structured vision-language model (VLM) prompting framework for generating accessible images from simplified texts. We designed five prompt templates, namely Basic Object Focus, Contextual Scene, Educational Layout, Multi-Level Detail, and Grid Layout, each following a distinct spatial arrangement while adhering to accessibility constraints such as object count limits, spatial separation, and content restrictions. Using 400 sentence-level simplifications from four established TS datasets (OneStopEnglish, SimPA, Wikipedia, and ASSET), we conducted a two-phase evaluation: Phase 1 assessed prompt template effectiveness with CLIPScores, and Phase 2 involved human annotation of generated images across ten visual styles by four accessibility experts. Results show that the Basic Object Focus template achieved the highest semantic alignment, indicating that visual minimalism enhances language accessibility. Expert evaluation further identified the Retro style as the most accessible and Wikipedia as the most effective data source. Inter-annotator agreement varied across dimensions, with Text Simplicity showing strong reliability and Image Quality proving more subjective. Overall, our framework offers practical guidelines for accessible content generation and underscores the importance of structured prompting in AI-generated visual accessibility tools.
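The Phase 1 metric can be sketched from the standard CLIPScore definition (Hessel et al., 2021): a rescaled, clipped cosine similarity between CLIP image and text embeddings. The snippet below shows only this scoring formula on precomputed embedding vectors; obtaining the embeddings from an actual CLIP model is assumed and not shown here.

```python
import numpy as np

def clipscore(image_emb, text_emb, w: float = 2.5) -> float:
    """CLIPScore formula: w * max(cos(image_emb, text_emb), 0).

    image_emb / text_emb are assumed to be CLIP embedding vectors
    of the generated image and the simplified source sentence.
    """
    image_emb = np.asarray(image_emb, dtype=float)
    text_emb = np.asarray(text_emb, dtype=float)
    cos = image_emb @ text_emb / (
        np.linalg.norm(image_emb) * np.linalg.norm(text_emb)
    )
    return w * max(float(cos), 0.0)  # negative similarity is clipped to 0
```

Because the score is bounded below by zero and scaled by `w`, higher values indicate tighter semantic alignment between the generated image and the simplified sentence, which is how template effectiveness is compared in Phase 1.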