🤖 AI Summary
English dominates the web, yet multilingual web pages—mixing English with local languages in both visible content and metadata—are increasingly prevalent. However, poor screen reader support for non-Latin scripts, erroneous speech synthesis, and inconsistent or missing language annotations severely impair accessibility for blind and low-vision users. Prior work has been hindered by the absence of large-scale multilingual web accessibility datasets. To address this, we introduce LangCrUX, the first large-scale multilingual web accessibility dataset, covering 120,000 popular websites across 12 languages. We further propose Kizuki, a language-aware automated detection tool implemented as a browser extension that integrates multilingual text detection, DOM parsing, and assistive technology behavior simulation. Our study is the first to systematically quantify the prevalence of language annotation inconsistencies across multilingual pages, and it empirically demonstrates Kizuki's improved detection accuracy over existing methods. These contributions advance both the standardization and the practical deployment of multilingual web accessibility evaluation.
📝 Abstract
English is the predominant language on the web, powering nearly half of the world's top ten million websites. Support for multilingual content is nevertheless growing, with many websites increasingly combining English with regional or native languages in both visible content and hidden metadata. This multilingualism introduces significant barriers for users with visual impairments: assistive technologies such as screen readers frequently lack robust support for non-Latin scripts and misrender or mispronounce non-English text, compounding accessibility challenges across diverse linguistic contexts. Yet large-scale studies of this issue have been limited by the lack of comprehensive datasets on multilingual web content. To address this gap, we introduce LangCrUX, the first large-scale dataset of 120,000 popular websites across 12 languages that primarily use non-Latin scripts. Leveraging this dataset, we conduct a systematic analysis of multilingual web accessibility and uncover widespread neglect of accessibility hints. We find that these hints often fail to reflect the language diversity of visible content, reducing the effectiveness of screen readers and limiting web accessibility. Finally, we propose Kizuki, a language-aware automated accessibility testing extension that accounts for the limited utility of language-inconsistent accessibility hints.
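The core inconsistency the paper studies—visible text whose script does not match the page's declared language annotation—can be illustrated with a minimal sketch. This is not Kizuki's actual implementation; the `EXPECTED` mapping and function names below are illustrative assumptions. The idea is to infer the dominant Unicode script of the visible text and compare it against the script implied by a declared BCP 47 language tag (e.g. an HTML `lang` attribute):

```python
import unicodedata

def dominant_script(text):
    """Return the most frequent Unicode script keyword among letters in text."""
    counts = {}
    for ch in text:
        if ch.isalpha():
            # Unicode character names lead with the script, e.g. 'DEVANAGARI LETTER KA'
            script = unicodedata.name(ch, "UNKNOWN").split(" ")[0]
            counts[script] = counts.get(script, 0) + 1
    return max(counts, key=counts.get) if counts else None

# Hypothetical mapping from language codes to expected script keywords
# (a real tool would need a far more complete table).
EXPECTED = {"en": "LATIN", "hi": "DEVANAGARI", "bn": "BENGALI", "ta": "TAMIL"}

def lang_mismatch(declared_lang, visible_text):
    """Flag a page whose declared language disagrees with its dominant script."""
    expected = EXPECTED.get(declared_lang.split("-")[0].lower())
    actual = dominant_script(visible_text)
    return expected is not None and actual is not None and expected != actual
```

For example, a page declaring `lang="en"` but rendering Devanagari body text would be flagged: `lang_mismatch("en", "नमस्ते दुनिया")` returns `True`, while `lang_mismatch("hi", "नमस्ते दुनिया")` returns `False`. A screen reader trusting the `en` annotation would pick an English speech synthesizer and mispronounce the text, which is the failure mode the abstract describes.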