Human or LLM? A Comparative Study on Accessible Code Generation Capability

📅 2025-03-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) are increasingly used for web development, yet their generated code’s accessibility compliance—particularly regarding color contrast, alternative text, and ARIA attributes—remains poorly understood and unquantified relative to human-authored code. Method: We systematically evaluate GPT-4o and Qwen2.5-Coder across accessibility dimensions and propose FeedA11y, a feedback-driven ReAct framework integrating multi-turn self-critique, zero/few-shot prompting, and axe-core–based automated validation to overcome the limitations of static prompting. Contribution/Results: Our evaluation reveals that while LLMs surpass human performance on foundational accessibility checks (e.g., GPT-4o achieves 91.4% color contrast compliance), they lag significantly on semantic aspects such as ARIA conformance. FeedA11y improves ARIA compliance by 37.2% and exceeds human benchmarks across all core accessibility metrics. This work is the first to empirically characterize the hierarchical accessibility capabilities of LLMs and to demonstrate closed-loop, automation-augmented optimization for accessible code generation.
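The "color contrast compliance" metric above refers to the WCAG 2.x contrast-ratio check that auditors such as axe-core automate. A minimal sketch of that computation (the helper names here are illustrative, not taken from the paper's tooling):

```python
# WCAG 2.x contrast ratio between two sRGB colors.
# This is the kind of check behind "color contrast compliance";
# function names are illustrative stand-ins.

def relative_luminance(rgb):
    """Relative luminance per WCAG 2.x; rgb as 0-255 integers."""
    def channel(c):
        c = c / 255.0
        return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = (channel(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(fg, bg):
    """(L_lighter + 0.05) / (L_darker + 0.05); ranges from 1:1 to 21:1."""
    lighter, darker = sorted(
        (relative_luminance(fg), relative_luminance(bg)), reverse=True
    )
    return (lighter + 0.05) / (darker + 0.05)

def passes_aa(fg, bg, large_text=False):
    """WCAG AA threshold: 4.5:1 for normal text, 3:1 for large text."""
    return contrast_ratio(fg, bg) >= (3.0 if large_text else 4.5)

print(round(contrast_ratio((0, 0, 0), (255, 255, 255)), 1))  # → 21.0
```

Black on white yields the maximum 21:1 ratio; a light gray such as rgb(200, 200, 200) on white falls well below the 4.5:1 AA threshold.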

📝 Abstract
Web accessibility is essential for inclusive digital experiences, yet the accessibility of LLM-generated code remains underexplored. This paper presents an empirical study comparing the accessibility of web code generated by GPT-4o and Qwen2.5-Coder-32B-Instruct-AWQ against human-written code. Results show that LLMs often produce more accessible code, especially for basic features like color contrast and alternative text, but struggle with complex issues such as ARIA attributes. We also assess advanced prompting strategies (Zero-Shot, Few-Shot, Self-Criticism), finding they offer some gains but are limited. To address these gaps, we introduce FeedA11y, a feedback-driven ReAct-based approach that significantly outperforms other methods in improving accessibility. Our work highlights the promise of LLMs for accessible code generation and emphasizes the need for feedback-based techniques to address persistent challenges.
Problem

Research questions and friction points this paper is trying to address.

How accessible is LLM-generated web code compared with human-written code?
Do advanced prompting strategies (Zero-Shot, Few-Shot, Self-Criticism) close the accessibility gap?
Can a feedback-driven approach (FeedA11y) fix the accessibility issues that static prompting leaves behind?
Innovation

Methods, ideas, or system contributions that make the work stand out.

Empirical comparison of LLM and human code accessibility
Advanced prompting strategies for accessibility improvements
FeedA11y: Feedback-driven ReAct-based accessibility enhancement
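The feedback-driven idea behind FeedA11y can be sketched as a generate–audit–repair loop: produce code, run an automated accessibility audit, and fold the reported violations back into the next generation turn. The sketch below uses hypothetical stand-ins (`generate` for the LLM, `audit` for an axe-core-style checker); it is an illustration of the loop structure, not the paper's implementation:

```python
# Sketch of a feedback-driven repair loop in the spirit of FeedA11y:
# generate, audit, and regenerate with the violation report as feedback
# until the audit is clean or the turn budget runs out.

def feedback_loop(task, generate, audit, max_turns=3):
    """Iteratively regenerate code until `audit` reports no violations."""
    feedback = []                      # violation messages from the last audit
    code = generate(task, feedback)
    for _ in range(max_turns):
        violations = audit(code)       # e.g. rule IDs plus descriptions
        if not violations:
            return code, []            # audit is clean: stop early
        feedback = violations          # fold the report into the next prompt
        code = generate(task, feedback)
    return code, audit(code)

# Toy stand-ins: the "model" fixes a missing alt attribute once told about it.
def toy_generate(task, feedback):
    if any("image-alt" in v for v in feedback):
        return '<img src="logo.png" alt="Company logo">'
    return '<img src="logo.png">'

def toy_audit(code):
    return [] if "alt=" in code else ["image-alt: images must have alternate text"]

fixed, remaining = feedback_loop("render a logo", toy_generate, toy_audit)
print(remaining)  # → []
```

The key design point the paper argues for is exactly this closed loop: static prompting states accessibility requirements once, while the audit report gives the model concrete, instance-specific violations to repair on each turn.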