BloomAPR: A Bloom's Taxonomy-based Framework for Assessing the Capabilities of LLM-Powered APR Solutions

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing evaluation of LLM-based automated program repair (APR) suffers from two key limitations: static benchmarks (e.g., Defects4J) risk data contamination and fail to characterize how repair capability varies across reasoning levels. To address this, the authors propose BloomAPR, a dynamic APR evaluation framework grounded in Bloom's Taxonomy that brings educational cognitive theory into APR assessment. It defines four hierarchical capability layers: *Remember*, *Understand*, *Apply*, and *Analyze*, and combines synthetically generated defects with real-world scenarios for stratified testing. Experiments on Defects4J with ChatRepair and CigaR, each backed by GPT-3.5-Turbo, Llama-3.1, and StarCoder-2, show repair accuracy falling sharply from up to 81.57% at the *Remember* layer to only 13.46%–41.34% at the *Analyze* layer. This pronounced decline exposes the over-optimistic bias inherent in static evaluation and empirically supports the need for dynamic, cognitively grounded, layered assessment.

📝 Abstract
Recent advances in large language models (LLMs) have accelerated the development of AI-driven automated program repair (APR) solutions. However, these solutions are typically evaluated using static benchmarks such as Defects4J and SWE-bench, which suffer from two key limitations: (1) the risk of data contamination, potentially inflating evaluation results due to overlap with LLM training data, and (2) limited ability to assess APR capabilities in dynamic and diverse contexts. In this paper, we introduce BloomAPR, a novel dynamic evaluation framework grounded in Bloom's Taxonomy. Our framework offers a structured approach to assess the cognitive capabilities of LLM-powered APR solutions across progressively complex reasoning levels. Using Defects4J as a case study, we evaluate two state-of-the-art LLM-powered APR solutions, ChatRepair and CigaR, under three different LLMs: GPT-3.5-Turbo, Llama-3.1, and StarCoder-2. Our findings show that while these solutions exhibit basic reasoning skills and effectively memorize bug-fixing patterns (fixing up to 81.57% of bugs at the Remember layer), their performance improves on synthetically generated bugs (up to a 60.66% increase at the Understand layer). However, they perform worse under minor syntactic changes (fixing up to 43.32% at the Apply layer), and they struggle to repair similar bugs when these are injected into real-world projects (solving only 13.46% to 41.34% of bugs at the Analyze layer). These results underscore the urgent need for evolving benchmarks and provide a foundation for more trustworthy evaluation of LLM-powered software engineering solutions.
Problem

Research questions and friction points this paper is trying to address.

Assessing LLM-powered program repair solutions' cognitive capabilities across reasoning levels
Addressing limitations of static benchmarks, such as data contamination and the inability to evaluate repair in dynamic contexts
Evaluating performance degradation on syntactic changes and real-world bug repair scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

BloomAPR framework uses Bloom's Taxonomy for evaluation
It assesses APR solutions across progressive cognitive levels
Framework dynamically tests bug-fixing in real-world contexts
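The layered evaluation idea above can be sketched as a simple per-layer aggregation of repair outcomes. The layer names follow Bloom's Taxonomy as used in the paper, but the function names, data layout, and toy numbers below are illustrative assumptions, not the paper's actual implementation:

```python
# Hypothetical sketch of a BloomAPR-style layered evaluation.
# Only the four layer names come from the paper; everything else
# (data shapes, function names, toy outcomes) is assumed for illustration.

BLOOM_LAYERS = ["Remember", "Understand", "Apply", "Analyze"]

def repair_rate(outcomes):
    """Fraction of bugs fixed, given per-bug booleans (True = plausible patch)."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0

def evaluate_by_layer(outcomes_by_layer):
    """Map each Bloom layer to the repair rate achieved on that layer's bug set.

    Each layer would draw on a different bug population: the original
    benchmark bugs at Remember, synthetic or mutated variants at
    Understand/Apply, and bugs injected into real projects at Analyze.
    """
    return {layer: repair_rate(outcomes_by_layer.get(layer, []))
            for layer in BLOOM_LAYERS}

# Toy outcomes echoing the reported trend: strong on memorized benchmark
# bugs, much weaker once similar bugs appear in real-world contexts.
outcomes = {
    "Remember":   [True] * 8 + [False] * 2,   # 80% fixed
    "Understand": [True] * 9 + [False] * 1,   # 90% fixed
    "Apply":      [True] * 4 + [False] * 6,   # 40% fixed
    "Analyze":    [True] * 2 + [False] * 8,   # 20% fixed
}
rates = evaluate_by_layer(outcomes)
```

Reporting one rate per layer, rather than a single benchmark-wide score, is what lets this kind of framework expose the gap between memorized fixes and genuine reasoning.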