Foundation Models as Oracles for Refactoring Correctness Detection

📅 2026-05-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses behavioral changes and compilation errors introduced by automated refactoring tools, which traditional checks based on handcrafted preconditions and static or dynamic analysis can miss because they adapt poorly to new cases. It uses foundation models as zero-shot oracles to judge refactoring correctness across 47 Java refactoring types, with no task-specific training and no refactoring-specific rules. The evaluation covers open and proprietary models, including GPT, Gemma, and Gemini variants, on 226 real refactoring bugs from IntelliJ-IDEA, Eclipse, and NetBeans. In the first-run setting, GPT-OSS-20B reaches 80.5% accuracy and GPT-5.4 reaches 93.8%; Gemma-4-31B is the strongest open model and Gemini-3.1-Pro-Preview is the best overall. Metamorphic testing shows that predictions are largely consistent under intended semantics-preserving code variations, and the models can give short explanations, which positions them as lightweight triage aids that complement traditional refactoring checks.

📝 Abstract

Refactoring tools in popular Integrated Development Environments (IDEs) can introduce unintended behavioral changes or compilation errors, a persistent challenge that undermines developer trust in automated transformations. Traditional detection approaches rely on handcrafted preconditions and on static and dynamic analyses, yet they remain limited in adaptability and can miss subtle correctness issues. This study examines the potential of foundation models to serve as oracles for detecting refactoring bugs in Java programs. We evaluate zero-shot prompting, without task-specific training, across 226 real refactoring bugs collected over more than a decade from widely used Java IDEs (IntelliJ-IDEA, Eclipse, and NetBeans), spanning 47 refactoring types. Our results indicate that foundation models can be effective for this task, although performance varies across models. In the first-run setting, GPT-OSS-20B achieved 80.5% accuracy, while GPT-5.4 reached 93.8%. We also evaluated other open and proprietary models: Gemma-4-31B achieved the strongest result among open models, and Gemini-3.1-Pro-Preview achieved the best overall result among all evaluated models. Metamorphic testing further shows that model predictions are largely consistent under intended semantics-preserving code variations, suggesting that superficial pattern matching may not fully account for the observed behavior. Beyond detection accuracy, foundation models can provide short explanations that may help support developer inspection, operate across refactoring types without explicitly encoded refactoring-specific rules, and may serve as lightweight triage aids in development workflows. Our findings suggest that foundation models can complement traditional refactoring checks by flagging suspicious transformations for developer inspection.
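
To make the zero-shot oracle and the metamorphic consistency check concrete, here is a minimal Python sketch. It is illustrative only and is not the authors' implementation: the prompt wording, the BUG/OK reply convention, the identifier-rename variation, and every function name are assumptions, and the two "oracles" are deterministic stand-ins for a real model call.

```python
import re
from typing import Callable

# Hypothetical zero-shot prompt; the study's actual prompt wording is not given here.
PROMPT_TEMPLATE = (
    "An IDE applied the refactoring '{kind}' to a Java program.\n"
    "Before:\n{before}\n\nAfter:\n{after}\n\n"
    "Does the result compile and preserve behavior? "
    "Start your answer with BUG or OK, then give a one-sentence reason."
)


def build_prompt(kind: str, before: str, after: str) -> str:
    """Fill the template; no worked examples are included, hence zero-shot."""
    return PROMPT_TEMPLATE.format(kind=kind, before=before, after=after)


def parse_verdict(reply: str) -> bool:
    """Return True when the reply flags the refactoring as buggy (assumed reply format)."""
    return reply.strip().upper().startswith("BUG")


def rename_identifier(code: str, old: str, new: str) -> str:
    """Naive whole-word rename. Assumes `new` is unused in the code and that
    `old` does not occur inside string literals or comments."""
    return re.sub(rf"\b{re.escape(old)}\b", lambda _: new, code)


def is_consistent(oracle: Callable[[str], str], kind: str,
                  before: str, after: str, old: str, new: str) -> bool:
    """Metamorphic check: the verdict should not change when the same
    semantics-preserving rename is applied to both program versions."""
    base = parse_verdict(oracle(build_prompt(kind, before, after)))
    variant = parse_verdict(oracle(build_prompt(
        kind,
        rename_identifier(before, old, new),
        rename_identifier(after, old, new),
    )))
    return base == variant


# Toy Java snippets for an "Extract Local Variable" refactoring.
BEFORE = "int f(int x) { return x + 1; }"
AFTER = "int f(int x) { int tmp = x + 1; return tmp; }"


def constant_oracle(prompt: str) -> str:
    """Stand-in for a model that ignores surface details."""
    return "OK: behavior is unchanged."


def brittle_oracle(prompt: str) -> str:
    """Stand-in for a model that keys on a variable name (surface pattern matching)."""
    return "BUG: suspicious name." if "tmp" in prompt else "OK: looks fine."


print(is_consistent(constant_oracle, "Extract Local Variable", BEFORE, AFTER, "tmp", "sum"))  # True
print(is_consistent(brittle_oracle, "Extract Local Variable", BEFORE, AFTER, "tmp", "sum"))   # False
```

In practice `oracle` would wrap a call to whichever model is being evaluated. The brittle stand-in shows the failure mode that the metamorphic test is meant to expose: a verdict that flips when only a variable name changes.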

Problem

Research questions and friction points this paper is trying to address.

refactoring correctness

behavioral changes

compilation errors

automated transformations

developer trust

Innovation

Methods, ideas, or system contributions that make the work stand out.

foundation models

refactoring correctness

zero-shot prompting