Testing Framework Migration with Large Language Models

📅 2026-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes an automated approach leveraging large language models (LLMs) to migrate unittest-based test suites in Python projects to Pytest, aiming to reduce manual effort and accelerate test modernization. We introduce the first real-world dataset for unittest-to-Pytest migration and systematically evaluate the performance of GPT-4o and Claude Sonnet 4 under zero-shot, one-shot, and chain-of-thought prompting strategies. Experimental results show that 48.5% of LLM-generated migrations pass the original test suite: GPT-4o tends to aggressively refactor tests into a functional style, whereas Claude Sonnet 4 conservatively preserves the original class-based structure. These findings demonstrate the potential of LLMs as assistants in test migration tasks while underscoring the necessity of careful validation of their outputs.

📝 Abstract
Python developers rely on two major testing frameworks: unittest and Pytest. While Pytest offers simpler assertions, reusable fixtures, and better interoperability, migrating existing suites from unittest remains a manual and time-consuming process. Automating this migration could substantially reduce effort and accelerate test modernization. In this paper, we investigate the capability of Large Language Models (LLMs) to automate test framework migrations from unittest to Pytest. We evaluate GPT-4o and Claude Sonnet 4 under three prompting strategies (Zero-shot, One-shot, and Chain-of-Thought) and two temperature settings (0.0 and 1.0). To support this analysis, we first introduce a curated dataset of real-world migrations extracted from the top 100 Python open-source projects. Next, we execute the LLM-generated test migrations in their respective test suites. Overall, we find that 51.5% of the LLM-generated test migrations failed, while 48.5% passed. The results suggest that LLMs can accelerate test migration, but with important caveats. For example, Claude Sonnet 4 produced more conservative migrations (e.g., preserving class-based tests and legacy unittest references), while GPT-4o favored deeper transformations (e.g., to function-based tests). We conclude by discussing multiple implications for practitioners and researchers.
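To make the migration task concrete, here is a minimal sketch (not taken from the paper's dataset) of the kind of transformation being evaluated: a class-based unittest.TestCase, as Claude Sonnet 4 tends to preserve, rewritten in the function-based Pytest style the paper associates with GPT-4o. The fixture replaces setUp, plain assert replaces assertEqual, and pytest.raises replaces assertRaises.

```python
import unittest

import pytest


# --- Original unittest version (class-based, as in the source suites) ---
class TestStack(unittest.TestCase):
    def setUp(self):
        # Shared state recreated before every test method.
        self.stack = []

    def test_push(self):
        self.stack.append(1)
        self.assertEqual(self.stack, [1])

    def test_pop_empty(self):
        with self.assertRaises(IndexError):
            self.stack.pop()


# --- Migrated Pytest version (function-based) ---
@pytest.fixture
def stack():
    # Replaces setUp: each test function receives a fresh list.
    return []


def test_push(stack):
    stack.append(1)
    assert stack == [1]  # plain assert replaces assertEqual


def test_pop_empty(stack):
    with pytest.raises(IndexError):  # replaces assertRaises
        stack.pop()
```

Both versions are behaviorally equivalent, which is exactly the property the paper checks by running the migrated tests against their original suites; the caveats it reports (e.g., leftover unittest references or over-aggressive refactoring) arise when a generated migration breaks this equivalence.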
Problem

Research questions and friction points this paper is trying to address.

test migration
unittest
Pytest
Large Language Models
automation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Language Models
Test Migration
Pytest
unittest
Automated Testing