AgentModernize: Preserving Business Logic in Legacy Modernization with Multi-Agent LLMs and Behavioral Specification Graphs

📅 2026-05-17

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses legacy system modernization, where conventional approaches are typically limited to syntactic transformation and often fail to preserve critical business logic because they overlook implicit rules, edge-case handling, and cross-module constraints. To overcome these limitations, the authors propose AgentModernize, a multi-agent framework that reframes modernization as a behavior preservation problem. The framework orchestrates four specialized agents (extraction, specification, code generation, and validation) and introduces, for the first time, a Behavioral Specification Graph (BSG) as an auditable intermediate representation that makes extracted business logic explicit and inspectable before any code is generated. On the LegacyModernize-8 benchmark (eight scenarios spanning telecom and banking), full AgentModernize with feedback is the only configuration with a non-zero mean Behavior Equivalence Rate (BER) under all three backbone models: 9.4% with GPT-4o-mini, 8.1% with GPT-4o, and 19.4% with GPT-5.3-codex. Its BSG captures 91.2% of the gold-standard behavioral rules, which points to code generation rather than extraction as the bottleneck.
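To make the idea of an auditable intermediate representation more concrete, here is a minimal sketch of what a rule graph with an audit step and a gold-rule coverage check could look like. The page does not describe the actual BSG schema, so every name below (`Rule`, `BehavioralSpecGraph`, `dangling_edges`, `coverage`, and the example rule ids) is a hypothetical illustration, not the authors' design.

```python
from dataclasses import dataclass, field


@dataclass
class Rule:
    """One extracted business rule. Field names are hypothetical."""
    rule_id: str
    description: str
    depends_on: list[str] = field(default_factory=list)  # edges to other rules


@dataclass
class BehavioralSpecGraph:
    """Hypothetical stand-in for a BSG: rules as nodes, dependencies as edges."""
    rules: dict[str, Rule] = field(default_factory=dict)

    def add(self, rule: Rule) -> None:
        self.rules[rule.rule_id] = rule

    def dangling_edges(self) -> list[tuple[str, str]]:
        """Return (rule, dependency) pairs whose dependency is not in the graph."""
        return [
            (rule.rule_id, dep)
            for rule in self.rules.values()
            for dep in rule.depends_on
            if dep not in self.rules
        ]

    def coverage(self, gold_rule_ids: set[str]) -> float:
        """Fraction of gold-standard rule ids that the graph contains."""
        if not gold_rule_ids:
            return 0.0
        return len(gold_rule_ids & set(self.rules)) / len(gold_rule_ids)


# Illustrative rules only; they are not taken from LegacyModernize-8.
bsg = BehavioralSpecGraph()
bsg.add(Rule("grace_period", "No late fee within 5 days of the due date"))
bsg.add(Rule("late_fee", "Charge a late fee after the grace period",
             depends_on=["grace_period"]))

gold = {"grace_period", "late_fee", "overdraft_cap"}
bsg_coverage = bsg.coverage(gold)      # 2 of 3 gold rules captured
bsg_dangling = bsg.dangling_edges()    # empty: every dependency resolves
```

The point of such a structure is that both checks run before any code is generated, so a missing rule or unresolved cross-rule constraint can be inspected in the specification rather than discovered in the translated code.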

📝 Abstract

Legacy modernization breaks business logic. Most tools and LLM-based approaches treat modernization as syntax translation, losing implicit rules, edge-case handling, and cross-module constraints. We present AgentModernize, a multi-agent framework that treats modernization as a behavioral preservation problem. Four specialized agents handle extraction, specification, code generation, and validation. The key intermediate artifact -- a Behavioral Specification Graph (BSG) -- forces extracted business logic to be explicit and inspectable before any code is generated. We evaluated on LegacyModernize-8, eight scenarios spanning telecom and banking, using three models (GPT-4o-mini, GPT-4o, GPT-5.3-codex) under a fair protocol: same gold-standard tests, 3 trials, temperature 0.0. Full AgentModernize with feedback was the only configuration with non-zero mean BER under every backbone. SP-LLM and CoT-LLM scored 0.0% on every scenario, on every backbone. AgentModernize without feedback scored 0.0% mean BER with GPT-4o-mini and GPT-5.3-codex; under GPT-4o it achieved non-zero BER only on S1 (44.4%; 5.6% mean over scenarios). Mean BER for full AgentModernize was 9.4% (mini), 8.1% (GPT-4o), and 19.4% (codex). The BSG captures 91.2% of gold-standard rules, confirming that the bottleneck is code generation, not extraction.
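The relation between the 44.4% on S1 and the 5.6% mean can be checked with a short calculation. This sketch assumes the mean is an unweighted average over the eight scenarios and that the other seven scored 0.0%, as "non-zero BER only on S1" implies; the labels S2 to S8 and the function name `mean_ber` are assumptions made for illustration.

```python
def mean_ber(per_scenario_ber: dict[str, float]) -> float:
    """Unweighted mean of per-scenario BER values, in percent."""
    return sum(per_scenario_ber.values()) / len(per_scenario_ber)


# AgentModernize without feedback under GPT-4o:
# S1 comes from the abstract, the remaining labels are assumed.
no_feedback_gpt4o = {f"S{i}": 0.0 for i in range(1, 9)}
no_feedback_gpt4o["S1"] = 44.4

mean = mean_ber(no_feedback_gpt4o)  # 44.4 / 8, about 5.55
```

Dividing the rounded 44.4% by eight gives about 5.55%, which matches the reported 5.6% if the mean was computed from the unrounded S1 value.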

Problem

Research questions and friction points this paper is trying to address.

legacy modernization

business logic preservation

behavioral specification

multi-agent LLMs

code generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-agent LLMs

behavioral preservation

Behavioral Specification Graph