🤖 AI Summary
Large language models (LLMs) exhibit low reliability when translating natural-language legal regulations into executable logic for safety-critical domains such as federal tax software, primarily due to ambiguity and hallucination.
Method: We propose an automated verification framework that integrates metamorphic testing, leveraging higher-order metamorphic relations, with role-based multi-agent collaboration. The framework combines LLM-driven agent specialization, counterexample-guided test generation, and joint verification of code correctness.
Contribution/Results: Evaluated on complex U.S. tax-law tasks, our implementation using GPT-4o-mini achieves a worst-case pass rate of 45%, substantially outperforming GPT-4o (9%) and Claude 3.5 Sonnet (15%). This is the first systematic demonstration that the multi-agent plus metamorphic-testing paradigm significantly improves both validity and robustness in legal-sensitive code generation, addressing critical trust barriers in regulated AI applications.
📝 Abstract
Large language models (LLMs) show promise for translating natural-language statutes into executable logic, but achieving reliability in legally critical settings remains difficult due to ambiguity and hallucination. We present an agentic approach for developing legal-critical software, using U.S. federal tax preparation as a case study. The key challenge is test-case generation under the oracle problem, where determining correct outputs requires interpreting the law. Building on metamorphic testing, we introduce higher-order metamorphic relations that compare system outputs across structured shifts among similar individuals. Because authoring such relations is tedious and error-prone, we use an LLM-driven, role-based framework to automate test generation and code synthesis. We implement a multi-agent system that translates tax code into executable software and incorporates a metamorphic-testing agent that searches for counterexamples. In experiments, our framework using a smaller model (GPT-4o-mini) achieves a worst-case pass rate of 45%, outperforming frontier models (GPT-4o and Claude 3.5, 9–15%) on complex tax-code tasks. These results support agentic LLM methodologies as a path to robust, trustworthy legal-critical software from natural-language specifications.
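To make the idea concrete, a metamorphic relation in this setting might state that, of two otherwise-identical filers, the one with higher income should never owe less tax. The sketch below is illustrative only: `compute_tax` is a hypothetical stand-in for the LLM-generated software under test (its brackets and rates are invented, not real tax law), and `search_counterexamples` mimics, in miniature, the metamorphic-testing agent's counterexample search described in the abstract.

```python
import random

def compute_tax(income: float) -> float:
    """Toy progressive tax schedule; placeholder for the generated code
    under test. Bracket thresholds and rates are illustrative only."""
    brackets = [(0, 0.10), (10_000, 0.20), (40_000, 0.30)]
    tax = 0.0
    # Tax each slice of income at its bracket's rate, top bracket first.
    for lower, rate in reversed(brackets):
        if income > lower:
            tax += (income - lower) * rate
            income = lower
    return tax

def monotonic_relation(income_a: float, income_b: float) -> bool:
    """Higher-order metamorphic relation comparing two similar filers:
    the higher earner should owe at least as much tax."""
    low, high = sorted((income_a, income_b))
    return compute_tax(high) >= compute_tax(low)

def search_counterexamples(trials: int = 1000, seed: int = 0) -> list:
    """Randomly sample pairs of similar filers (a structured income shift)
    and collect any pair that violates the relation."""
    rng = random.Random(seed)
    found = []
    for _ in range(trials):
        base = rng.uniform(0, 100_000)
        shift = rng.uniform(0, 5_000)
        if not monotonic_relation(base, base + shift):
            found.append((base, base + shift))
    return found
```

Note that the relation needs no oracle: it never asserts what the correct tax for either filer is, only how the two outputs must relate, which is what makes metamorphic testing viable when correct outputs would require interpreting the law.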