An LLM Agentic Approach for Legal-Critical Software: A Case Study for Tax Prep Software

📅 2025-09-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) exhibit low reliability when translating natural-language legal regulations into executable logic for safety-critical domains such as federal tax software, primarily due to ambiguity and hallucination. Method: The authors propose an automated verification framework integrating metamorphic testing—leveraging higher-order metamorphic relations—with role-based multi-agent collaboration. The framework combines LLM-driven agent specialization, counterexample-guided test generation, and joint verification of code correctness. Contribution/Results: Evaluated on complex U.S. tax-law tasks, their implementation using GPT-4o-mini achieves a worst-case pass rate of 45%, substantially outperforming GPT-4o (9%) and Claude 3.5 Sonnet (15%). This is presented as the first systematic demonstration that the multi-agent + metamorphic-testing paradigm significantly enhances both validity and robustness in legally sensitive code generation, addressing a key trust barrier in regulated AI applications.

📝 Abstract
Large language models (LLMs) show promise for translating natural-language statutes into executable logic, but reliability in legally critical settings remains challenging due to ambiguity and hallucinations. We present an agentic approach for developing legal-critical software, using U.S. federal tax preparation as a case study. The key challenge is test-case generation under the oracle problem, where correct outputs require interpreting law. Building on metamorphic testing, we introduce higher-order metamorphic relations that compare system outputs across structured shifts among similar individuals. Because authoring such relations is tedious and error-prone, we use an LLM-driven, role-based framework to automate test generation and code synthesis. We implement a multi-agent system that translates tax code into executable software and incorporates a metamorphic-testing agent that searches for counterexamples. In experiments, our framework using a smaller model (GPT-4o-mini) achieves a worst-case pass rate of 45%, outperforming frontier models (GPT-4o and Claude 3.5, 9-15%) on complex tax-code tasks. These results support agentic LLM methodologies as a path to robust, trustworthy legal-critical software from natural-language specifications.
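To make the metamorphic-testing idea concrete, the sketch below shows what a relation over "structured shifts among similar individuals" might look like. Everything here is hypothetical: `compute_tax` is a toy two-bracket schedule standing in for the generated tax software, and the two relations (monotonicity, and a higher-order bound comparing the outputs of two shifted individuals) are illustrative, not the paper's actual relations.

```python
# Hypothetical sketch of metamorphic relations for a tax computation.
# `compute_tax` is a toy stand-in, NOT the paper's generated software:
# a two-bracket progressive schedule (10% up to 10,000, 20% above).

def compute_tax(income: float) -> float:
    """Toy progressive tax schedule."""
    if income <= 10_000:
        return 0.10 * income
    return 0.10 * 10_000 + 0.20 * (income - 10_000)

def monotonicity_relation(income: float, delta: float) -> bool:
    """Relation over two similar individuals: the one with higher
    income must not owe less tax."""
    return compute_tax(income + delta) >= compute_tax(income)

def bounded_growth_relation(income: float, delta: float) -> bool:
    """Higher-order relation comparing the *difference* in outputs
    across a structured income shift: the extra tax from an extra
    `delta` of income must not exceed `delta` (marginal rate < 100%)."""
    return compute_tax(income + delta) - compute_tax(income) <= delta

# A tiny counterexample search over a grid of inputs, mirroring the
# paper's counterexample-guided test generation in miniature: any
# (income, delta) pair that violates a relation is a bug witness,
# without needing a ground-truth oracle for the correct tax amount.
violations = [
    (inc, d)
    for inc in range(0, 50_000, 2_500)
    for d in (1, 100, 5_000)
    if not (monotonicity_relation(inc, d) and bounded_growth_relation(inc, d))
]
print(violations)  # an empty list means no relation was violated
```

The point of the sketch is the oracle-free check: neither relation requires knowing the legally correct tax for any individual, only that outputs for related individuals move in a lawful direction.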
Problem

Research questions and friction points this paper is trying to address.

Addressing LLM reliability in legal-critical software development
Automating test generation for tax code interpretation challenges
Overcoming ambiguity and hallucinations in statutory translation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Agentic LLM framework for legal software
Higher-order metamorphic relations testing
Automated test generation via role-based agents