METAMON: Finding Inconsistencies between Program Documentation and Behavior using Metamorphic LLM Queries

📅 2025-02-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Code documentation frequently exhibits semantic inconsistency with actual program behavior, and, being non-executable, cannot be automatically verified. To address this, the authors propose an automated detection method that integrates metamorphic testing with large language model (LLM)-based self-consistent reasoning. The approach first employs search-based testing to generate behavior-covering test cases and assertions; it then uses an LLM to reason, along multiple paths, about the semantic consistency between the documentation and the generated test assertions, identifying inconsistencies via self-consistent voting. The method requires no human annotation and enables end-to-end detection of semantic-level inconsistencies. Evaluated on 9,482 real-world code–documentation pairs from five open-source projects, it achieves a precision of 0.72 and a recall of 0.48.
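The self-consistent voting step described above can be sketched as a majority vote over several independent LLM reasoning paths. This is an illustrative sketch, not the paper's implementation: `query_llm` is a hypothetical caller-supplied function that returns one verdict per sampled reasoning path, and `toy_llm` is a stand-in oracle used only to make the example runnable.

```python
from collections import Counter

def classify_with_self_consistency(query_llm, doc, assertion, n_paths=5):
    """Sample several independent reasoning paths and return the majority verdict.

    query_llm(doc, assertion) -> "consistent" | "inconsistent"  (hypothetical API)
    """
    votes = Counter(query_llm(doc, assertion) for _ in range(n_paths))
    label, _count = votes.most_common(1)[0]
    return label

# Toy stand-in for an LLM: flags assertions that contradict a documented range.
def toy_llm(doc, assertion):
    if "non-negative" in doc and "returns -1" in assertion:
        return "inconsistent"
    return "consistent"

doc = "Returns a non-negative index of the element."
print(classify_with_self_consistency(toy_llm, doc, "assert indexOf(x) returns -1 on miss"))
# -> inconsistent
```

In practice the LLM is sampled with nonzero temperature so the paths genuinely differ; the vote then damps out individual reasoning errors.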

📝 Abstract
Code documentation can, if written precisely, help developers better understand the code they accompany. However, unlike code, code documentation cannot be automatically verified via execution, potentially leading to inconsistencies between documentation and the actual behavior. While such inconsistencies can be harmful for the developer's understanding of the code, checking and finding them remains a costly task due to the involvement of human engineers. This paper proposes METAMON, which uses an existing search-based test generation technique to capture the current program behavior in the form of test cases, and subsequently uses LLM-based code reasoning to identify the generated regression test oracles that are not consistent with the program specifications in the documentation. METAMON is supported in this task by metamorphic testing and self-consistency. An empirical evaluation against 9,482 pairs of code documentation and code snippets, generated using five open-source projects from Defects4J v2.0.1, shows that METAMON can classify the code-and-documentation inconsistencies with a precision of 0.72 and a recall of 0.48.
Problem

Research questions and friction points this paper is trying to address.

Unlike code, documentation cannot be verified by execution, so it can silently drift out of sync with actual program behavior.
Such inconsistencies mislead developers who rely on the documentation to understand the code.
Checking for them manually is costly because it requires human engineers to compare specifications against behavior.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Metamorphic testing over LLM queries
LLM-based code reasoning with self-consistent voting
Search-based regression test generation to capture current behavior
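The metamorphic-testing idea can be illustrated with a minimal sketch. This is an assumed example of a metamorphic relation over LLM queries, not the paper's actual relations: mutating the documentation so that it negates the documented property should flip the model's consistency verdict for the same test assertion. `toy_llm` and `negate` are hypothetical stand-ins.

```python
def metamorphic_check(query_llm, doc, assertion, negate):
    """Check one illustrative metamorphic relation: negating the documented
    property should flip the verdict. Returns True if the relation holds."""
    base_verdict = query_llm(doc, assertion)
    follow_up_verdict = query_llm(negate(doc), assertion)
    return base_verdict != follow_up_verdict

# Toy oracle: "inconsistent" iff the doc promises non-negative results
# but the assertion expects -1.
def toy_llm(doc, assertion):
    if "non-negative" in doc and "-1" in assertion:
        return "inconsistent"
    return "consistent"

def negate(doc):
    # Meaning-inverting mutation of the documented property.
    return doc.replace("non-negative", "possibly negative")

doc = "Returns a non-negative index."
assertion = "assert f(x) == -1 when x is absent"
print(metamorphic_check(toy_llm, doc, assertion, negate))
# -> True (the relation holds for this pair)
```

When the relation is violated, i.e. the verdict does not change under a meaning-inverting mutation, the model's reasoning for that pair is unreliable, which is the kind of signal metamorphic testing contributes on top of plain querying.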