🤖 AI Summary
Inconsistencies between source code and its documentation often lead to misinterpretations and software defects. Existing large language model (LLM)-based detection suffers from high false-positive rates: it frequently flags legitimate semantic gaps, such as those between high-level documentation and low-level implementation detail, as errors. This paper proposes DocPrism, a lightweight, multi-language (Python, TypeScript, C++, Java) inconsistency detection tool. Its core contributions are: (1) Local Categorization, a context-local prompting strategy that guides the LLM to produce fine-grained, semantically grounded classifications and so avoids relying on long-range reasoning; and (2) External Filtering, which applies domain-informed, rule-based post-processing to discard naturally occurring, non-defective discrepancies. The approach requires no LLM fine-tuning and relies solely on standard off-the-shelf models and localized prompts. In an ablation study, the method reduces the inconsistency flag rate from 98% to 14% and raises accuracy from 14% to 94%. In the broad multi-language evaluation, the tool maintains a low flag rate of 15% and achieves a precision of 0.62.
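The two steps above can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the category names, the prompt wording, the `REPORTABLE` set, and the `fake_llm` stand-in are hypothetical and are not taken from the paper, which does not list its categories or filter rules here. The sketch only shows the general shape: the model assigns one local label per documentation/code pair, and ordinary code outside the model decides which labels are reported.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

# Hypothetical category labels; the paper's actual taxonomy is not given here.
CONTRADICTION = "contradiction"
OMITTED_DETAIL = "omitted_detail"  # docs are simply higher-level than the code
CONSISTENT = "consistent"
KNOWN = {CONTRADICTION, OMITTED_DETAIL, CONSISTENT}

# External filter (an assumption): only these categories are reported.
REPORTABLE = {CONTRADICTION}


@dataclass
class Finding:
    doc: str
    code: str
    category: str


def categorize_locally(doc: str, code: str, llm: Callable[[str], str]) -> str:
    """Ask the model for a single category label for one doc/code pair."""
    prompt = (
        "Documentation:\n" + doc + "\n\nCode:\n" + code + "\n\n"
        "Category (contradiction, omitted_detail, consistent):"
    )
    label = llm(prompt).strip().lower()
    # Treat anything unrecognised as consistent rather than flagging it.
    return label if label in KNOWN else CONSISTENT


def detect(pairs: Iterable[Tuple[str, str]], llm: Callable[[str], str]) -> List[Finding]:
    """Categorize every pair, then filter outside the model."""
    findings = [Finding(d, c, categorize_locally(d, c, llm)) for d, c in pairs]
    return [f for f in findings if f.category in REPORTABLE]


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call, so the sketch runs offline."""
    if "returns None" in prompt and "return 0" in prompt:
        return "contradiction"
    return "omitted_detail"


pairs = [
    ("f returns None on failure.", "def f():\n    return 0"),
    ("Sorts the list.", "def g(xs):\n    xs.sort()"),
]
flagged = detect(pairs, fake_llm)
```

Here the second pair is labeled as an omitted detail and dropped by the filter, so only the first pair is reported. Keeping the filter in plain code means the reporting policy can change without touching the prompt.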
📝 Abstract
Code-documentation inconsistencies are common and undesirable: they can lead to developer misunderstandings and software defects. This paper introduces DocPrism, a multi-language, code-documentation inconsistency detection tool. DocPrism uses a standard large language model (LLM) to analyze and explain inconsistencies. Plain use of LLMs for this task yields unacceptably high false-positive rates: LLMs identify natural gaps between high-level documentation and detailed code implementations as inconsistencies. We introduce and apply the Local Categorization, External Filtering (LCEF) methodology to reduce false positives. LCEF relies on the LLM's local completion skills rather than its long-term reasoning skills. In our ablation study, LCEF reduces DocPrism's inconsistency flag rate from 98% to 14%, and increases accuracy from 14% to 94%. On a broad evaluation across Python, TypeScript, C++, and Java, DocPrism maintains a low flag rate of 15%, and achieves a precision of 0.62 without performing any fine-tuning.
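The three metrics quoted above are related through the usual confusion-matrix definitions, sketched below in Python. The counts are hypothetical and chosen only so that the arithmetic lands on the reported percentages; they are not the paper's data, and the true number of inconsistent items in the ablation set is an assumption. The sketch illustrates why a detector that flags almost everything has accuracy close to the base rate of real inconsistencies.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    tp: int  # flagged and truly inconsistent
    fp: int  # flagged but actually consistent
    fn: int  # not flagged but truly inconsistent
    tn: int  # not flagged and consistent

    @property
    def total(self) -> int:
        return self.tp + self.fp + self.fn + self.tn

    def flag_rate(self) -> float:
        """Share of all items that the tool flags."""
        return (self.tp + self.fp) / self.total

    def precision(self) -> float:
        """Share of flagged items that are truly inconsistent."""
        flagged = self.tp + self.fp
        return self.tp / flagged if flagged else 0.0

    def accuracy(self) -> float:
        """Share of all items that are classified correctly."""
        return (self.tp + self.tn) / self.total


# Hypothetical counts over 100 items, 12 of them truly inconsistent.
without_lcef = Confusion(tp=12, fp=86, fn=0, tn=2)   # flags 98 of 100
with_lcef = Confusion(tp=10, fp=4, fn=2, tn=84)      # flags 14 of 100

for name, c in [("without LCEF", without_lcef), ("with LCEF", with_lcef)]:
    print(name, round(c.flag_rate(), 2), round(c.accuracy(), 2))
```

Under these assumed counts, flagging 98 of 100 items gives 14% accuracy because only the 12 real inconsistencies and 2 unflagged consistent items are classified correctly, while flagging 14 items with few mistakes gives 94%.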