🤖 AI Summary
This paper systematically evaluates large language models' (LLMs) capability to establish traceability links between software documentation (e.g., API references, user guides) and source code. To this end, it introduces a task-oriented "one-to-many matching" framework and constructs two novel benchmark datasets, Unity Catalog and Crawl4AI, enabling a comprehensive assessment of Claude 3.5 Sonnet, GPT-4o, and o3-mini across three subtasks: link identification, relationship explanation, and multi-hop chain reconstruction. Baselines include TF-IDF, BM25, and CodeBERT. The study identifies prevalent failure modes, including naming-assumption bias, phantom links, and overgeneralization of architectural patterns. Results show that the best-performing LLM achieves F1 scores of 79.4% and 80.4% on the two benchmarks, substantially outperforming the traditional baselines, and exceeds 97% partial accuracy in relationship explanation (though fully correct explanations range from 42.9% to 71.1%). Endpoint identification in multi-hop chains is robust, yet intermediate-hop precision remains limited, indicating a key area for improvement.
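To make the setup concrete, here is a minimal sketch of what a lexical baseline like TF-IDF does in the one-to-many matching framing: each documentation section is scored against every code file, and all files above a similarity threshold are kept as trace links. The toy corpus, file names, and threshold below are hypothetical illustrations, not taken from the paper's datasets.

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    # Naive lexical tokenizer: lowercase alphanumeric runs
    # (also splits snake_case identifiers).
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf(texts: list[str]) -> list[dict[str, float]]:
    counts = [Counter(tokenize(t)) for t in texts]
    n = len(texts)
    df = Counter(term for c in counts for term in c)
    # Smoothed IDF keeps weights positive even for ubiquitous terms.
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    vecs = []
    for cnt in counts:
        total = sum(cnt.values())
        vecs.append({t: (c / total) * idf[t] for t, c in cnt.items()})
    return vecs

def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * \
           math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

# Hypothetical toy corpus standing in for doc sections and code files.
docs = {
    "catalog-guide": "Create and list catalogs through the catalog endpoint.",
    "auth-guide": "Verify a user token before granting access.",
}
code = {
    "catalog.py": "def create_catalog(): ... def list_catalogs(): ... # catalog endpoint",
    "auth.py": "def verify_token(user, token): ... # access control",
}

texts = list(docs.values()) + list(code.values())
vecs = tfidf(texts)
doc_vecs = dict(zip(docs, vecs[: len(docs)]))
code_vecs = dict(zip(code, vecs[len(docs):]))

# One-to-many matching: each doc section keeps every code file
# whose cosine similarity clears an (arbitrary) threshold.
links = {
    d: [f for f in code if cosine(dv, code_vecs[f]) > 0.1]
    for d, dv in doc_vecs.items()
}
print(links)
```

The paper's point is precisely that purely lexical scores like this miss links whose connection is semantic rather than name-based, which is where the evaluated LLMs pull ahead.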
📝 Abstract
Large Language Models (LLMs) offer new potential for automating documentation-to-code traceability, yet their capabilities remain underexplored. We present a comprehensive evaluation of LLMs (Claude 3.5 Sonnet, GPT-4o, and o3-mini) in establishing trace links between various types of software documentation (including API references and user guides) and source code. We create two novel datasets from two open-source projects (Unity Catalog and Crawl4AI). Through systematic experiments, we assess three key capabilities: (1) trace link identification accuracy, (2) relationship explanation quality, and (3) multi-step chain reconstruction. Results show that the best-performing LLM achieves F1-scores of 79.4% and 80.4% across the two datasets, substantially outperforming our baselines (TF-IDF, BM25, and CodeBERT). While fully correct relationship explanations range from 42.9% to 71.1%, partial accuracy exceeds 97%, indicating that fundamental connections are rarely missed. For multi-step chains, LLMs maintain high endpoint accuracy but vary in capturing precise intermediate links. Error analysis reveals that many false positives stem from naming-based assumptions, phantom links, or overgeneralization of architectural patterns. We demonstrate that task framing, such as a one-to-many matching strategy, is critical for performance. These findings position LLMs as powerful assistants for trace discovery, but their limitations motivate human-in-the-loop tool design and highlight specific error patterns for future research.