🤖 AI Summary
This paper systematically evaluates large language models' (LLMs) capability to establish traceability links between software documentation (e.g., API references, user guides) and source code. To this end, it introduces a task-oriented "one-to-many matching" framework and constructs two novel benchmark datasets, Unity Catalog and Crawl4AI, enabling a comprehensive assessment of Claude 3.5 Sonnet, GPT-4o, and o3-mini across three subtasks: link identification, relationship explanation, and multi-hop chain reconstruction. Baselines include TF-IDF, BM25, and CodeBERT. The study identifies prevalent failure modes, including naming-assumption bias, phantom links, and overgeneralization of architectural patterns. Results show that the best-performing LLM achieves F1 scores of 79.4% and 80.4% on the two benchmarks, substantially outperforming the traditional baselines, and exceeds 97% partial accuracy in relationship explanation (though fully correct explanations range from 42.9% to 71.1%). Endpoint identification in multi-hop chains is robust, yet intermediate-hop precision remains limited, indicating a key area for improvement.
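To make the setup concrete, here is a minimal sketch of what a lexical baseline like TF-IDF does in the one-to-many matching framing: each documentation section is scored against every code file, and all files above a similarity threshold are kept as trace links. The toy corpus, file names, and threshold below are hypothetical illustrations, not taken from the paper's datasets.

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    # Naive lexical tokenizer: lowercase alphanumeric runs
    # (also splits snake_case identifiers).
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf(texts: list[str]) -> list[dict[str, float]]:
    counts = [Counter(tokenize(t)) for t in texts]
    n = len(texts)
    df = Counter(term for c in counts for term in c)
    # Smoothed IDF keeps weights positive even for ubiquitous terms.
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    vecs = []
    for cnt in counts:
        total = sum(cnt.values())
        vecs.append({t: (c / total) * idf[t] for t, c in cnt.items()})
    return vecs

def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * \
           math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

# Hypothetical toy corpus standing in for doc sections and code files.
docs = {
    "catalog-guide": "Create and list catalogs through the catalog endpoint.",
    "auth-guide": "Verify a user token before granting access.",
}
code = {
    "catalog.py": "def create_catalog(): ... def list_catalogs(): ... # catalog endpoint",
    "auth.py": "def verify_token(user, token): ... # access control",
}

texts = list(docs.values()) + list(code.values())
vecs = tfidf(texts)
doc_vecs = dict(zip(docs, vecs[: len(docs)]))
code_vecs = dict(zip(code, vecs[len(docs):]))

# One-to-many matching: each doc section keeps every code file
# whose cosine similarity clears an (arbitrary) threshold.
links = {
    d: [f for f in code if cosine(dv, code_vecs[f]) > 0.1]
    for d, dv in doc_vecs.items()
}
print(links)
```

The paper's point is precisely that purely lexical scores like this miss links whose connection is semantic rather than name-based, which is where the evaluated LLMs pull ahead.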
📝 Abstract
Large Language Models (LLMs) offer new potential for automating documentation-to-code traceability, yet their capabilities remain underexplored. We present a comprehensive evaluation of LLMs (Claude 3.5 Sonnet, GPT-4o, and o3-mini) in establishing trace links between various types of software documentation (including API references and user guides) and source code. We create two novel datasets from two open-source projects (Unity Catalog and Crawl4AI). Through systematic experiments, we assess three key capabilities: (1) trace link identification accuracy, (2) relationship explanation quality, and (3) multi-step chain reconstruction. Results show that the best-performing LLM achieves F1-scores of 79.4% and 80.4% across the two datasets, substantially outperforming our baselines (TF-IDF, BM25, and CodeBERT). While fully correct relationship explanations range from 42.9% to 71.1%, partial accuracy exceeds 97%, indicating that fundamental connections are rarely missed. For multi-step chains, LLMs maintain high endpoint accuracy but vary in capturing precise intermediate links. Error analysis reveals that many false positives stem from naming-based assumptions, phantom links, or overgeneralization of architectural patterns. We demonstrate that task framing, such as a one-to-many matching strategy, is critical for performance. These findings position LLMs as powerful assistants for trace discovery, but their limitations motivate human-in-the-loop tool design and highlight specific error patterns for future research.