🤖 AI Summary
This work addresses the challenge of silent defects, such as numerical inaccuracies or performance regressions, introduced by updates to deep learning libraries, which are difficult for downstream users to detect. To this end, the authors propose DepRadar, a novel framework that introduces a multi-agent collaboration mechanism for impact analysis of such defects. DepRadar operates through a three-stage pipeline: extracting semantic changes from code modifications, synthesizing structured defect patterns enriched with triggering conditions, and combining static program analysis with domain-specific deep learning rules to infer whether client programs are affected. Evaluated on 157 pull requests and 70 commits, DepRadar achieves 90% precision in defect identification and demonstrates 90% recall and 80% precision across 122 real-world client programs, substantially outperforming existing baseline approaches.
📝 Abstract
Deep learning libraries like Transformers and Megatron are now widely adopted in modern AI programs. However, when these libraries introduce defects, ranging from silent computation errors to subtle performance regressions, it is often challenging for downstream users to assess whether their own programs are affected. Such impact analysis requires not only understanding the defect semantics but also checking whether the client code satisfies complex triggering conditions involving configuration flags, runtime environments, and indirect API usage. We present DepRadar, an agent coordination framework for fine-grained defect and impact analysis in DL library updates. DepRadar coordinates four specialized agents across three steps: (1) the PR Miner and Code Diff Analyzer extract structured defect semantics from commits or pull requests, (2) the Orchestrator Agent synthesizes these signals into a unified defect pattern with trigger conditions, and (3) the Impact Analyzer checks downstream programs to determine whether the defect can be triggered. To improve accuracy and explainability, DepRadar integrates static analysis with DL-specific domain rules for defect reasoning and client-side tracing. We evaluate DepRadar on 157 PRs and 70 commits across two representative DL libraries. It achieves 90% precision in defect identification and generates high-quality structured fields (average field score 1.6). On 122 client programs, DepRadar identifies affected cases with 90% recall and 80% precision, substantially outperforming other baselines.
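The three-step pipeline can be illustrated with a minimal sketch. The class and function names below (`DefectPattern`, `mine_defect_semantics`, `synthesize_pattern`, `is_affected`) and the toy substring-based trigger check are hypothetical, invented for illustration; they are not DepRadar's actual API, and a real implementation would use LLM agents and proper static analysis rather than string matching.

```python
from dataclasses import dataclass

# Hypothetical sketch of the three-stage pipeline: (1) mine defect
# semantics from a diff, (2) synthesize a defect pattern with trigger
# conditions, (3) check whether a client program can trigger it.

@dataclass
class DefectPattern:
    api: str                       # library API whose behavior changed
    description: str               # semantics of the defect
    trigger_conditions: list       # conditions a client must satisfy

def mine_defect_semantics(diff: str) -> dict:
    """Stage 1 (PR Miner / Code Diff Analyzer): extract structured
    signals from a commit or pull-request diff (toy heuristic)."""
    api = next((line.split("def ")[1].split("(")[0]
                for line in diff.splitlines()
                if line.startswith("-def ")), "unknown")
    return {"api": api, "change": "numerical behavior modified"}

def synthesize_pattern(signals: dict) -> DefectPattern:
    """Stage 2 (Orchestrator Agent): unify signals into one defect
    pattern; the fp16 condition is purely illustrative."""
    return DefectPattern(
        api=signals["api"],
        description=signals["change"],
        trigger_conditions=[signals["api"], "fp16"],
    )

def is_affected(client_source: str, pattern: DefectPattern) -> bool:
    """Stage 3 (Impact Analyzer): naive static check that every
    trigger condition appears in the client source."""
    return all(cond in client_source
               for cond in pattern.trigger_conditions)

diff = "-def fused_softmax(x):\n+def fused_softmax(x, stable=True):"
pattern = synthesize_pattern(mine_defect_semantics(diff))
client = "y = fused_softmax(x)  # model trained in fp16"
print(is_affected(client, pattern))  # True: both conditions match
```

The sketch mirrors the division of labor in the abstract: defect extraction, pattern synthesis, and client-side impact checking are kept as separate stages so each can be reasoned about (and evaluated) independently.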