Drop the Golden Apples: Identifying Third-Party Reuse by DB-Less Software Composition Analysis

📅 2025-03-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current Software Composition Analysis (SCA) heavily relies on static feature databases, resulting in poor detection coverage for obscure or emerging third-party libraries (TPLs), particularly in Android native libraries and C/C++ copy-based reuse scenarios. Method: This paper proposes the first database-less SCA (DB-Less SCA) framework, which eliminates prebuilt databases and instead leverages large language models (LLMs) to dynamically retrieve, compare, and validate cross-source semantic evidence from the open web—enabling multi-language (Java/Kotlin/C/C++) binary and source-code feature correlation. Contribution/Results: It is the first work to employ LLMs to emulate security analysts’ cross-source library identification; establishes a novel DB-Less SCA paradigm; and significantly improves robustness and coverage for TPLs absent from mainstream databases. Experimental evaluation demonstrates its feasibility and practicality in dynamic open-source ecosystems.

📝 Abstract
The prevalent use of third-party libraries (TPLs) in modern software development introduces significant security and compliance risks, which makes Software Composition Analysis (SCA) necessary to manage these threats. However, the accuracy of SCA tools relies heavily on the quality of the integrated feature database against which user projects are cross-referenced. As open-source ecosystems grow exponentially and large models are integrated into software development, maintaining a comprehensive feature database of potential TPLs becomes even more challenging. To this end, drawing on how LLM applications have evolved in their interactions with external data, we propose the first DB-Less SCA framework, which discards the traditional heavyweight database and uses the flexibility of LLMs to mimic the manual analysis of security analysts: retrieving matching evidence and confirming the identity of TPLs with supporting information from the open Internet. Our experiments on two typical scenarios, native library identification for Android and copy-based TPL reuse for C/C++, especially on lesser-known artifacts, demonstrate a promising future for database-less strategies in SCA.
Problem

Research questions and friction points this paper is trying to address.

Identifying third-party library reuse without traditional databases
Managing security risks in open-source ecosystems with LLMs
Validating TPL identity using internet-derived evidence
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes DB-Less SCA framework
Uses LLMs for TPL identification
Leverages open Internet for evidence
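
The retrieve-then-confirm idea listed above can be illustrated with a minimal sketch. This is not the paper's implementation: the function names (`extract_features`, `identify_tpl`), the vote threshold, the in-memory `FAKE_WEB` dictionary standing in for the open Internet, and the `fake_verify` rule standing in for an LLM judgment are all assumptions made for illustration, and `libexample` is a hypothetical library.

```python
import re
from collections import Counter
from typing import Callable, Optional

# Hypothetical stand-in for the open Internet: library name -> page text.
FAKE_WEB = {
    "libexample": "Docs: call libexample_init() first. Banner: libexample v2.3 build",
    "otherlib": "otherlib_open() and otherlib_close() manage handles.",
}


def extract_features(blob: bytes, min_len: int = 6) -> list[str]:
    """Collect unique printable ASCII strings of at least min_len characters."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    features: list[str] = []
    for match in re.findall(pattern, blob):
        text = match.decode("ascii")
        if text not in features:
            features.append(text)
    return features


def fake_search(feature: str) -> list[str]:
    """Assumed retrieval step: return libraries whose page contains the feature."""
    return [name for name, page in FAKE_WEB.items() if feature in page]


def fake_verify(candidate: str, feature: str) -> bool:
    """Assumed confirmation step (an LLM call in a real system): a trivial rule here."""
    return candidate in feature


def identify_tpl(
    features: list[str],
    search: Callable[[str], list[str]],
    verify: Callable[[str, str], bool],
    min_votes: int = 2,
) -> Optional[str]:
    """Return the candidate confirmed by at least min_votes features, else None."""
    votes: Counter = Counter()
    for feature in features:
        for candidate in search(feature):
            if verify(candidate, feature):
                votes[candidate] += 1
    if not votes:
        return None
    name, count = votes.most_common(1)[0]
    return name if count >= min_votes else None


# Toy "native library" bytes containing two embedded strings.
blob = b"\x00\x01libexample_init\x00\x7fELF\x00libexample v2.3 build\x00ab\x00"
features = extract_features(blob)
result = identify_tpl(features, fake_search, fake_verify)
```

Passing `search` and `verify` as callables keeps the loop independent of any particular search engine or model, so the stubs could be swapped for real retrieval and confirmation steps without changing `identify_tpl`.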
Authors

Lyuye Zhang
Postdoc, Nanyang Technological University
Program Analysis, Open source, Open source security, Software supply chain, Software maintenance
Chengwei Liu
Research Assistant Professor, Nanyang Technological University
Open Source Security, Software Supply Chain Security, Program Analysis, Software Maintenance
Jiahui Wu
College of Computing and Data Science, Nanyang Technological University, Singapore
Shiyang Zhang
College of Intelligence and Computing, Tianjin University, China
Chengyue Liu
College of Computing and Data Science, Nanyang Technological University, Singapore
Zhengzi Xu
Senior Research Fellow, Imperial College London
Software Engineering, Cyber Security, LLM, AI Trading
Sen Chen
Professor, Nankai University
Software Security, Vulnerability, Malware, Software Supply Chain Security
Yang Liu
College of Computing and Data Science, Nanyang Technological University, Singapore