AI Summary
This study addresses the scarcity of structured data on supply chain relationships in China, particularly for unlisted firms and long-tail inter-firm links. The authors propose a lightweight, evidence-based framework that uses search-engine snippets together with large language models to extract firm-level supplier–customer relationships and build an auditable, traceable supply chain knowledge graph. Compared with exhaustive full-text chunking, snippets need about 1/251 of the input tokens and produce less redundancy, although full text discovers 19.8 times more unique relationships. In the listed-firm subset, the resulting graph covers 7.2 times more firms and 9.3 times more relationships than the CSMAR disclosure-based benchmark.
Abstract
Financial and economic research often relies on structured supply-chain disclosures and commercial databases. In China, supplier--customer disclosure is typically limited to major partners of listed firms, leaving unlisted firms and long-tail inter-firm links poorly captured in structured data. Public web evidence can partly complement this gap through corporate, government, and trade-media disclosures; however, full-text web mining at scale is costly because pages are often inaccessible or expensive to process with large language models (LLMs). We propose a snippet-driven method for constructing a supply chain knowledge graph (SCKG), with firms as nodes and inter-firm relationships as edges. Web search snippets are query-biased summaries returned with search results. We use them as a scalable first-pass evidence layer for LLM-based relationship extraction. We evaluate the pipeline in terms of extraction efficiency and coverage. For extraction efficiency, exhaustive full-text chunking discovers 19.8$\times$ more unique relationships than snippets, but requires 251.2$\times$ more input tokens and yields higher redundancy. For coverage, we use 130,685 Chinese firms as search seeds, covering Shanghai/Shenzhen-listed firms and large unlisted firms as of 2024. In the listed-firm subset, the resulting SCKG covers 7.2$\times$ more firms and 9.3$\times$ more relationships than the CSMAR disclosure-based benchmark, while revealing heavy-tailed degree patterns. Retained provenance metadata make the SCKG an auditable complement to disclosure-based databases.
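As a rough illustration of the structure described above, the following Python sketch stores firms as nodes, supplier-to-customer relationships as directed edges, and keeps the supporting snippet with each edge. The abstract does not specify a schema, so the class names, the `Evidence` fields, and the firm names are all hypothetical. The helper at the end only restates the reported ratios: if full text finds 19.8$\times$ more relationships for 251.2$\times$ more tokens, snippets yield roughly 251.2 / 19.8 times more unique relationships per input token.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Evidence:
    """Provenance for one extracted relationship (hypothetical fields)."""
    url: str      # page the search result points to
    snippet: str  # query-biased summary shown with the result
    query: str    # search query that returned the snippet


@dataclass
class SupplyChainGraph:
    """Directed graph; key (supplier, customer) maps to its evidence list."""
    edges: dict[tuple[str, str], list[Evidence]] = field(default_factory=dict)

    def add_relation(self, supplier: str, customer: str, evidence: Evidence) -> None:
        # Repeated extractions of the same pair collapse into one edge,
        # while every supporting snippet is retained for auditing.
        self.edges.setdefault((supplier, customer), []).append(evidence)

    def firms(self) -> set[str]:
        # Every firm that appears as either endpoint of an edge.
        return {name for pair in self.edges for name in pair}

    def degree(self, firm: str) -> int:
        # Number of distinct edges that touch the firm.
        return sum(firm in pair for pair in self.edges)


def relative_yield_per_token(relation_ratio: float, token_ratio: float) -> float:
    """Snippet yield per input token relative to full text.

    Both arguments are full-text-to-snippet ratios, as reported in the abstract.
    """
    return token_ratio / relation_ratio


# Illustrative data only; these firms and URLs are made up.
graph = SupplyChainGraph()
graph.add_relation("Firm A", "Firm B",
                   Evidence("https://example.com/1", "Firm A supplies parts to Firm B", "Firm B supplier"))
graph.add_relation("Firm A", "Firm B",
                   Evidence("https://example.com/2", "Firm B sources parts from Firm A", "Firm A customer"))
graph.add_relation("Firm C", "Firm B",
                   Evidence("https://example.com/3", "Firm C is a vendor to Firm B", "Firm B supplier"))

print(len(graph.edges), graph.degree("Firm B"))
print(round(relative_yield_per_token(19.8, 251.2), 1))
```

Keying edges by the (supplier, customer) pair is one simple way to keep redundancy low without discarding provenance; a production pipeline would also need firm-name normalization, which this sketch omits.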