DLT-Corpus: A Large-Scale Text Collection for the Distributed Ledger Technology Domain

📅 2026-02-25
🤖 AI Summary
This study addresses the scarcity of large-scale, multi-source domain-specific text corpora in distributed ledger technology (DLT), which has hindered research on language understanding and technological evolution. To bridge this gap, we construct DLT-Corpus—the largest DLT corpus to date—comprising 22.12 million documents and 2.98 billion tokens, systematically integrating scientific literature, U.S. patents, and social media data. We further propose LedgerBERT, a domain-adaptive pre-trained language model tailored for DLT. Leveraging this resource, we conduct cross-modal association analyses that uncover the propagation pathway of innovations from scientific research to patents and then to social media, revealing that research and patenting activities operate independently of market fluctuations and drive long-term innovation. LedgerBERT achieves a 23% improvement over BERT-base on DLT-specific named entity recognition tasks. All data, models, and code are publicly released.


📝 Abstract
We introduce DLT-Corpus, the largest domain-specific text collection for Distributed Ledger Technology (DLT) research to date: 2.98 billion tokens from 22.12 million documents spanning scientific literature (37,440 publications), United States Patent and Trademark Office (USPTO) patents (49,023 filings), and social media (22 million posts). Existing Natural Language Processing (NLP) resources for DLT focus narrowly on cryptocurrency price prediction and smart contracts, leaving domain-specific language underexplored despite the sector's ~$3 trillion market capitalization and rapid technological evolution. We demonstrate DLT-Corpus's utility by analyzing technology emergence patterns and market-innovation correlations. Findings reveal that technologies originate in scientific literature before reaching patents and social media, following traditional technology transfer patterns. While social media sentiment remains overwhelmingly bullish even during crypto winters, scientific and patent activity grow independently of market fluctuations, tracking overall market expansion in a virtuous cycle where research precedes and enables economic growth that funds further innovation. We publicly release the full DLT-Corpus; LedgerBERT, a domain-adapted model achieving a 23% improvement over BERT-base on a DLT-specific Named Entity Recognition (NER) task; and all associated tools and code.
Problem

Research questions and friction points this paper is trying to address.

Distributed Ledger Technology · Natural Language Processing · domain-specific language · text corpus · DLT · Innovation

Contribution

Methods, ideas, or system contributions that make the work stand out.

DLT-Corpus · LedgerBERT · domain-adapted language model · Named Entity Recognition · distributed ledger technology