DLT-Corpus: A Large-Scale Text Collection for the Distributed Ledger Technology Domain

📅 2026-02-25
🤖 AI Summary
This study addresses the scarcity of large-scale, multi-source domain-specific text corpora in distributed ledger technology (DLT), which has hindered research on language understanding and technological evolution. To bridge this gap, we construct DLT-Corpus—the largest DLT corpus to date—comprising 22.12 million documents and 2.98 billion tokens, systematically integrating scientific literature, U.S. patents, and social media data. We further propose LedgerBERT, a domain-adaptive pre-trained language model tailored for DLT. Leveraging this resource, we conduct cross-modal association analyses that uncover the propagation pathway of innovations from scientific research to patents and then to social media, revealing that research and patenting activities operate independently of market fluctuations and drive long-term innovation. LedgerBERT achieves a 23% improvement over BERT-base on DLT-specific named entity recognition tasks. All data, models, and code are publicly released.


📝 Abstract
We introduce DLT-Corpus, the largest domain-specific text collection for Distributed Ledger Technology (DLT) research to date: 2.98 billion tokens from 22.12 million documents spanning scientific literature (37,440 publications), United States Patent and Trademark Office (USPTO) patents (49,023 filings), and social media (22 million posts). Existing Natural Language Processing (NLP) resources for DLT focus narrowly on cryptocurrency price prediction and smart contracts, leaving domain-specific language underexplored despite the sector's ~$3 trillion market capitalization and rapid technological evolution. We demonstrate DLT-Corpus's utility by analyzing technology emergence patterns and market-innovation correlations. Findings reveal that technologies originate in scientific literature before reaching patents and social media, following traditional technology transfer patterns. While social media sentiment remains overwhelmingly bullish even during crypto winters, scientific and patent activity grow independently of market fluctuations, tracking overall market expansion in a virtuous cycle where research precedes and enables economic growth that funds further innovation. We publicly release the full DLT-Corpus; LedgerBERT, a domain-adapted model achieving a 23% improvement over BERT-base on a DLT-specific Named Entity Recognition (NER) task; and all associated tools and code.
Problem

Research questions and friction points this paper is trying to address.

Distributed Ledger Technology · Natural Language Processing · domain-specific language · text corpus · DLT · Innovation

Contribution

Methods, ideas, or system contributions that make the work stand out.

DLT-Corpus · LedgerBERT · domain-adapted language model · Named Entity Recognition · distributed ledger technology