CTI Dataset Construction from Telegram

📅 2025-09-25
🤖 AI Summary
To address the scarcity, high noise level, and labeling difficulty of Cyber Threat Intelligence (CTI) data on Telegram, this paper proposes an end-to-end automated pipeline. First, domain knowledge is used to identify 150 candidate threat channels; after expert curation, 12 high-quality sources are retained, yielding 145,349 messages. Second, a BERT-based binary classifier filters threat-related content with 96.64% accuracy, and regex-based extraction followed by expert validation derives structured Indicators of Compromise (IoCs). The result is a large-scale, high-fidelity Telegram CTI dataset of 86,509 malicious IoCs (domains, IPs, URLs, hashes, and CVEs) spanning ransomware, phishing, and malicious infrastructure threats. The dataset is intended to support CTI model development, evaluation, and benchmarking, and to provide a reproducible and scalable foundation for open-source threat intelligence research.
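The regex-based extraction step mentioned above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the patterns, the `extract_iocs` function, and the `_is_valid_ipv4` helper are all assumptions, since this summary does not give the paper's actual expressions. Regexes of this kind produce false positives (for example, file names that look like domains), which is presumably one reason the paper pairs extraction with expert validation.

```python
import ipaddress
import re

# Illustrative patterns only (assumed, not taken from the paper).
PATTERNS = {
    "url": re.compile(r"\bhttps?://[^\s<>\"']+", re.IGNORECASE),
    "ip": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    # SHA-256, SHA-1, or MD5 written as hex digits
    "hash": re.compile(r"\b(?:[a-f0-9]{64}|[a-f0-9]{40}|[a-f0-9]{32})\b", re.IGNORECASE),
    "cve": re.compile(r"\bCVE-\d{4}-\d{4,}\b", re.IGNORECASE),
    "domain": re.compile(
        r"\b(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,24}\b", re.IGNORECASE
    ),
}


def _is_valid_ipv4(candidate: str) -> bool:
    """Reject dotted quads with out-of-range octets such as 999.1.1.1."""
    try:
        ipaddress.IPv4Address(candidate)
    except ValueError:
        return False
    return True


def extract_iocs(message: str) -> dict[str, set[str]]:
    """Return candidate IoCs found in one message, grouped by type."""
    found: dict[str, set[str]] = {kind: set() for kind in PATTERNS}

    # Collect URLs first, trimming trailing punctuation picked up from prose.
    for match in PATTERNS["url"].finditer(message):
        found["url"].add(match.group().rstrip(".,;:)]"))

    # Blank out URLs so their hosts are not counted again as domains or IPs.
    remainder = PATTERNS["url"].sub(" ", message)

    for kind in ("ip", "hash", "cve", "domain"):
        for match in PATTERNS[kind].finditer(remainder):
            value = match.group()
            if kind == "ip" and not _is_valid_ipv4(value):
                continue
            # Normalize case so duplicates collapse in the set.
            found[kind].add(value.upper() if kind == "cve" else value.lower())
    return found
```

Returning sets per type makes deduplication within a message automatic; merging the sets across all filtered messages would give per-type IoC counts comparable in spirit to the totals reported above.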

📝 Abstract
Cyber Threat Intelligence (CTI) enables organizations to anticipate, detect, and mitigate evolving cyber threats. Its effectiveness depends on high-quality datasets, which support model development, training, evaluation, and benchmarking. Building such datasets is crucial, as attack vectors and adversary tactics continually evolve. Recently, Telegram has gained prominence as a valuable CTI source, offering timely and diverse threat-related information that can help meet this need. In this work, we present an end-to-end automated pipeline that systematically collects and filters threat-related content from Telegram. The pipeline identifies 150 candidate Telegram channels, of which 12 are retained after curation, and scrapes 145,349 messages from the curated channels. To separate threat intelligence messages from generic content, we employ a BERT-based classifier that achieves an accuracy of 96.64%. From the filtered messages, we compile a dataset of 86,509 malicious Indicators of Compromise, including domains, IPs, URLs, hashes, and CVEs. This approach not only produces a large-scale, high-fidelity CTI dataset but also establishes a foundation for future research and operational applications in cyber threat detection.
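To make the filtering stage and its accuracy figure concrete, the sketch below treats the classifier as any function that maps a message to a boolean, and computes accuracy as the share of labeled messages classified correctly. Everything here is an assumption for illustration: `keyword_stub` is a hypothetical stand-in for the paper's BERT-based model (whose training setup and test set are not described in this abstract), and `filter_threat_messages` and `accuracy` are names invented for the example.

```python
from typing import Callable, Iterable, List, Tuple

# A classifier maps one message to True (threat-related) or False (generic).
Classifier = Callable[[str], bool]


def keyword_stub(message: str) -> bool:
    """Hypothetical stand-in for the BERT-based classifier (not the paper's model)."""
    keywords = ("ransomware", "phishing", "malware", "cve-")
    text = message.lower()
    return any(word in text for word in keywords)


def filter_threat_messages(messages: Iterable[str], classify: Classifier) -> List[str]:
    """Keep only the messages the classifier flags as threat-related."""
    return [message for message in messages if classify(message)]


def accuracy(labeled: Iterable[Tuple[str, bool]], classify: Classifier) -> float:
    """Fraction of (message, label) pairs the classifier gets right."""
    pairs = list(labeled)
    if not pairs:
        raise ValueError("accuracy is undefined for an empty evaluation set")
    correct = sum(1 for message, label in pairs if classify(message) == label)
    return correct / len(pairs)
```

Keeping the classifier behind a plain callable means a fine-tuned transformer model could replace the stub without changing the filtering or evaluation code.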

Problem

Research questions and friction points this paper is trying to address.

Automated collection of threat intelligence from Telegram channels
Filtering relevant CTI content from generic messages using AI
Building high-quality cyber threat datasets for detection research

Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated pipeline collects Telegram threat data
BERT classifier filters messages with 96.64% accuracy
Extracts 86,509 malicious Indicators of Compromise

Authors

Dincy R. Arikkat
Department of Computer Applications, Cochin University of Science and Technology, India

Sneha B. T.
Department of Computer Applications, Cochin University of Science and Technology, India

Serena Nicolazzo
Università del Piemonte Orientale
Security, Privacy, IoT, Cyber Threat Intelligence

Antonino Nocera
Associate Professor, University of Pavia
Artificial Intelligence, Security, Privacy, Data Science

Vinod P.
Department of Computer Applications, Cochin University of Science and Technology, India

Rafidha Rehiman K. A.
Department of Computer Applications, Cochin University of Science and Technology, India

Karthika R.
Department of Computer Applications, Cochin University of Science and Technology, India