The Dark Side of the Web: Towards Understanding Various Data Sources in Cyber Threat Intelligence

📅 2025-04-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Extracting cyber threat intelligence (CTI) from "first-hand" cybercriminal data sources, such as underground forums, chat channels, and darknet websites, is hindered by data that is difficult to access or scrape and by the fact that only part of the content is CTI-relevant. Method: The authors analyze roughly 10.1 million records from three platform types (6.6 million posts, 3.4 million messages, and 120,000 darknet websites). They combine NLP tools to filter CTI-relevant content, to separate technical from strategic intelligence, and to describe the topics discussed on each platform. Contribution/Results: According to the filtering, 20% of the sample is CTI-relevant, and most of that content is strategic rather than technical. Credit card-related crime is the most prevalent topic on darknet websites, while account and subscription selling is discussed most on underground forums and chat channels. Topic diversity is higher on forums and chat channels than on darknet websites, which suggests that different platforms may be used for activities of differing complexity and risk.

📝 Abstract

Cyber threats have become increasingly prevalent and sophisticated. Prior work has extracted actionable cyber threat intelligence (CTI), such as indicators of compromise, tactics, techniques, and procedures (TTPs), or threat feeds from various sources: open source data (e.g., social networks), internal intelligence (e.g., log data), and "first-hand" communications from cybercriminals (e.g., underground forums, chats, darknet websites). However, "first-hand" data sources remain underutilized because it is difficult to access or scrape their data. In this work, we analyze (i) 6.6 million posts, (ii) 3.4 million messages, and (iii) 120,000 darknet websites. We combine NLP tools to address several challenges in analyzing such data. First, even on dedicated platforms, only some content is CTI-relevant, requiring effective filtering. Second, "first-hand" data can be CTI-relevant from a technical or strategic viewpoint. We demonstrate how to organize content along this distinction. Third, we describe the topics discussed and how "first-hand" data sources differ from each other. According to our filtering, 20% of our sample is CTI-relevant. Most of the CTI-relevant data focuses on strategic rather than technical discussions. Credit card-related crime is the most prevalent topic on darknet websites. On underground forums and chat channels, account and subscription selling is discussed most. Topic diversity is higher on underground forums and chat channels than on darknet websites. Our analyses suggest that different platforms may be used for activities with varying complexity and risks for criminals.
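
The abstract compares topic diversity across platforms but does not state here how diversity is measured. As a minimal sketch, assuming diversity is taken as the Shannon entropy of the topic labels assigned to a platform's records, the comparison could look like the following. The function name `shannon_diversity` and the toy labels are illustrative assumptions, not the authors' metric or data.

```python
import math
from collections import Counter


def shannon_diversity(topic_labels):
    """Return the Shannon entropy (in bits) of a sequence of topic labels.

    Assumption: one topic label per record. Higher values mean the
    records are spread more evenly over more topics.
    """
    counts = Counter(topic_labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Toy labels for illustration only; these are not figures from the paper.
concentrated = ["cards"] * 8 + ["accounts"] * 2
spread = ["cards", "accounts", "malware", "hosting"] * 3

# A platform dominated by one topic scores lower than an evenly spread one.
print(shannon_diversity(concentrated) < shannon_diversity(spread))
```

Entropy is only one possible choice; a simple count of distinct topics or another evenness index would serve the same comparison under different assumptions.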

Problem

Research questions and friction points this paper is trying to address.

Analyzing underutilized first-hand cybercriminal data sources

Filtering CTI-relevant content from noisy underground platforms

Comparing topics and topic diversity across underground forums, chat channels, and darknet websites

Innovation

Methods, ideas, or system contributions that make the work stand out.

Analyzing 6.6M posts, 3.4M messages, 120K darknet websites

Combining NLP tools for CTI-relevant content filtering

Organizing content by technical vs. strategic viewpoints
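
The two steps listed above, filtering for CTI relevance and then organizing by viewpoint, can be pictured with a deliberately simple keyword-based sketch. The summary only says the authors combine NLP tools, so everything here is an assumption for illustration: the keyword sets `CTI_TERMS` and `TECHNICAL_TERMS`, the function names, and the rule that any technical keyword makes a record "technical".

```python
import re

# Hypothetical keyword lists; the paper's actual filtering is not described here.
CTI_TERMS = {"exploit", "malware", "cvv", "dump", "account", "phishing", "botnet"}
TECHNICAL_TERMS = {"exploit", "malware", "botnet", "payload", "cve"}


def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def is_cti_relevant(text):
    """Stage 1: keep a record only if it mentions any CTI keyword."""
    return bool(tokenize(text) & CTI_TERMS)


def viewpoint(text):
    """Stage 2: label a record 'technical' or 'strategic' by keyword match."""
    return "technical" if tokenize(text) & TECHNICAL_TERMS else "strategic"


def triage(records):
    """Filter records for relevance, then attach a viewpoint label to each."""
    return [(r, viewpoint(r)) for r in records if is_cti_relevant(r)]


sample = [
    "Selling Netflix account cheap",
    "hello everyone",
    "New exploit payload for sale",
]
for record, label in triage(sample):
    print(label, "|", record)
```

A trained classifier would normally replace both keyword checks; the sketch only shows how the relevance filter and the technical/strategic split compose.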

🔎 Similar Papers

CTISum: A New Benchmark Dataset For Cyber Threat Intelligence Summarization