🤖 AI Summary
Existing research on dark web content relies predominantly on static snapshots, which fail to capture how cybercriminal topics evolve over time. This work proposes a longitudinal topic-modeling framework tailored to the dark web, combining domain-specific embeddings, density-based clustering, and temporal aggregation to track topics at the website level across 11,403,638 HTML snapshots of 25,065 websites collected over six years. The approach identifies 55 thematic clusters and finds that approximately 75% of total discussion volume is concentrated in a small set of persistent core topics, while short-lived themes account for approximately 3% of activity. With a median topic lifespan of 75 months, the findings indicate that dark web content evolves gradually rather than through abrupt replacement of themes.
📝 Abstract
The dark web hosts a dynamic ecosystem of cybercrime forums and marketplaces that adapt to law enforcement pressure, technological change, and economic incentives. Prior research has extracted cyber threat intelligence from these platforms using static snapshots, with limited attention to how discussions evolve over time. In this study, we conduct a longitudinal analysis of 25,065 websites on the dark web using 11,403,638 HTML snapshots (approximately 1245.38 GB) collected over six years. We develop a longitudinal topic-modeling framework that combines domain-specific embeddings, density-based clustering, and temporal aggregation to measure topic prevalence and lifecycle at the website level. Our analysis identifies 55 thematic clusters. We find that approximately 75% of total discussion volume is concentrated in a small set of persistent core topics, while short-lived themes account for approximately 3% of activity. The median topic lifespan is 75 months, indicating gradual thematic evolution rather than abrupt replacement.
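To make the temporal-aggregation step concrete, the following is a minimal sketch of how topic prevalence and lifespan could be computed once each snapshot has already been assigned a topic label by the embedding and clustering stages. It is not the authors' implementation: the record layout `(site, "YYYY-MM", topic_id)`, the function names, and the definition of lifespan as the inclusive span between a topic's first and last observed month are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median


def month_index(ym: str) -> int:
    """Map a 'YYYY-MM' string to a running month count (assumed date format)."""
    year, month = ym.split("-")
    return int(year) * 12 + int(month) - 1


def topic_stats(records):
    """Aggregate (site, 'YYYY-MM', topic_id) records into per-topic statistics.

    Returns {topic_id: {"share": fraction of all records,
                        "lifespan_months": inclusive first-to-last month span}}.
    Both definitions are illustrative assumptions, not taken from the paper.
    """
    counts = defaultdict(int)
    months = defaultdict(set)
    total = 0
    for _site, ym, topic in records:
        counts[topic] += 1
        months[topic].add(month_index(ym))
        total += 1

    stats = {}
    for topic, n in counts.items():
        first, last = min(months[topic]), max(months[topic])
        stats[topic] = {
            "share": n / total,
            "lifespan_months": last - first + 1,
        }
    return stats


def median_lifespan(stats):
    """Median lifespan in months across all topics."""
    return median(s["lifespan_months"] for s in stats.values())


def core_share(stats, min_lifespan):
    """Share of total volume held by topics living at least min_lifespan months."""
    return sum(
        s["share"] for s in stats.values() if s["lifespan_months"] >= min_lifespan
    )


# Toy, made-up records purely to exercise the functions.
records = [
    ("a.onion", "2020-01", 0),
    ("a.onion", "2020-02", 0),
    ("b.onion", "2020-03", 0),
    ("b.onion", "2020-02", 1),
]
stats = topic_stats(records)
```

On real data, the `min_lifespan` threshold separating "persistent core" from "short-lived" topics would have to follow whatever definition the study uses; the abstract does not state it, so the threshold is left as a parameter here.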