Snorkeling in dark waters: A longitudinal surface exploration of unique Tor Hidden Services (Extended Version)

📅 2025-04-23
🤖 AI Summary
This study addresses the limited empirical understanding of the scale and content distribution of Tor hidden services (.onion). The authors run a large-scale measurement with their crawler, Mimir, and build a dataset of pages from more than 25,000 hidden services. The methodology combines the crawler with a set of heuristics for detecting mirrored content, a custom machine learning content classifier, and an analysis of the network's topology, including its depth and its reachability from the surface web. The heuristics indicate that approximately 82% of the analyzed content is a replica of other content, and the authors argue that earlier large-scale Tor measurements, which did not account for mirrors, give a biased picture of both the topology and the distribution of content. The study also shows that the choice of initial seeds strongly affects which topics a crawl covers and how efficiently it covers them.

📝 Abstract
The Onion Router (Tor) is a controversial network whose utility is constantly under scrutiny. On the one hand, it allows for anonymous interaction and cooperation of users seeking untraceable navigation on the Internet. On the other hand, this freedom also attracts criminals who aim to thwart law enforcement investigations, e.g., by trading illegal products or services such as drugs or weapons. Tor allows delivering content without revealing the actual hosting address, by means of .onion (or hidden) services. Unlike regular domains, these services cannot be resolved by traditional name services, are not indexed by regular search engines, and change frequently. This generates uncertainty about the extent and size of the Tor network and the type of content offered. In this work, we present a large-scale analysis of the Tor Network. We leverage our crawler, dubbed Mimir, which automatically collects and visits the links found within pages, to build a dataset of pages from more than 25k sites. We analyze the topology of the Tor Network, including its depth and reachability from the surface web. We define a set of heuristics to detect the presence of replicated content (mirrors) and show that most of the analyzed content in the Dark Web (approximately 82%) is a replica of other content. We also train a custom Machine Learning classifier to understand the type of content the hidden services offer. Overall, our study provides new insights into the Tor network, highlighting the importance of initial seeding for focusing on specific topics and optimizing the crawling process. We show that previous work on large-scale Tor measurements does not consider the presence of mirrors, which biases its understanding of the Dark Web topology and the distribution of content.
Problem

Research questions and friction points this paper is trying to address.

Analyzing Tor network topology and content distribution
Detecting replicated content in Dark Web services
Classifying types of content offered by hidden services
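To make the topology question concrete, the sketch below shows one common way to measure depth and reachability on a crawled link graph: a breadth-first search from the seed sites. It is an illustration only, not the paper's implementation; the function name `crawl_depths`, the `links` mapping, and the example addresses are all hypothetical.

```python
from collections import deque


def crawl_depths(links, seeds):
    """Return the minimum link depth of every site reachable from the seeds.

    `links` maps a site to the sites it links to (a hypothetical crawl result).
    """
    depths = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        site = queue.popleft()
        for target in links.get(site, ()):
            if target not in depths:  # first visit is the shortest path in BFS
                depths[target] = depths[site] + 1
                queue.append(target)
    return depths


# Toy link graph with made-up site names.
links = {
    "seed.onion": ["market.onion", "forum.onion"],
    "market.onion": ["vendor.onion"],
    "forum.onion": ["market.onion"],
    "isolated.onion": ["seed.onion"],
}
depths = crawl_depths(links, ["seed.onion"])
# Sites that appear in the graph but cannot be reached from the chosen seeds.
unreachable = set(links) - set(depths)
```

In this toy graph `isolated.onion` links to the seed but is never linked to, so a crawl that starts from `seed.onion` never finds it. This is the kind of effect that makes the initial seed choice matter for coverage.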
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mimir crawler collects Tor hidden services data
Heuristics detect replicated Dark Web content
Machine Learning classifies hidden services content
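The summary does not describe the paper's mirror-detection heuristics, so the following is only a minimal sketch of one plausible heuristic: two sites are treated as mirrors when their page text is identical after normalization. The normalization steps, the regular expression, and the names `fingerprint` and `group_mirrors` are assumptions made for illustration.

```python
import hashlib
import re
from collections import defaultdict

# Assumed pattern for onion addresses (16 to 56 base32 characters).
ONION_RE = re.compile(r"[a-z2-7]{16,56}\.onion")


def fingerprint(text):
    """Hash page text after masking onion addresses, case, and whitespace."""
    normalized = ONION_RE.sub("<onion>", text.lower())
    normalized = " ".join(normalized.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def group_mirrors(pages):
    """Group site addresses whose normalized content is identical."""
    groups = defaultdict(list)
    for address, text in pages.items():
        groups[fingerprint(text)].append(address)
    # Only groups with at least two sites count as mirrors.
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Made-up addresses and page text for demonstration.
addr_a = "a" * 56 + ".onion"
addr_b = "b" * 56 + ".onion"
addr_c = "c" * 56 + ".onion"
pages = {
    addr_a: f"Welcome to the market at {addr_a}",
    addr_b: f"Welcome  to the MARKET at {addr_b}",
    addr_c: "A different forum page",
}
mirrors = group_mirrors(pages)
```

Masking the addresses matters because a mirror often embeds its own address in the page, which would otherwise make two copies hash differently. An exact-match fingerprint misses near-duplicates, which is presumably one reason a real system would combine several heuristics.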
Alfonso Rodriguez Barredo-Valenzuela
IMDEA Networks Institute, Universidad Carlos III de Madrid
Sergio Pastrana Portillo
Universidad Carlos III de Madrid
Guillermo Suarez-Tangil
Assistant Professor, IMDEA Networks Institute
Systems Security, Malware Analysis, Cyber Crime, Fraud