🤖 AI Summary
Existing web annotation tools are limited to text snippets and lack fine-grained, multi-label classification capabilities at the webpage level. To address this, we propose the first systematic, webpage-level topic annotation framework. Our method integrates HTML content parsing (via BeautifulSoup/Lxml) with URL semantic feature engineering, supporting dual-source (real-time/offline) page loading, multi-topic co-occurrence labeling, and dynamic label configuration, all via a lightweight Python-based GUI. Key contributions include: (1) the first structured annotation paradigm operating at the full-page granularity; (2) joint exploitation of content and URL features for enhanced classification accuracy; and (3) a scalable pipeline capable of processing millions of webpages. Empirical evaluation shows a 3.2× improvement in annotation throughput. The framework is open-sourced and has been adopted in five studies on web privacy and user behavior analysis.
📝 Abstract
Tag-Pag is an application designed to simplify the categorization of web pages, a task increasingly common for researchers who scrape web pages to analyze individuals' browsing patterns or train machine learning classifiers. Unlike existing tools that focus on annotating sections of text, Tag-Pag systematizes page-level annotations, allowing users to determine whether an entire document relates to one or multiple predefined topics. Tag-Pag offers an intuitive interface to configure the input web pages and annotation labels. It integrates libraries to extract content from the HTML and URL indicators to aid the annotation process. It provides direct access to both scraped and live versions of the web page. Our tool is designed to expedite the annotation process with features like quick navigation, label assignment, and export functionality, making it a versatile and efficient tool for various research applications. Tag-Pag is available at https://github.com/Pantonius/TagPag.