An LLM-enabled semantic-centric framework to consume privacy policies

📅 2025-09-01
🤖 AI Summary
Privacy policies are notoriously opaque, so users routinely ignore them, which impedes user-centric Web data sharing and regulatory compliance. To address this, we propose Pr²Graph, a large-scale, semantics-centered knowledge graph designed for privacy policies. Using large language models, the pipeline extracts information about privacy practices and annotates it with terms from the Data Privacy Vocabulary (DPV), automatically transforming unstructured privacy texts into standardized, machine-interpretable knowledge graphs. The graph also supports automated generation of formal policy representations, including ODRL and psDToU. We publicly release the Pr²Graph for the top-100 websites and an enhanced Policy-IE dataset annotated by legal experts, and we benchmark several large language models on the pipeline. Pr²Graph is intended as a scalable technical foundation for web-scale privacy compliance auditing and controllable data sharing in intelligent agent environments.

📝 Abstract
People today hold numerous online accounts, yet they rarely read the Terms of Service or Privacy Policy of those sites, despite claiming otherwise, because the documents are difficult to comprehend in practice. This opacity around data privacy practices forms a major barrier to user-centred Web approaches and to data sharing and reuse in an agentic world. Existing research has proposed using formal languages and reasoning to verify the compliance of a specified policy, as a potential remedy for ignored privacy policies. However, a critical gap remains in creating or acquiring such formal policies at scale. We present a semantic-centric approach that uses state-of-the-art large language models (LLMs) to automatically identify key information about privacy practices from privacy policies and to construct $\mathit{Pr}^2\mathit{Graph}$, a knowledge graph grounded in the Data Privacy Vocabulary (DPV) for privacy practices, to support downstream tasks. Along with the pipeline, we release as a public resource the $\mathit{Pr}^2\mathit{Graph}$ produced by running it on the top-100 popular websites. We also demonstrate how $\mathit{Pr}^2\mathit{Graph}$ can support downstream tasks by constructing formal policy representations such as Open Digital Rights Language (ODRL) or perennial semantic Data Terms of Use (psDToU). To evaluate the technology's capability, we enriched the Policy-IE dataset by employing legal experts to create custom annotations. We benchmarked the performance of different large language models in our pipeline and verified their capabilities. Overall, the results shed light on the possibility of large-scale analysis of online services' privacy practices, a promising direction for auditing the Web and the Internet. We release all datasets and source code as public resources to facilitate reuse and improvement.
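To make the idea of a DPV-grounded knowledge graph concrete, the following is a minimal sketch of how one extracted privacy practice might be stored as subject-predicate-object triples. It is not the paper's schema: the `PracticeRecord` class, the `to_triples` helper, and the `https://example.org/pr2graph/` namespace are hypothetical, and the DPV terms shown (`hasPersonalData`, `hasPurpose`, `hasProcessing`) are chosen only for illustration.

```python
from dataclasses import dataclass

# DPV namespaces; the EX namespace is a hypothetical placeholder.
DPV = "https://w3id.org/dpv#"
PD = "https://w3id.org/dpv/pd#"
EX = "https://example.org/pr2graph/"


@dataclass(frozen=True)
class PracticeRecord:
    """One privacy practice as an LLM might extract it (assumed shape)."""
    practice_id: str    # local identifier for the practice node
    data_category: str  # local name of a DPV personal data concept
    purpose: str        # local name of a DPV purpose concept
    processing: str     # local name of a DPV processing concept


def to_triples(record: PracticeRecord) -> list[tuple[str, str, str]]:
    """Ground a record in DPV IRIs and return it as RDF-style triples."""
    subject = EX + record.practice_id
    return [
        (subject, DPV + "hasPersonalData", PD + record.data_category),
        (subject, DPV + "hasPurpose", DPV + record.purpose),
        (subject, DPV + "hasProcessing", DPV + record.processing),
    ]


if __name__ == "__main__":
    record = PracticeRecord("example-site/p1", "EmailAddress", "Marketing", "Collect")
    for triple in to_triples(record):
        print(triple)
```

Using full IRIs rather than free-text labels is what makes practices from different policies comparable: two sites that describe email collection in different words would point at the same vocabulary term.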
Problem

Research questions and friction points this paper is trying to address.

Automating extraction of privacy practices from policies using LLMs
Creating scalable knowledge graphs for privacy policy analysis
Supporting formal policy representations for compliance verification
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLMs extract key privacy information from policies
Constructs knowledge graph using Data Privacy Vocabulary
Generates formal policy representations like ODRL
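
The last contribution, generating a formal policy from graph content, can be sketched as follows. This is an illustrative assumption about what such a mapping could look like, not the mapping used in the paper: the function `to_odrl_policy` and the example IRIs under `https://example.org/` are hypothetical, while the JSON-LD context, the `use` action, and the `purpose` left operand come from the ODRL vocabulary.

```python
import json


def to_odrl_policy(policy_uid: str, target: str, purpose_iri: str) -> dict:
    """Build a minimal ODRL Set policy permitting use of `target`
    only for the given purpose (hypothetical mapping)."""
    return {
        "@context": "http://www.w3.org/ns/odrl.jsonld",
        "@type": "Set",
        "uid": policy_uid,
        "permission": [
            {
                "target": target,
                "action": "use",
                "constraint": [
                    {
                        "leftOperand": "purpose",
                        "operator": "eq",
                        "rightOperand": {"@id": purpose_iri},
                    }
                ],
            }
        ],
    }


if __name__ == "__main__":
    policy = to_odrl_policy(
        "https://example.org/policy/1",          # hypothetical policy IRI
        "https://w3id.org/dpv/pd#EmailAddress",  # data category as the target
        "https://w3id.org/dpv#Marketing",        # DPV purpose as the constraint
    )
    print(json.dumps(policy, indent=2))
```

Because the purpose constraint reuses the same DPV IRI that annotates the graph node, a downstream reasoner could compare the generated policy against a user's or regulator's requirements without re-reading the original text.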
Authors

Rui Zhao
University of Oxford, Oxford, UK
Vladyslav Melnychuk
University of Oxford, Oxford, UK
Jun Zhao
University of Oxford, Oxford, UK
Jesse Wright
University of Oxford
Semantic Web, Reasoning, Typed Programming
Nigel Shadbolt
University of Oxford, Oxford, UK