Prink: $k_s$-Anonymization for Streaming Data in Apache Flink

📅 2025-05-19

🤖 AI Summary

Problem: Existing $k_s$-anonymization techniques for real-time streaming data lack effective support for non-numerical attributes, particularly categorical and hierarchical data, which leads to coarse-grained generalization and high information loss. Method: This paper proposes a semantics-aware streaming $k_s$-anonymization framework that extends CASTLE with lightweight, Flink-native operators which dynamically maintain equivalence classes and apply hierarchy-aware generalization. Unlike prior approaches, it outputs discrete anonymized records rather than aggregated results, enabling fine-grained, low-distortion anonymization of categorical and hierarchical attributes. Contribution/Results: The method guarantees strict $k_s$-anonymity while achieving information loss under 8.2% and throughput degradation below 15%. It is designed for production use and can be integrated into enterprise-grade stream processing systems (e.g., Apache Flink) as a plug-and-play module. To the best of the authors' knowledge, this is the first efficient, deployable solution for privacy-preserving stream processing over non-numerical data.

📝 Abstract

In this paper, we present Prink, a novel and practically applicable concept and fully implemented prototype for $k_s$-anonymizing data streams in real-world application architectures. Building upon the pre-existing, yet rudimentary CASTLE scheme, Prink for the first time introduces semantics-aware $k_s$-anonymization of non-numerical (such as categorical or hierarchically generalizable) streaming data in an information loss-optimized manner. In addition, it provides native integration into Apache Flink, one of the prevailing frameworks for enterprise-grade stream data processing in numerous application domains. Our contributions advance the previously established state of the art for privacy-guarantee-providing anonymization of streaming data in that they 1) allow non-numerical data to be included in the anonymization process, 2) provide discrete datapoints instead of aggregates, thereby facilitating flexible data use, 3) are applicable in real-world system contexts with minimal integration effort, and 4) are experimentally shown to incur acceptable performance overhead and information loss in realistic settings. With these characteristics, Prink provides an anonymization approach that is practically feasible for a broad variety of real-world, enterprise-grade stream processing applications and environments.
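To make the idea of releasing "discrete datapoints instead of aggregates" concrete, the following is a deliberately simplified Python sketch. It is not the Prink or CASTLE algorithm: it omits clustering and any delay bound, and the class `KBuffer`, its fields, and the record layout (`person_id`, `age`, `payload`) are hypothetical names chosen for illustration. It only shows the core release rule: records are held back until at least k distinct individuals are present, then each record is emitted individually with a shared, generalized quasi-identifier.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class KBuffer:
    """Toy buffer: releases records once k distinct individuals are pending."""
    k: int
    pending: List[Tuple[str, int, str]] = field(default_factory=list)

    def add(self, person_id: str, age: int, payload: str) -> List[Dict[str, object]]:
        """Buffer one record; return the released records (possibly none)."""
        self.pending.append((person_id, age, payload))
        distinct_people = {pid for pid, _, _ in self.pending}
        if len(distinct_people) < self.k:
            return []  # not enough individuals yet, keep waiting
        # Generalize the quasi-identifier (age) to a range shared by all records.
        ages = [a for _, a, _ in self.pending]
        age_range = (min(ages), max(ages))
        # Emit one output record per input record; the identifier is dropped.
        released = [{"age": age_range, "payload": p} for _, _, p in self.pending]
        self.pending.clear()
        return released
```

With `k=2`, two records from the same individual are not released on their own; a record from a second individual triggers the release of all three, each carrying the same generalized age range but its own payload.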

Problem

Research questions and friction points this paper is trying to address.

Real-time $k_s$-anonymization for streaming data

Semantics-aware anonymization of non-numerical data

Native integration with Apache Flink framework

Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantics-aware $k_s$-anonymization for non-numerical data

Native integration into Apache Flink framework

Optimized information loss with discrete datapoints
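
As an illustration of hierarchy-aware generalization of a categorical attribute, here is a minimal Python sketch. It assumes a small, hypothetical generalization hierarchy (`PARENT`) and uses a leaf-count-based loss measure of the kind commonly used for categorical attributes (leaves under the chosen node minus one, divided by all leaves minus one). The listing does not state which measure Prink uses, so the formula and all names here are assumptions for illustration only.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical generalization hierarchy (child -> parent); not from the paper.
PARENT: Dict[str, Optional[str]] = {
    "Any": None,
    "Europe": "Any",
    "Asia": "Any",
    "Germany": "Europe",
    "France": "Europe",
    "Japan": "Asia",
}


def path_to_root(value: str) -> List[str]:
    """Return the nodes from `value` up to the root, inclusive."""
    path = [value]
    parent = PARENT[value]
    while parent is not None:
        path.append(parent)
        parent = PARENT[parent]
    return path


def lowest_common_ancestor(values: List[str]) -> str:
    """Most specific node that covers every value in `values`."""
    others = [set(path_to_root(v)) for v in values[1:]]
    for node in path_to_root(values[0]):
        if all(node in ancestors for ancestors in others):
            return node
    raise ValueError("values do not share a common root")


def leaf_count(node: str) -> int:
    """Number of leaves in the subtree rooted at `node`."""
    children = [c for c, p in PARENT.items() if p == node]
    return 1 if not children else sum(leaf_count(c) for c in children)


def information_loss(node: str, root: str = "Any") -> float:
    """Assumed measure: 0.0 for a leaf, 1.0 for the root."""
    return (leaf_count(node) - 1) / (leaf_count(root) - 1)


def generalize(values: List[str]) -> Tuple[List[str], float]:
    """Replace every value by the group's common ancestor and report the loss."""
    label = lowest_common_ancestor(values)
    return [label] * len(values), information_loss(label)
```

Grouping "Germany" with "France" generalizes both to "Europe", whereas grouping "Germany" with "Japan" forces the root label "Any"; a semantics-aware scheme would prefer the first grouping because it loses less information.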
