AI Summary
To address fine-grained, purpose-driven access control in large-scale data warehouses, this paper proposes a semantic-aware dynamic data masking framework. It automatically constructs functional SQL views based on data semantics and access purposes, enabling transparent, runtime purpose-aware query routing. The framework introduces the first sub-field-level masking mechanism supporting complex types (e.g., STRUCT, ARRAY, MAP), overcoming traditional row- and column-level limitations. Purpose semantics are explicitly encoded into access policies to jointly satisfy regulatory compliance (e.g., GDPR) and data utility. Key technical contributions include semantic policy modeling, automated view generation, a multi-level masking engine, and purpose-aware query optimization. Experiments demonstrate that sub-field masking achieves three orders-of-magnitude higher precision, end-to-end latency remains bounded, legacy applications require zero modification for compatibility, and policy deployment efficiency improves by 90%.
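To make the idea of sub-field masking concrete, here is a minimal Python sketch. It is not the paper's engine: the function name `mask_path`, the path syntax (a list of keys, with `"*"` meaning "every array element"), and the choice of `None` as the masked value are all assumptions made for illustration. STRUCT and MAP values are modeled as dicts and ARRAY values as lists.

```python
MASK = None  # assumed masked value; a real system might redact or hash instead


def mask_path(value, path):
    """Return a copy of `value` with the sub-field addressed by `path` masked.

    `path` is a list of steps (hypothetical syntax): a string selects a dict
    key (a STRUCT field or MAP key); "*" selects every element of a list (ARRAY).
    """
    if not path:
        return MASK  # reached the targeted sub-field
    step, rest = path[0], path[1:]
    if step == "*" and isinstance(value, list):
        return [mask_path(item, rest) for item in value]
    if isinstance(value, dict) and step in value:
        out = dict(value)  # shallow copy so the input row is not mutated
        out[step] = mask_path(value[step], rest)
        return out
    return value  # path does not apply here; leave the value unchanged


row = {
    "user": {"name": "Ada", "emails": [{"addr": "a@x.org", "verified": True}]},
    "attrs": {"ip": "10.0.0.1", "lang": "en"},
}
masked = mask_path(row, ["user", "emails", "*", "addr"])
```

Only the `addr` field inside each array element is masked; the sibling `verified` field and the rest of the row remain usable, which is the utility argument the summary makes for masking below the column level.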
Abstract
The last few years have witnessed a spate of data protection regulations in conjunction with an ever-growing appetite for data usage in large businesses, thus presenting significant challenges for businesses to maintain compliance. To address this conflict, we present Data Guard, a fine-grained, purpose-based access control system for large data warehouses. Data Guard enables authoring policies based on semantic descriptions of data and purpose of data access. Data Guard then translates these policies into SQL views that mask data from the underlying warehouse tables. At access time, Data Guard ensures compliance by transparently routing each table access to the appropriate data-masking view based on the purpose of the access, thus minimizing the effort of adopting Data Guard in existing applications. Our enforcement solution allows masking data at much finer granularities than what traditional solutions allow. In addition to row and column level data masking, Data Guard can mask data at the sub-cell level for columns with non-atomic data types such as structs, arrays, and maps. This fine-grained masking allows Data Guard to preserve data utility for consumers while ensuring compliance. We implemented a number of performance optimizations to minimize the overhead of data masking operations. We perform numerous experiments to identify the key factors that influence the data masking overhead and demonstrate the efficiency of our implementation.
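The policy-to-view translation and purpose-based routing described in the abstract can be sketched roughly as follows. This is an illustrative approximation, not Data Guard's implementation: the `ColumnRule` type, the `table__purpose` view naming scheme, the use of `NULL` as the masked value, and the deny-by-default routing are all assumptions, and the generated SQL is generic rather than tied to a particular warehouse dialect.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnRule:
    """Hypothetical policy entry: a column visible only for the listed purposes."""
    column: str
    allowed_purposes: frozenset


def view_name(table, purpose):
    # Assumed naming convention: one masking view per (table, purpose) pair.
    return f"{table}__{purpose}"


def build_view_sql(table, columns, rules, purpose):
    """Emit a CREATE VIEW statement that masks columns not allowed for `purpose`."""
    rule_by_col = {r.column: r for r in rules}
    select_items = []
    for col in columns:
        rule = rule_by_col.get(col)
        if rule is not None and purpose not in rule.allowed_purposes:
            select_items.append(f"NULL AS {col}")  # masked column
        else:
            select_items.append(col)  # unrestricted or allowed column
    return (
        f"CREATE VIEW {view_name(table, purpose)} AS "
        f"SELECT {', '.join(select_items)} FROM {table}"
    )


def route(table, purpose, known_views):
    """Map a table access to its masking view; unknown purposes are denied."""
    target = view_name(table, purpose)
    if target not in known_views:
        raise PermissionError(f"no view for purpose {purpose!r} on {table!r}")
    return target


rules = [ColumnRule("email", frozenset({"fraud_detection"}))]
sql = build_view_sql("orders", ["id", "email", "total"], rules, "marketing")
views = {view_name("orders", "marketing")}
```

In this sketch an application keeps querying `orders`, and a layer in front of the warehouse rewrites the table reference to the result of `route("orders", purpose, views)`, which is one way to read the abstract's claim that routing is transparent to existing applications.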