AI Summary
To address the risk of unauthorized access to sensitive data in data lakes, this paper proposes Membrane, a system that combines encryption at rest with an SQL-aware encryption protocol. Exploiting the separation of compute and storage in data lakes, Membrane cryptographically enforces data-dependent, fine-grained access control views. Its key design choice is to decrypt restricted views only once, during session initialization; all subsequent queries run over plaintext, which balances strong security with analytical flexibility. Membrane builds on block ciphers, a symmetric-key primitive with hardware acceleration in CPUs, to provide at-rest encryption, while its SQL-aware protocol enforces column- and row-level access policies. Experimental results show that the first query result is delayed by up to approximately 20×; amortized over a session, however, query performance approaches that of an unencrypted baseline. Membrane thus achieves confidentiality at low amortized overhead without restricting the queries analysts can run.
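To make the idea of a column- and row-level access control view concrete, the following is a minimal sketch using Python's standard `sqlite3` module. It only illustrates what a data-dependent view looks like in SQL; it does not implement Membrane's encryption protocol, and the `sales` table, the `analyst_view` name, and the region-based policy are hypothetical names chosen for illustration.

```python
import sqlite3


def build_restricted_view(conn, region):
    """Create a view that hides the `ssn` column and keeps one region's rows.

    Hypothetical policy: the column filter is the column-level restriction,
    the WHERE clause is the row-level (data-dependent) restriction.
    """
    conn.execute("DROP VIEW IF EXISTS analyst_view")
    # SQLite does not allow bound parameters inside CREATE VIEW,
    # so the string literal is escaped by doubling single quotes.
    quoted = region.replace("'", "''")
    conn.execute(
        "CREATE VIEW analyst_view AS "
        f"SELECT id, region, amount FROM sales WHERE region = '{quoted}'"
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, region TEXT, ssn TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        (1, "EU", "111-11-1111", 10.0),
        (2, "US", "222-22-2222", 20.0),
        (3, "EU", "333-33-3333", 30.0),
    ],
)

build_restricted_view(conn, "EU")

# Any analytical query can now run against the view without further checks.
rows = conn.execute("SELECT id, amount FROM analyst_view ORDER BY id").fetchall()
total = conn.execute("SELECT SUM(amount) FROM analyst_view").fetchone()[0]
columns = [d[0] for d in conn.execute("SELECT * FROM analyst_view").description]
```

In this sketch the view is enforced by the query engine alone; the point of the system described above is to enforce such views cryptographically, so that a compromised storage layer cannot bypass them.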
Abstract
Organizations use data lakes to store and analyze sensitive data. But hackers may compromise data lake storage to bypass access controls and access sensitive data. To address this, we propose Membrane, a system that (1) cryptographically enforces data-dependent access control views over a data lake, (2) without restricting the analytical queries data scientists can run. We observe that data lakes, unlike DBMSes, disaggregate computation and storage into separate trust domains, making at-rest encryption sufficient to defend against remote attackers targeting data lake storage, even when running analytical queries in plaintext. This leads to a new system design for Membrane that combines encryption at rest with SQL-aware encryption. Using block ciphers, a fast symmetric-key primitive with hardware acceleration in CPUs, we develop a new SQL-aware encryption protocol well-suited to at-rest encryption. Membrane adds overhead only at the start of an interactive session due to decrypting views, delaying the first query result by up to $\approx 20\times$; subsequent queries process decrypted data in plaintext, resulting in low amortized overhead.
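The amortization claim can be illustrated with a short back-of-the-envelope calculation. The sketch below assumes, for simplicity, that every query in a session has the same baseline cost, that the first query is slowed by the reported upper bound of roughly 20×, and that later queries run at baseline speed; these are simplifying assumptions for illustration, not measurements from the paper, and `amortized_slowdown` is a hypothetical helper.

```python
def amortized_slowdown(num_queries, first_query_slowdown=20.0, later_slowdown=1.0):
    """Average per-query slowdown over a session of `num_queries` queries.

    Assumes equal baseline cost per query (an illustrative simplification).
    """
    if num_queries < 1:
        raise ValueError("a session needs at least one query")
    # One slow first query plus (n - 1) queries at the later slowdown.
    total = first_query_slowdown + (num_queries - 1) * later_slowdown
    return total / num_queries


# Average slowdown for sessions of increasing length.
session_lengths = [1, 20, 100]
slowdowns = {n: amortized_slowdown(n) for n in session_lengths}
```

Under these assumptions, the average slowdown is 20× for a single-query session and falls below 2× once a session contains 20 queries, which is the sense in which a one-time decryption cost becomes a low amortized overhead.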