🤖 AI Summary
To address the risk of sensitive user prompts, such as clinical records or financial data, being leaked in cloud-hosted large language model (LLM) services, this paper proposes an end-to-end privacy-preserving framework that balances prompt confidentiality, output invariance, and computational efficiency. The method combines Secure Partitioned Decoding (SPD), which confines user prompts to a trusted execution environment (TEE) in the form of a confidential virtual machine (CVM) while letting the service provider generate tokens efficiently, with a cryptographic Prompt Obfuscation (PO) scheme that hardens SPD against prompt reconstruction attacks. Evaluation shows that the framework preserves the original LLM's output quality and inference latency while providing strong confidentiality guarantees for sensitive prompts, outperforming existing approaches based on homomorphic encryption or differential privacy in both security and efficiency.
📝 Abstract
Our work tackles the challenge of securing user inputs in cloud-hosted large language model (LLM) serving while ensuring model confidentiality, output invariance, and compute efficiency. We introduce Secure Partitioned Decoding (SPD), which uses confidential computing to confine user prompts to a trusted execution environment (TEE), namely a confidential virtual machine (CVM), while allowing service providers to generate tokens efficiently. We also introduce a novel cryptographic method, Prompt Obfuscation (PO), to ensure robustness against reconstruction attacks on SPD. We demonstrate that our approach preserves both prompt confidentiality and LLM serving efficiency. Our solution enables privacy-preserving cloud LLM serving that handles sensitive prompts, such as clinical records, financial data, and personal information.
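The core trust boundary described above (prompt confined to a CVM, token generation driven by the untrusted provider) can be illustrated with a minimal conceptual sketch. This is not the paper's implementation: the `SimulatedTEE` class, its toy `next_token` model, and the token values are all hypothetical stand-ins, shown only to make the interface split concrete.

```python
# Conceptual sketch, NOT the paper's SPD implementation: the prompt lives
# only inside a simulated TEE object; the untrusted provider drives the
# decoding loop but sees only the tokens generated so far, never the prompt.

class SimulatedTEE:
    """Stands in for a confidential VM holding the user's prompt."""

    def __init__(self, prompt_tokens):
        self._prompt = list(prompt_tokens)  # never leaves this object

    def next_token(self, generated_so_far):
        # Toy placeholder "model": derives the next token from the prompt
        # length and decoding step. A real system would run the
        # prompt-dependent part of inference inside the TEE here.
        step = len(generated_so_far)
        return (len(self._prompt) + step) % 50257  # toy vocab size

def provider_decode(tee, max_new_tokens):
    """Untrusted host loop: observes only generated tokens, not the prompt."""
    out = []
    for _ in range(max_new_tokens):
        out.append(tee.next_token(out))
    return out

tee = SimulatedTEE(prompt_tokens=[101, 2023, 2003, 102])
print(provider_decode(tee, 4))  # -> [4, 5, 6, 7] for this 4-token prompt
```

The point of the split is that `provider_decode` holds no reference to `_prompt`; all prompt-dependent computation happens behind the TEE interface.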