Securing Transformer-based AI Execution via Unified TEE and Crypto-protected Accelerators

📅 2025-07-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Secure inference of Transformer models (e.g., LLMs) in untrusted cloud environments faces a fundamental trade-off between strong security (confidentiality and integrity of both data and model) and high performance, since running inference entirely inside trusted execution environments (TEEs) incurs severe computational bottlenecks. Method: The paper proposes TwinShield, the first TEE and cryptographic-accelerator co-design framework for secure Transformer inference. It offloads critical operators, including Attention and SoftMax, to untrusted GPUs under cryptographic protection, enabling end-to-end operator-level secure collaboration, and it combines TEE-based memory isolation, computation offloading, and remote attestation into a heterogeneous trusted execution stack. Contribution/Results: Across multiple Transformer models, TwinShield offloads ~87% of the computation to GPUs and achieves 4.0x–6.1x speedups over prior approaches, and it is presented as the first scheme to combine dual protection of data and model with practical efficiency.
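The paper's concrete protocol is not given in this summary. As background, a minimal sketch of the general pattern behind crypto-protected offloading (Slalom-style additive masking of a linear operator, with unmasking inside the TEE) illustrates why linear layers offload easily while Attention and SoftMax, being nonlinear, do not commute with an additive mask. All names, the modulus, and the masking scheme here are illustrative assumptions, not TwinShield's actual design, and this sketch masks only the activations, not the model weights.

```python
import numpy as np

P = 2**31 - 1  # illustrative prime modulus for masking arithmetic

rng = np.random.default_rng(0)

def untrusted_gpu_matmul(a, b):
    # Stand-in for the untrusted accelerator: it only ever sees masked inputs.
    return (a @ b) % P

def tee_offload_matmul(x, w):
    """Inside the TEE: additively mask the activation, hand the masked value
    to the untrusted accelerator, then unmask the returned result.
    Linear ops commute with additive masks: (x + r) @ w = x @ w + r @ w,
    so subtracting r @ w recovers x @ w. SoftMax has no such identity."""
    r = rng.integers(0, P, size=x.shape)        # one-time random mask
    masked = (x + r) % P                        # this is what leaves the TEE
    y_masked = untrusted_gpu_matmul(masked, w)  # runs outside the TEE
    unmask = untrusted_gpu_matmul(r, w)         # r @ w; precomputable offline
    return (y_masked - unmask) % P

# Small integer example: the offloaded result matches direct computation.
x = rng.integers(0, 100, size=(2, 4))
w = rng.integers(0, 100, size=(4, 3))
assert np.array_equal(tee_offload_matmul(x, w), (x @ w) % P)
```

Because this trick only works for linear operators, prior hybrid schemes kept Attention and SoftMax inside the TEE; securely offloading those nonlinear operators is the gap the summarized framework targets.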

📝 Abstract
Recent advances in Transformer models, e.g., large language models (LLMs), have brought tremendous breakthroughs in various artificial intelligence (AI) tasks, leading to their wide application in many security-critical domains. Due to their unprecedented scale and prohibitively high development cost, these models have become highly valuable intellectual property for AI stakeholders and are increasingly deployed via machine learning as a service (MLaaS). However, MLaaS often runs on untrusted cloud infrastructure, exposing both data and models to potential breaches. Mainstream protection mechanisms leverage trusted execution environments (TEEs), in which the confidentiality and integrity of sensitive data are shielded by hardware-based encryption and integrity checking. Unfortunately, running model inference entirely within TEEs incurs non-trivial slowdown, which is further exacerbated for LLMs due to their substantial computation and memory footprint. Recent studies show that hybrid TEE-based schemes, which offload part of the model inference to untrusted accelerators (e.g., GPUs), are a promising solution. However, prior offloading schemes fail to ensure dual protection of data and model in Transformer inference: they cannot securely offload the critical operations, i.e., Attention and SoftMax, forcing these computations to remain confined within TEEs. To address these challenges, we propose TwinShield, a framework enabling secure Transformer inference on heterogeneous TEE and accelerator systems with dual protection for both model and data. TwinShield offloads ~87% of computation to GPUs and delivers 4.0x–6.1x speedups over previous approaches across various Transformer models.
Problem

Research questions and friction points this paper is trying to address.

Securing Transformer models in untrusted cloud environments
Reducing performance overhead of TEE-based model inference
Ensuring dual protection for both data and model
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified TEE and crypto-protected accelerators
Secure offloading of Attention and SoftMax
Dual protection for model and data