TZ-LLM: Protecting On-Device Large Language Models with Arm TrustZone

📅 2025-11-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the risks of model parameter leakage and the security-efficiency trade-off in deploying large language models (LLMs) on mobile devices, this paper proposes a lightweight secure inference framework leveraging Arm TrustZone. Methodologically, it introduces a TEE-REE co-driven architecture enabling time-division sharing of the NPU to minimize the trusted computing base; designs a pipelined parameter prefetching mechanism integrating deterministic memory access prediction, on-demand decryption, and encrypted memory management; and develops a lightweight NPU data-plane driver. Evaluated on a Rockchip platform, the framework reduces time-to-first-token by up to 90.9% and improves decoding throughput by up to 23.2%, striking a practical balance between robust model intellectual-property protection and high-performance local inference.

📝 Abstract
Large Language Models (LLMs) deployed on mobile devices offer benefits like user privacy and reduced network latency, but introduce a significant security risk: the leakage of proprietary models to end users. To mitigate this risk, we propose a system design for protecting on-device LLMs using Arm Trusted Execution Environment (TEE), TrustZone. Our system addresses two primary challenges: (1) The dilemma between memory efficiency and fast inference (caching model parameters within TEE memory). (2) The lack of efficient and secure Neural Processing Unit (NPU) time-sharing between Rich Execution Environment (REE) and TEE. Our approach incorporates two key innovations. First, we employ pipelined restoration, leveraging the deterministic memory access patterns of LLM inference to prefetch parameters on demand, hiding memory allocation, I/O and decryption latency under computation time. Second, we introduce a co-driver design, creating a minimal data plane NPU driver in the TEE that collaborates with the full-fledged REE driver. This reduces the TEE TCB size and eliminates control plane reinitialization overhead during NPU world switches. We implemented our system on the emerging OpenHarmony OS and the llama.cpp inference framework, and evaluated it with various LLMs on an Arm Rockchip device. Compared to a strawman TEE baseline lacking our optimizations, our system reduces TTFT by up to 90.9% and increases decoding speed by up to 23.2%.
Problem

Research questions and friction points this paper is trying to address.

Securing proprietary LLMs on mobile devices against user leakage risks
Resolving memory efficiency versus fast inference dilemma in TEE
Enabling secure NPU time-sharing between Rich and Trusted Execution Environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pipelined restoration hides memory latency during LLM inference
Co-driver design enables secure NPU sharing between environments
Leveraging TrustZone to protect on-device LLM model parameters
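The pipelined-restoration idea above can be sketched as a simple producer-consumer overlap: while layer i runs on the NPU, the (encrypted) parameters of layer i+1 are fetched and decrypted in parallel, so I/O and decryption latency hide under compute time. This is a minimal conceptual sketch, not the paper's implementation; all function names below are illustrative placeholders.

```python
# Sketch of pipelined parameter restoration for layer-by-layer LLM inference.
# LLM inference touches layer weights in a deterministic order, so the next
# layer's load + decrypt can always be issued before it is needed.
from concurrent.futures import ThreadPoolExecutor

def load_and_decrypt(layer_id):
    # Placeholder: read the encrypted weights of `layer_id` and decrypt them
    # into protected memory; returns the plaintext parameters.
    return f"params[{layer_id}]"

def compute(layer_id, params, x):
    # Placeholder: run layer `layer_id` on the NPU with decrypted `params`.
    return x + 1

def pipelined_inference(num_layers, x):
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(load_and_decrypt, 0)   # prefetch first layer
        for i in range(num_layers):
            params = pending.result()              # wait for layer i's weights
            if i + 1 < num_layers:
                pending = io.submit(load_and_decrypt, i + 1)  # prefetch next
            x = compute(i, params, x)              # overlaps with the prefetch
    return x

print(pipelined_inference(4, 0))  # → 4
```

The single-worker executor models the paper's restoration pipeline stage: exactly one load/decrypt is in flight while the NPU computes, bounding the amount of decrypted parameter data resident at any time.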
Xunjie Wang
Institute of Parallel and Distributed Systems, School of Computer Science, Shanghai Jiao Tong University

Jiacheng Shi
Institute of Parallel and Distributed Systems, School of Computer Science, Shanghai Jiao Tong University

Zihan Zhao
Shanghai Jiao Tong University
NLP

Yang Yu
Institute of Parallel and Distributed Systems, School of Computer Science, Shanghai Jiao Tong University

Zhichao Hua
Associate Professor, Shanghai Jiao Tong University
Operating systems, architectures, hardware/software co-design, and systems/architectures for LLMs

Jinyu Gu
Shanghai Jiao Tong University
Operating Systems, System Security, Virtualization