🤖 AI Summary
To address the high latency and communication overhead of private inference for deep CNNs on commodity CPUs, this paper proposes Flash, a hybrid protocol for private image classification that combines homomorphic encryption (HE) with secure two-party computation (2PC). Key contributions include: (1) a low-latency homomorphic convolution algorithm built on a fast slot rotation operation and a novel data encoding scheme; (2) replacing all ReLUs with the polynomial $x^2 + x$ and training the models with a new training strategy; and (3) a 2PC protocol for evaluating $x^2 + x$ that requires no offline communication. Experiments show an end-to-end latency of 0.02 minutes for CIFAR-100 (0.07 GB communication) and 0.57 minutes for TinyImageNet (0.22 GB). Compared with the state of the art, Flash reduces latency by 16–45× and communication cost by 84–196×. Even ImageNet inference completes in under one minute with less than 1 GB of communication on a CPU.
📝 Abstract
This paper presents Flash, an optimized hybrid private inference (PI) protocol that uses both homomorphic encryption (HE) and secure two-party computation (2PC) and reduces the end-to-end PI latency for deep CNN models to less than 1 minute on a CPU. First, Flash proposes a low-latency convolution algorithm built upon a fast slot rotation operation and a novel data encoding scheme, which yields a 4-94x performance gain over the state of the art. Second, to minimize the communication cost introduced by the standard nonlinear activation function ReLU, Flash replaces all ReLUs with the polynomial $x^2+x$ and trains deep CNN models with a new training strategy. The trained models improve the inference accuracy for CIFAR-10/100 and TinyImageNet by 16% on average (up to 40% for ResNet-32) compared to prior art. Last, Flash proposes an efficient 2PC-based $x^2+x$ evaluation protocol that requires no offline communication and reduces the total communication cost of processing the activation layer by 84-196x over the state of the art. As a result, the end-to-end PI latency of Flash implemented on a CPU is 0.02 minutes for CIFAR-100 and 0.57 minutes for TinyImageNet classification, while the total data communication is 0.07 GB for CIFAR-100 and 0.22 GB for TinyImageNet. Flash improves on state-of-the-art PI by 16-45x in latency and 84-196x in communication cost. Moreover, even for ImageNet, Flash delivers a latency of less than 1 minute on a CPU with a total communication of less than 1 GB.
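To illustrate why a polynomial activation such as $x^2+x$ suits 2PC, the sketch below evaluates it on additive secret shares. It is not Flash's actual protocol, which the abstract does not specify. The ring size `MOD`, all function names, and the `cross_term_shares` helper are assumptions made for illustration. With $x = s_0 + s_1$, the identity $x^2 + x = (s_0^2 + s_0) + (s_1^2 + s_1) + 2 s_0 s_1$ shows that each party can compute most of the result locally, and only the cross term $2 s_0 s_1$ needs interaction. Here that cross term is produced by a stand-in that sees both shares, which a real protocol would replace with a secure multiplication.

```python
import secrets

MOD = 2 ** 32  # ring Z_{2^32}; the ring size is an assumption for this sketch


def share(x):
    """Split x into two additive shares modulo MOD."""
    r = secrets.randbelow(MOD)
    return r, (x - r) % MOD


def reconstruct(s0, s1):
    """Recombine two additive shares."""
    return (s0 + s1) % MOD


def local_term(s):
    """Part of x^2 + x a party computes from its own share alone."""
    return (s * s + s) % MOD


def cross_term_shares(s0, s1):
    """Hypothetical stand-in for a secure multiplication.

    It sees both shares, so it is NOT secure. A real protocol would
    compute fresh shares of 2*s0*s1 without revealing s0 or s1.
    """
    return share((2 * s0 * s1) % MOD)


def activation_shares(s0, s1):
    """Return additive shares of x^2 + x, where x = s0 + s1 (mod MOD)."""
    c0, c1 = cross_term_shares(s0, s1)
    y0 = (local_term(s0) + c0) % MOD
    y1 = (local_term(s1) + c1) % MOD
    return y0, y1


if __name__ == "__main__":
    x = 7
    y0, y1 = activation_shares(*share(x))
    print(reconstruct(y0, y1))  # shares recombine to x^2 + x modulo MOD
```

By contrast, ReLU requires a secure comparison on the shared value, which is typically the more communication-heavy step in 2PC.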