Flash: A Hybrid Private Inference Protocol for Deep CNNs with High Accuracy and Low Latency on CPU

📅 2024-01-30
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the low efficiency and high communication overhead of private inference for deep CNNs on commodity CPUs, this paper proposes an efficient end-to-end protocol for private image classification. The method integrates homomorphic encryption (HE) with secure two-party computation (2PC), supporting customizable data encoding and fast slot rotation. Key contributions include: (1) a low-latency homomorphic convolution algorithm; (2) replacing every ReLU with the quadratic polynomial $x^2 + x$ and retraining the models with a new training strategy; and (3) a 2PC protocol for evaluating the activation that requires no offline communication. Experiments show that the protocol completes CIFAR-100 inference in 0.02 minutes (0.07 GB communication) and TinyImageNet in 0.57 minutes (0.22 GB). Compared to state-of-the-art approaches, it accelerates inference by 16–45× and reduces communication by 84–196×. Notably, even ImageNet inference completes in under one minute with less than 1 GB of communication on a single CPU.
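To make contribution (3) concrete: evaluating $x^2 + x$ over additively secret-shared values needs only one secure multiplication (the squaring), since the $+x$ term can be added locally by each party. The paper's own protocol avoids offline communication; the toy sketch below instead uses a standard Beaver-triple multiplication (a classical 2PC building block, not the paper's protocol) purely to illustrate why this activation is 2PC-friendly. All names here are illustrative.

```python
import random

P = 2**61 - 1  # prime modulus for additive secret sharing (illustrative choice)

def share(x):
    """Split x into two additive shares modulo P."""
    r = random.randrange(P)
    return r, (x - r) % P

def beaver_triple():
    """Dealer-generated triple (a, b, c) with c = a*b, given out as shares."""
    a, b = random.randrange(P), random.randrange(P)
    return share(a), share(b), share(a * b % P)

def mul_shares(x_sh, y_sh):
    """One secure multiplication of shared x and y via a Beaver triple."""
    (a0, a1), (b0, b1), (c0, c1) = beaver_triple()
    # The parties jointly open e = x - a and d = y - b (these reveal nothing
    # about x or y since a and b are uniformly random masks).
    e = (x_sh[0] - a0 + x_sh[1] - a1) % P
    d = (y_sh[0] - b0 + y_sh[1] - b1) % P
    # Shares of x*y = c + e*b + d*a + e*d (the constant e*d goes to party 0).
    z0 = (c0 + e * b0 + d * a0 + e * d) % P
    z1 = (c1 + e * b1 + d * a1) % P
    return z0, z1

def poly_act_shares(x_sh):
    """Evaluate x^2 + x on shares: one secure squaring, then a local add."""
    s0, s1 = mul_shares(x_sh, x_sh)
    return (s0 + x_sh[0]) % P, (s1 + x_sh[1]) % P

x = 7
y0, y1 = poly_act_shares(share(x))
assert (y0 + y1) % P == x * x + x  # reconstructs 56
```

A ReLU, by contrast, needs a secure comparison, which is far more expensive in both rounds and bandwidth; this is the cost the polynomial substitution removes.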

📝 Abstract
This paper presents Flash, an optimized private inference (PI) hybrid protocol utilizing both homomorphic encryption (HE) and secure two-party computation (2PC), which reduces the end-to-end PI latency for deep CNN models to less than 1 minute on CPU. To this end, first, Flash proposes a low-latency convolution algorithm built upon a fast slot rotation operation and a novel data encoding scheme, which yields a 4-94x performance gain over the state-of-the-art. Second, to minimize the communication cost introduced by the standard nonlinear activation function ReLU, Flash replaces all ReLUs with the polynomial $x^2+x$ and trains deep CNN models with a new training strategy. The trained models improve the inference accuracy for CIFAR-10/100 and TinyImageNet by 16% on average (up to 40% for ResNet-32) compared to prior art. Last, Flash proposes an efficient 2PC-based $x^2+x$ evaluation protocol that requires no offline communication and reduces the total communication cost of the activation layers by 84-196x over the state-of-the-art. As a result, the end-to-end PI latency of Flash implemented on CPU is 0.02 minutes for CIFAR-100 and 0.57 minutes for TinyImageNet classification, while the total data communication is 0.07GB for CIFAR-100 and 0.22GB for TinyImageNet. Flash improves the state-of-the-art PI by 16-45x in latency and 84-196x in communication cost. Moreover, even for ImageNet, Flash delivers latency of less than 1 minute on CPU with total communication of less than 1GB.
Problem

Research questions and friction points this paper is trying to address.

Deep Learning
Image Recognition
Efficiency Enhancement
Innovation

Methods, ideas, or system contributions that make the work stand out.

High-speed image processing algorithm
Simplified activation function
Efficient computation and communication
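The "simplified activation function" listed above is the paper's $x^2+x$ polynomial. A minimal NumPy sketch (illustrative, not the paper's code) shows how it contrasts with ReLU: the polynomial uses one multiplication and no comparison, which is what makes it cheap under HE and 2PC, at the cost of a different shape that the paper's training strategy compensates for.

```python
import numpy as np

def relu(x):
    # Standard activation: requires a comparison, which is expensive
    # to evaluate under HE/2PC.
    return np.maximum(x, 0.0)

def poly_act(x):
    # Flash's replacement: x^2 + x, a single multiplication plus an
    # addition, both cheap under HE and additive secret sharing.
    return x * x + x

x = np.array([-1.5, -0.5, 0.5, 1.5])
print(poly_act(x))  # elementwise x**2 + x: 0.75, -0.25, 0.75, 3.75
```

Unlike ReLU, the polynomial is nonzero for negative inputs and grows quadratically, so simply swapping it into a pretrained network degrades accuracy; the paper's contribution is a training strategy that recovers (and improves on) prior polynomial-activation accuracy.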
Hyeri Roh
Dept. of ECE, ISRC, Seoul National University
Jinsu Yeo
Seoul National University, Samsung Research
Yeongil Ko
Harvard University, now at Google
Gu-Yeon Wei
Robert and Suzanne Case Professor of EE and CS, Harvard University
integrated circuits, computer architecture
David Brooks
Haley Family Professor of Computer Science, Harvard University
Computer Architecture
Woo-Seok Choi
Dept. of ECE, ISRC, Seoul National University