Flash: A Hybrid Private Inference Protocol for Deep CNNs with High Accuracy and Low Latency on CPU

📅 2024-01-30
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the low efficiency and high communication overhead of private inference for deep CNNs on commodity CPUs, this paper proposes an efficient end-to-end protocol for private image classification. The method integrates homomorphic encryption (HE) with secure two-party computation (2PC), supporting customizable data encoding and fast slot rotation. Key contributions include: (1) a low-latency homomorphic convolution algorithm; (2) replacing every ReLU with the quadratic polynomial $x^2 + x$ and retraining the models with a new training strategy; and (3) a 2PC protocol for evaluating the activation that requires no offline communication. Experiments show that the protocol completes CIFAR-100 inference in 0.02 minutes (0.07 GB communication) and TinyImageNet in 0.57 minutes (0.22 GB). Compared to state-of-the-art approaches, it accelerates inference by 16–45× and reduces communication by 84–196×. Notably, even ImageNet inference completes in under one minute with less than 1 GB of communication on a single CPU.
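To make contribution (3) concrete: evaluating $x^2 + x$ over additively secret-shared values needs only one secure multiplication (the squaring), since the $+x$ term can be added locally by each party. The paper's own protocol avoids offline communication; the toy sketch below instead uses a standard Beaver-triple multiplication (a classical 2PC building block, not the paper's protocol) purely to illustrate why this activation is 2PC-friendly. All names here are illustrative.

```python
import random

P = 2**61 - 1  # prime modulus for additive secret sharing (illustrative choice)

def share(x):
    """Split x into two additive shares modulo P."""
    r = random.randrange(P)
    return r, (x - r) % P

def beaver_triple():
    """Dealer-generated triple (a, b, c) with c = a*b, given out as shares."""
    a, b = random.randrange(P), random.randrange(P)
    return share(a), share(b), share(a * b % P)

def mul_shares(x_sh, y_sh):
    """One secure multiplication of shared x and y via a Beaver triple."""
    (a0, a1), (b0, b1), (c0, c1) = beaver_triple()
    # The parties jointly open e = x - a and d = y - b (these reveal nothing
    # about x or y since a and b are uniformly random masks).
    e = (x_sh[0] - a0 + x_sh[1] - a1) % P
    d = (y_sh[0] - b0 + y_sh[1] - b1) % P
    # Shares of x*y = c + e*b + d*a + e*d (the constant e*d goes to party 0).
    z0 = (c0 + e * b0 + d * a0 + e * d) % P
    z1 = (c1 + e * b1 + d * a1) % P
    return z0, z1

def poly_act_shares(x_sh):
    """Evaluate x^2 + x on shares: one secure squaring, then a local add."""
    s0, s1 = mul_shares(x_sh, x_sh)
    return (s0 + x_sh[0]) % P, (s1 + x_sh[1]) % P

x = 7
y0, y1 = poly_act_shares(share(x))
assert (y0 + y1) % P == x * x + x  # reconstructs 56
```

A ReLU, by contrast, needs a secure comparison, which is far more expensive in both rounds and bandwidth; this is the cost the polynomial substitution removes.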

📝 Abstract
This paper presents Flash, an optimized private inference (PI) hybrid protocol utilizing both homomorphic encryption (HE) and secure two-party computation (2PC), which reduces the end-to-end PI latency for deep CNN models to less than 1 minute on CPU. To this end, first, Flash proposes a low-latency convolution algorithm built upon a fast slot rotation operation and a novel data encoding scheme, which yields a 4-94x performance gain over the state-of-the-art. Second, to minimize the communication cost introduced by the standard nonlinear activation function ReLU, Flash replaces all ReLUs with the polynomial $x^2+x$ and trains deep CNN models with a new training strategy. The trained models improve the inference accuracy for CIFAR-10/100 and TinyImageNet by 16% on average (up to 40% for ResNet-32) compared to prior art. Last, Flash proposes an efficient 2PC-based $x^2+x$ evaluation protocol that requires no offline communication and reduces the total communication cost of the activation layers by 84-196x over the state-of-the-art. As a result, the end-to-end PI latency of Flash implemented on CPU is 0.02 minutes for CIFAR-100 and 0.57 minutes for TinyImageNet classification, while the total data communication is 0.07GB for CIFAR-100 and 0.22GB for TinyImageNet. Flash improves the state-of-the-art PI by 16-45x in latency and 84-196x in communication cost. Moreover, even for ImageNet, Flash delivers latency of less than 1 minute on CPU with total communication of less than 1GB.
Problem

Research questions and friction points this paper is trying to address.

Deep Learning
Image Recognition
Efficiency Enhancement
Innovation

Methods, ideas, or system contributions that make the work stand out.

High-speed image processing algorithm
Simplified activation function
Efficient computation and communication
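The "simplified activation function" listed above is the paper's $x^2+x$ polynomial. A minimal NumPy sketch (illustrative, not the paper's code) shows how it contrasts with ReLU: the polynomial uses one multiplication and no comparison, which is what makes it cheap under HE and 2PC, at the cost of a different shape that the paper's training strategy compensates for.

```python
import numpy as np

def relu(x):
    # Standard activation: requires a comparison, which is expensive
    # to evaluate under HE/2PC.
    return np.maximum(x, 0.0)

def poly_act(x):
    # Flash's replacement: x^2 + x, a single multiplication plus an
    # addition, both cheap under HE and additive secret sharing.
    return x * x + x

x = np.array([-1.5, -0.5, 0.5, 1.5])
print(poly_act(x))  # elementwise x**2 + x: 0.75, -0.25, 0.75, 3.75
```

Unlike ReLU, the polynomial is nonzero for negative inputs and grows quadratically, so simply swapping it into a pretrained network degrades accuracy; the paper's contribution is a training strategy that recovers (and improves on) prior polynomial-activation accuracy.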
Hyeri Roh
Dept. of ECE, ISRC, Seoul National University
Jinsu Yeo
Seoul National University, Samsung Research
Yeongil Ko
Harvard University, now at Google
Gu-Yeon Wei
Robert and Suzanne Case Professor of EE and CS, Harvard University
integrated circuits, computer architecture
David Brooks
Haley Family Professor of Computer Science, Harvard University
Computer Architecture
Woo-Seok Choi
Dept. of ECE, ISRC, Seoul National University