Parallax: Runtime Parallelization for Operator Fallbacks in Heterogeneous Edge Systems

📅 2025-12-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing frameworks for real-time dynamic DNN inference on edge devices suffer from CPU underutilization, high latency, and memory spikes when unsupported operators fall back to the CPU. This work addresses these issues by proposing a branch-aware memory management scheme and an adaptive heterogeneous scheduling mechanism—enabling fine-grained, subgraph-level CPU-accelerator co-execution without model restructuring. Our approach leverages computation DAG partitioning, dedicated memory pool allocation, buffer reuse, and branch-aware scheduling to build a lightweight heterogeneous subgraph execution engine. Evaluated across three mobile devices and five dynamic DNN models, our method reduces end-to-end inference latency by up to 46%, incurs only 26.5% average memory overhead, and improves energy efficiency by up to 30%. To the best of our knowledge, this is the first framework to achieve such efficient, transparent, and fine-grained heterogeneous execution for dynamic DNNs on resource-constrained edge platforms.

📝 Abstract
The growing demand for real-time DNN applications on edge devices necessitates faster inference of increasingly complex models. Although many devices include specialized accelerators (e.g., mobile GPUs), dynamic control-flow operators and unsupported kernels often fall back to CPU execution. Existing frameworks handle these fallbacks poorly, leaving CPU cores idle and causing high latency and memory spikes. We introduce Parallax, a framework that accelerates mobile DNN inference without model refactoring or custom operator implementations. Parallax first partitions the computation DAG to expose parallelism, then employs branch-aware memory management with dedicated arenas and buffer reuse to reduce runtime footprint. An adaptive scheduler executes branches according to device memory constraints, while fine-grained subgraph control enables heterogeneous inference of dynamic models. Evaluated on five representative DNNs across three mobile devices, Parallax achieves up to 46% latency reduction, keeps memory overhead controlled (26.5% on average), and delivers up to 30% energy savings compared with state-of-the-art frameworks, improvements aligned with the responsiveness demands of real-time mobile inference.
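As a rough illustration of the DAG-partitioning step described in the abstract, the sketch below collects the independent operator branches between a fork node and its join node in a computation DAG; each branch could then be dispatched to a different backend (CPU or accelerator). The `parallel_branches` helper and the fork/join representation are hypothetical, not the paper's actual partitioner.

```python
from collections import defaultdict, deque

def parallel_branches(edges, fork, join):
    """Collect the independent operator branches between a fork node
    and its join node in a computation DAG (given as (src, dst) edges).
    Each returned branch is a list of operators with no data dependency
    on the other branches, so they can run concurrently."""
    succ = defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
    branches = []
    for start in succ[fork]:
        branch, frontier = [], deque([start])
        while frontier:
            n = frontier.popleft()
            if n == join or n in branch:
                continue  # stop at the join; skip already-visited ops
            branch.append(n)
            frontier.extend(succ[n])
        branches.append(branch)
    return branches

# A diamond-shaped DAG: "a" forks into "b" and "c", which join at "d".
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
print(parallel_branches(edges, "a", "d"))  # → [['b'], ['c']]
```

In a real model the two branches would be, for example, a GPU-supported convolution chain and a fallback control-flow subgraph, which Parallax-style co-execution would overlap instead of serializing.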
Problem

Research questions and friction points this paper is trying to address.

How to accelerate mobile DNN inference without model refactoring or custom operators
CPU fallbacks for unsupported operators cause high latency and memory spikes
Heterogeneous edge devices lack efficient fine-grained CPU-accelerator parallelism
Innovation

Methods, ideas, or system contributions that make the work stand out.

Partitions computation DAG to expose parallelism
Uses branch-aware memory management with arenas
Employs adaptive scheduler for memory constraints
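The arena and scheduling ideas in the bullets above can be sketched roughly as follows. `Arena` and `run_branches` are illustrative names, and the simple fits-in-budget policy is an assumption standing in for the paper's adaptive scheduler: a per-branch pool reuses freed buffers, and branches run in parallel only when their combined peak footprint fits the device memory budget, otherwise serially.

```python
import threading

class Arena:
    """Dedicated memory pool for one branch (hypothetical sketch).
    Freed buffers are kept for reuse instead of being returned to
    the system allocator, damping allocation spikes."""
    def __init__(self):
        self.free = []     # reusable buffers
        self.in_use = 0    # bytes currently allocated
        self.peak = 0      # high-water mark for this branch

    def alloc(self, size):
        for buf in self.free:
            if len(buf) >= size:       # reuse the first buffer that fits
                self.free.remove(buf)
                self.in_use += len(buf)
                self.peak = max(self.peak, self.in_use)
                return buf
        buf = bytearray(size)          # no fit: grow the arena
        self.in_use += size
        self.peak = max(self.peak, self.in_use)
        return buf

    def release(self, buf):
        self.in_use -= len(buf)
        self.free.append(buf)          # keep for later reuse

def run_branches(branches, budget):
    """branches: list of (callable, peak_bytes). Run them concurrently
    when the combined peaks fit the memory budget; otherwise fall back
    to serial execution to avoid a memory spike."""
    if sum(peak for _, peak in branches) <= budget:
        threads = [threading.Thread(target=fn) for fn, _ in branches]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return "parallel"
    for fn, _ in branches:
        fn()
    return "serial"
```

The budget check mirrors the adaptive idea at a very coarse grain: the real system would make this decision per subgraph at runtime, with accelerator and CPU branches holding separate arenas.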