DynaFlow: Transparent and Flexible Intra-Device Parallelism via Programmable Operator Scheduling

📅 2026-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing deep learning frameworks rely on a static, sequential execution model that makes intra-device parallelism hard to exploit: hardware is under-utilized, and integrating a parallelization strategy requires extensive model-specific code. DynaFlow addresses this by decoupling the logical model definition from the physical execution schedule. Through frontend annotations for graph partitioning and a programmable scheduling interface, developers can specify parallelism strategies without rewriting the model code. The backend handles control and data flow asynchronously, uses custom memory management to eliminate copy overheads, and remains compatible with optimizations such as TorchInductor and CUDA Graphs. The authors integrate representative parallelism strategies into six state-of-the-art ML systems with minimal code changes and report up to a 1.29× throughput improvement. The implementation is publicly available.

📝 Abstract

Intra-device parallelism addresses resource under-utilization in ML inference and training by overlapping the execution of operators with different resource usage. However, its wide adoption is hindered by a fundamental conflict with the static, sequential programming model of existing frameworks. Integrating these strategies requires invasive, model-specific code overhauls, representing an intractable engineering cost. This is further amplified by the high sensitivity of strategies to execution contexts (e.g., workload, model architecture, hardware), forcing developers to implement and maintain multiple specialized solutions. To address this, we propose DynaFlow, a framework that enables the transparent and flexible integration of intra-device parallelism by decoupling the logical model definition from the physical execution schedule. DynaFlow introduces a flexible frontend with annotations for graph partitioning and a programmable interface for defining custom intra-device parallelism strategies. Its efficient backend manages complex control/data-flow asynchronously, uses custom memory management to eliminate copy overheads, and preserves compatibility with optimizations like CUDA Graphs and TorchInductor. We demonstrate that DynaFlow can integrate representative parallelism strategies into 6 state-of-the-art ML systems with minimal code changes, achieving up to a 1.29x throughput improvement. DynaFlow is publicly available at https://github.com/uw-syfi/DynaFlow.
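
The core idea in the abstract, keeping the logical model definition separate from the physical execution schedule, can be illustrated with a toy simulation. The sketch below is not DynaFlow's API: the names `Op`, `MODEL`, `sequential`, `interleaved`, and `simulate` are hypothetical, and the operator costs are invented numbers chosen only to show how overlapping operators that use different resources can shorten total execution time.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Op:
    name: str
    resource: str  # hypothetical resource class: "compute" or "comm"
    cost: int      # invented, abstract time units


# Logical model definition: written once, independent of any schedule.
MODEL: List[Op] = [
    Op("attention", "compute", 4),
    Op("all_reduce_1", "comm", 3),
    Op("mlp", "compute", 4),
    Op("all_reduce_2", "comm", 3),
]

# A schedule is the issue order of (micro_batch, op_index) pairs.
Schedule = List[Tuple[int, int]]


def sequential(num_mb: int, model: List[Op]) -> Schedule:
    """Run every operator of one micro-batch before starting the next."""
    return [(mb, i) for mb in range(num_mb) for i in range(len(model))]


def interleaved(num_mb: int, model: List[Op]) -> Schedule:
    """Issue the same operator for all micro-batches before moving on."""
    return [(mb, i) for i in range(len(model)) for mb in range(num_mb)]


def simulate(schedule: Schedule, model: List[Op], overlap: bool = True) -> int:
    """Return the makespan of a schedule.

    Each resource runs one operator at a time, in issue order, and the
    operators of a micro-batch run in order. With overlap=False, all
    operators share a single resource, modelling fully serial execution.
    """
    resource_free: Dict[str, int] = {}
    mb_ready: Dict[int, int] = {}
    for mb, i in schedule:
        op = model[i]
        key = op.resource if overlap else "device"
        start = max(resource_free.get(key, 0), mb_ready.get(mb, 0))
        end = start + op.cost
        resource_free[key] = end
        mb_ready[mb] = end
    return max(mb_ready.values())


baseline = simulate(sequential(2, MODEL), MODEL, overlap=False)
overlapped = simulate(interleaved(2, MODEL), MODEL, overlap=True)
print(baseline, overlapped)  # prints "28 19"
```

In this toy setting the model definition (`MODEL`) is never modified; only the schedule function changes. A different workload or hardware profile could call for a different schedule function, which is the kind of context sensitivity the abstract describes.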

Problem

Research questions and friction points this paper is trying to address.

intra-device parallelism

ML inference

static programming model

resource under-utilization

execution context

Innovation

Methods, ideas, or system contributions that make the work stand out.

intra-device parallelism

programmable scheduling

transparent integration