Forge-UGC: FX optimization and register-graph engine for universal graph compiler

📅 2026-04-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high compilation overhead and inference latency of existing compilation frameworks such as OpenVINO and ONNX Runtime, which the authors attribute to opaque pipelines, limited pass-level visibility, and weak buffer management. To overcome these limitations, we propose Forge-UGC, a hardware-agnostic, four-phase graph compiler that separates graph capture, optimization, IR lowering, and backend scheduling, and that handles modern Transformer components without manual decomposition. Built on torch.export for ATen-level graph capture, the compiler combines six graph optimization passes, a typed intermediate representation with explicit virtual register assignments, liveness analysis, and linear-scan buffer allocation. We also introduce evaluation metrics, including Fusion Gain Ratio and Compilation Efficiency Index, for NPU compilation pipelines. Evaluated on six model families ranging from 125M to 8B parameters, Forge-UGC achieves 6.9–9.2× faster compilation, 18.2–35.7% lower inference latency, and 30.2–40.9% lower energy per inference, while preserving numerical fidelity (KL divergence below 8.4e-9).
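
To make the buffer-management idea concrete, the following is a minimal sketch of liveness analysis followed by linear-scan buffer allocation over a straight-line program of virtual registers. It is not the Forge-UGC implementation: the `Instr` type, the function names, and the toy five-instruction program are all hypothetical, and the sketch assumes single-assignment registers, equally sized buffers, and no in-place reuse of an operand's buffer.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    dst: str      # virtual register written by this instruction
    srcs: tuple   # virtual registers read by this instruction


def live_intervals(program, outputs):
    """Map each defined register to (definition index, last-use index)."""
    start, end = {}, {}
    for i, ins in enumerate(program):
        for s in ins.srcs:
            end[s] = i                  # latest read seen so far
        start.setdefault(ins.dst, i)    # single assignment assumed
        end.setdefault(ins.dst, i)
    for r in outputs:
        end[r] = len(program)           # outputs stay live past the last op
    # Graph inputs such as "x" are never defined here, so they are skipped.
    return {r: (start[r], end[r]) for r in start}


def linear_scan(intervals):
    """Assign a buffer id to each register, reusing buffers that are free."""
    assignment, free, active, n_buffers = {}, [], [], 0
    for reg, (s, e) in sorted(intervals.items(), key=lambda kv: kv[1][0]):
        # Release buffers whose last use is strictly before this definition.
        still_active = []
        for end_i, r in active:
            if end_i < s:
                free.append(assignment[r])
            else:
                still_active.append((end_i, r))
        active = still_active
        if free:
            buf = free.pop()
        else:
            buf = n_buffers
            n_buffers += 1
        assignment[reg] = buf
        active.append((e, reg))
    return assignment, n_buffers


# Hypothetical chain of five ops: each one consumes the previous result.
program = [
    Instr("v0", ("x",)),
    Instr("v1", ("v0",)),
    Instr("v2", ("v1",)),
    Instr("v3", ("v2",)),
    Instr("v4", ("v3",)),
]
intervals = live_intervals(program, outputs=["v4"])
assignment, n_buffers = linear_scan(intervals)
print(assignment, n_buffers)
```

In this toy chain, five virtual registers fit in two buffers that alternate, because each intermediate value dies as soon as the next op has read it. Real graphs have branching and differently sized tensors, so the savings would depend on the model.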

📝 Abstract

We present Forge-UGC (FX Optimization and Register-Graph Engine for Universal Graph Compilation), a four-phase compiler for transformer deployment on heterogeneous accelerator hardware, validated on the Intel AI Boost NPU. Existing frameworks such as OpenVINO and ONNX Runtime often rely on opaque compilation pipelines, limited pass-level visibility, and weak buffer management, which can lead to higher compilation cost and runtime overhead. Forge-UGC addresses these issues with a hardware-agnostic design that separates graph capture, optimization, intermediate representation lowering, and backend scheduling. Phase 1 captures graphs with torch.export at the ATen operator level, supporting modern transformer components such as rotary position embeddings, grouped-query attention, and SwiGLU without manual decomposition. Phase 2 applies six optimization passes (dead code elimination, common subexpression elimination, constant folding, attention fusion, operator fusion, and layout optimization), reducing graph node count by 14.2 to 21.9%. Phase 3 lowers the optimized graph into a typed intermediate representation with explicit virtual register assignments. Phase 4 performs liveness analysis, linear-scan buffer allocation (reducing peak buffer count by 30 to 48%), and device-affinity scheduling (reducing NPU-CPU transitions by 42 to 65%). Across six model families ranging from 125M to 8B parameters, evaluated on WikiText-103 and GLUE, Forge-UGC delivers 6.9 to 9.2x faster compilation than OpenVINO and ONNX Runtime, 18.2 to 35.7% lower inference latency, and 30.2 to 40.9% lower energy per inference. Fidelity is preserved, with maximum absolute logit differences below 2.1e-5 and KL divergence below 8.4e-9. We also introduce Fusion Gain Ratio, Compilation Efficiency Index, and per-pass execution profiling for systematic evaluation of NPU compilation pipelines.
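
As an illustration of two of the Phase 2 passes named above, here is a small sketch of common subexpression elimination followed by dead code elimination on a toy graph. It does not use torch.export or the paper's IR: the `(name, op, args)` tuple format, the function names, and the example graph are hypothetical, and every op is assumed to be pure (no side effects) and listed in topological order.

```python
def cse(nodes):
    """Merge nodes that apply the same op to the same (rewritten) arguments."""
    seen = {}     # (op, args) -> name of the first node computing it
    rename = {}   # eliminated node name -> surviving node name
    out = []
    for name, op, args in nodes:
        args = tuple(rename.get(a, a) for a in args)
        key = (op, args)
        if key in seen:
            rename[name] = seen[key]    # duplicate: redirect its users
        else:
            seen[key] = name
            out.append((name, op, args))
    return out, rename


def dce(nodes, outputs):
    """Keep only nodes that are reachable from the graph outputs."""
    live = set(outputs)
    kept = []
    for name, op, args in reversed(nodes):
        if name in live:
            live.update(args)
            kept.append((name, op, args))
    return kept[::-1]


# Hypothetical graph: "b" duplicates "a", and "d" is never used.
graph = [
    ("a", "mul", ("x", "w")),
    ("b", "mul", ("x", "w")),
    ("c", "add", ("a", "b")),
    ("d", "relu", ("x",)),
    ("e", "relu", ("c",)),
]
outputs = ["e"]

merged, rename = cse(graph)
outputs = [rename.get(o, o) for o in outputs]
optimized = dce(merged, outputs)

for node in optimized:
    print(node)
print(f"nodes: {len(graph)} -> {len(optimized)}")
```

Running CSE first matters here: once "b" is folded into "a", the add node reads the same register twice, and DCE then drops the unused relu. The node counts in this toy graph are illustrative only and are unrelated to the reductions reported in the abstract.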

Problem

Research questions and friction points this paper is trying to address.

graph compilation

heterogeneous accelerators

buffer management

compilation overhead

transformer deployment

Innovation

Methods, ideas, or system contributions that make the work stand out.

graph compilation

transformer optimization

virtual register allocation