TRINE: A Token-Aware, Runtime-Adaptive FPGA Inference Engine for Multimodal AI

📅 2026-03-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of computational heterogeneity and hard real-time constraints in deploying multimodal AI models on embedded platforms. The authors propose a reconfiguration-free, single-bitstream FPGA inference engine that unifies DDMM, SDDMM, and SpMM operators to enable runtime-switchable computation modes on a shared processing element (PE) array. Key innovations include in-pipeline token pruning, dependency-aware layer offloading (DALO), a weight/output-stationary systolic array, 1×CS SIMD architecture, a routable adder tree (RADT), and int8 quantization. Evaluated on Alveo U50 and ZCU104 platforms, the design achieves up to 22.57× and 6.86× lower end-to-end latency compared to RTX 4090 and Jetson Orin Nano, respectively. Token pruning yields a 7.8× speedup, DALO improves throughput by 79%, and accuracy degradation remains below 2.5%.
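The summary credits int8 quantization with keeping accuracy degradation below 2.5%. The paper targets hardware, but the arithmetic is easy to illustrate in software: the following minimal sketch shows symmetric per-tensor int8 quantization (scale chosen so the maximum absolute value maps to 127), which is one common scheme; the exact calibration used by TRINE is not specified here, so treat the function names and the per-tensor choice as assumptions.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor int8 quantization: max |x| maps to 127."""
    max_abs = np.max(np.abs(x))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map int8 codes back to float32 for error measurement."""
    return q.astype(np.float32) * scale

x = np.random.randn(4, 8).astype(np.float32)
q, scale = quantize_int8(x)
x_hat = dequantize(q, scale)
# Round-to-nearest keeps the per-element error within half a quantization step.
err = np.max(np.abs(x - x_hat))
```

In a real deployment the scale factors would be folded into the accelerator's accumulator path rather than applied in floating point as done here.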

📝 Abstract
Multimodal stacks that mix ViTs, CNNs, GNNs, and transformer NLP strain embedded platforms because their compute/memory patterns diverge and hard real-time targets leave little slack. TRINE is a single-bitstream FPGA accelerator and compiler that executes end-to-end multimodal inference without reconfiguration. Layers are unified as DDMM/SDDMM/SpMM and mapped to a mode-switchable engine that toggles at runtime among weight/output-stationary systolic, 1×CS SIMD, and a routable adder tree (RADT) on a shared PE array. A width-matched, two-stage top-k unit enables in-stream token pruning, while dependency-aware layer offloading (DALO) overlaps independent kernels across reconfigurable processing units to sustain utilization. Evaluated on Alveo U50 and ZCU104, TRINE reduces latency by up to 22.57× vs. RTX 4090 and 6.86× vs. Jetson Orin Nano at 20-21 W; token pruning alone yields up to 7.8× on ViT-heavy pipelines, and DALO contributes up to 79% throughput improvement. With int8 quantization, accuracy drops remain below 2.5% across representative tasks, delivering state-of-the-art latency and energy efficiency for unified vision, language, and graph workloads, all in one bitstream.
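The abstract describes in-stream token pruning via a two-stage top-k unit. TRINE implements this in hardware; a minimal software sketch of the selection logic, assuming a mean-magnitude importance score (the scoring function is an assumption, not stated in the abstract), looks like:

```python
import numpy as np

def prune_tokens(tokens: np.ndarray, scores: np.ndarray, keep: int):
    """Keep the `keep` highest-scoring tokens, preserving sequence order.

    Two stages, loosely mirroring a coarse-then-exact top-k:
      1) argpartition finds the top-`keep` candidates without a full sort;
      2) the survivors are re-sorted back into original sequence order.
    """
    idx = np.argpartition(scores, -keep)[-keep:]  # stage 1: unordered top-k
    idx = np.sort(idx)                            # stage 2: restore token order
    return tokens[idx], idx

tokens = np.random.randn(196, 64)      # e.g. 196 ViT patch tokens, dim 64
scores = np.abs(tokens).mean(axis=1)   # proxy importance score (assumption)
kept, idx = prune_tokens(tokens, scores, keep=98)
```

Pruning half the tokens shrinks every downstream attention and MLP layer proportionally, which is where the reported up-to-7.8× speedup on ViT-heavy pipelines comes from.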
Problem

Research questions and friction points this paper is trying to address.

multimodal AI
FPGA inference
real-time embedded systems
heterogeneous neural networks
compute/memory divergence
Innovation

Methods, ideas, or system contributions that make the work stand out.

runtime-adaptive
token-aware pruning
multimodal inference
mode-switchable engine
dependency-aware layer offloading