TRINE: A Token-Aware, Runtime-Adaptive FPGA Inference Engine for Multimodal AI

📅 2026-03-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenges of computational heterogeneity and hard real-time constraints in deploying multimodal AI models on embedded platforms. The authors propose a reconfiguration-free, single-bitstream FPGA inference engine that unifies DDMM, SDDMM, and SpMM operators to enable runtime-switchable computation modes on a shared processing element (PE) array. Key innovations include in-pipeline token pruning, dependency-aware layer offloading (DALO), a weight/output-stationary systolic array, 1×CS SIMD architecture, a routable adder tree (RADT), and int8 quantization. Evaluated on Alveo U50 and ZCU104 platforms, the design achieves up to 22.57× and 6.86× lower end-to-end latency compared to RTX 4090 and Jetson Orin Nano, respectively. Token pruning yields a 7.8× speedup, DALO improves throughput by 79%, and accuracy degradation remains below 2.5%.
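The summary credits int8 quantization with keeping accuracy degradation below 2.5%. The paper targets hardware, but the arithmetic is easy to illustrate in software: the following minimal sketch shows symmetric per-tensor int8 quantization (scale chosen so the maximum absolute value maps to 127), which is one common scheme; the exact calibration used by TRINE is not specified here, so treat the function names and the per-tensor choice as assumptions.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor int8 quantization: max |x| maps to 127."""
    max_abs = np.max(np.abs(x))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map int8 codes back to float32 for error measurement."""
    return q.astype(np.float32) * scale

x = np.random.randn(4, 8).astype(np.float32)
q, scale = quantize_int8(x)
x_hat = dequantize(q, scale)
# Round-to-nearest keeps the per-element error within half a quantization step.
err = np.max(np.abs(x - x_hat))
```

In a real deployment the scale factors would be folded into the accelerator's accumulator path rather than applied in floating point as done here.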

📝 Abstract
Multimodal stacks that mix ViTs, CNNs, GNNs, and transformer NLP strain embedded platforms because their compute/memory patterns diverge and hard real-time targets leave little slack. TRINE is a single-bitstream FPGA accelerator and compiler that executes end-to-end multimodal inference without reconfiguration. Layers are unified as DDMM/SDDMM/SpMM and mapped to a mode-switchable engine that toggles at runtime among weight/output-stationary systolic, 1×CS SIMD, and a routable adder tree (RADT) on a shared PE array. A width-matched, two-stage top-k unit enables in-stream token pruning, while dependency-aware layer offloading (DALO) overlaps independent kernels across reconfigurable processing units to sustain utilization. Evaluated on Alveo U50 and ZCU104, TRINE reduces latency by up to 22.57× vs. RTX 4090 and 6.86× vs. Jetson Orin Nano at 20-21 W; token pruning alone yields up to 7.8× on ViT-heavy pipelines, and DALO contributes up to 79% throughput improvement. With int8 quantization, accuracy drops remain below 2.5% across representative tasks, delivering state-of-the-art latency and energy efficiency for unified vision, language, and graph workloads, all in one bitstream.
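The abstract describes in-stream token pruning via a two-stage top-k unit. TRINE implements this in hardware; a minimal software sketch of the selection logic, assuming a mean-magnitude importance score (the scoring function is an assumption, not stated in the abstract), looks like:

```python
import numpy as np

def prune_tokens(tokens: np.ndarray, scores: np.ndarray, keep: int):
    """Keep the `keep` highest-scoring tokens, preserving sequence order.

    Two stages, loosely mirroring a coarse-then-exact top-k:
      1) argpartition finds the top-`keep` candidates without a full sort;
      2) the survivors are re-sorted back into original sequence order.
    """
    idx = np.argpartition(scores, -keep)[-keep:]  # stage 1: unordered top-k
    idx = np.sort(idx)                            # stage 2: restore token order
    return tokens[idx], idx

tokens = np.random.randn(196, 64)      # e.g. 196 ViT patch tokens, dim 64
scores = np.abs(tokens).mean(axis=1)   # proxy importance score (assumption)
kept, idx = prune_tokens(tokens, scores, keep=98)
```

Pruning half the tokens shrinks every downstream attention and MLP layer proportionally, which is where the reported up-to-7.8× speedup on ViT-heavy pipelines comes from.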
Problem

Research questions and friction points this paper is trying to address.

multimodal AI
FPGA inference
real-time embedded systems
heterogeneous neural networks
compute/memory divergence
Innovation

Methods, ideas, or system contributions that make the work stand out.

runtime-adaptive
token-aware pruning
multimodal inference
mode-switchable engine
dependency-aware layer offloading