AutoNeural: Co-Designing Vision-Language Models for NPU Inference

📅 2025-12-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing vision-language models (VLMs) suffer from low inference efficiency on edge NPUs due to ViT quantization fragility and I/O bottlenecks in autoregressive attention. To address this, we propose an NPU-native co-design framework: (1) a MobileNetV5-style visual backbone for enhanced INT4/8/16 quantization robustness; (2) a linear-complexity language backbone integrating state space models (SSMs) and gated convolutions to eliminate KV cache overhead; and (3) hardware-aware compilation via depthwise separable convolutions, integer-only quantized inference, and auto-encoding optimization. Experiments show that, versus baselines, our approach reduces visual encoder quantization error by 7×, decreases end-to-end latency by 14×, increases decoding throughput by 3×, and extends context length by 4×. The framework achieves real-time in-cabin multimodal inference on the Qualcomm SA8295P chip.
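The INT4/8/16 robustness claim hinges on activation ranges: with symmetric integer quantization, the step size (and thus the rounding error) scales with the largest activation magnitude, which is why bounded post-convolution distributions quantize well while outlier-heavy ViT activations do not. A minimal NumPy sketch of symmetric per-tensor INT8 quantization (illustrative only, not the paper's actual pipeline):

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor INT8 quantization.

    The scale is set by the largest |x|, so a single outlier inflates
    the step size and wipes out precision for the bulk of values.
    """
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map INT8 codes back to float for error measurement."""
    return q.astype(np.float32) * scale

# Bounded activations (e.g. after a clipped conv) quantize with small
# error; a heavy-tailed, ViT-style distribution with rare large outliers
# quantizes the same values far less accurately.
```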

📝 Abstract
While Neural Processing Units (NPUs) offer high theoretical efficiency for edge AI, state-of-the-art Vision-Language Models (VLMs) tailored for GPUs often falter on these substrates. We attribute this hardware-model mismatch to two primary factors: the quantization brittleness of Vision Transformers (ViTs) and the I/O-bound nature of autoregressive attention mechanisms, which fail to utilize the high arithmetic throughput of NPUs. To bridge this gap, we propose AutoNeural, an NPU-native VLM architecture co-designed for integer-only inference. We replace the standard ViT encoder with a MobileNetV5-style backbone built on depthwise separable convolutions, which ensures bounded activation distributions for stable INT4/8/16 quantization. Complementing this, our language backbone integrates State-Space Model (SSM) principles with Transformer layers, employing efficient gated convolutions to achieve linear-time complexity. This hybrid design eliminates the heavy memory I/O overhead of key-value caching during generation. Our approach delivers substantial efficiency gains, reducing the vision encoder's quantization error by up to 7x and end-to-end latency by 14x compared to conventional baselines. AutoNeural also delivers 3x decoding throughput and a 4x longer context window than the baseline. We validate these improvements via a real-world automotive case study on the Qualcomm SA8295P SoC, demonstrating real-time performance for cockpit applications. Our results highlight that rethinking model topology specifically for NPU constraints is a prerequisite for robust multi-modal edge intelligence.
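As a rough illustration of why a depthwise separable backbone is cheap, the k-by-k spatial filtering and the cross-channel mixing are factorized into two small steps instead of one dense convolution. A toy NumPy sketch (not the paper's MobileNetV5 code; shapes and names are my own):

```python
import numpy as np

def conv2d(x, w):
    """Naive 'valid' 2-D convolution: x (C_in, H, W), w (C_out, C_in, k, k)."""
    c_out, c_in, k, _ = w.shape
    _, h, wd = x.shape
    out = np.zeros((c_out, h - k + 1, wd - k + 1))
    for o in range(c_out):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[o, i, j] = np.sum(x[:, i:i+k, j:j+k] * w[o])
    return out

def depthwise_separable(x, w_dw, w_pw):
    """Depthwise (per-channel k x k) conv followed by pointwise (1 x 1) conv.

    w_dw: (C_in, k, k) one spatial filter per input channel.
    w_pw: (C_out, C_in) channel-mixing matrix applied at every position.
    """
    c_in, h, wd = x.shape
    k = w_dw.shape[-1]
    dw = np.zeros((c_in, h - k + 1, wd - k + 1))
    for c in range(c_in):
        for i in range(dw.shape[1]):
            for j in range(dw.shape[2]):
                dw[c, i, j] = np.sum(x[c, i:i+k, j:j+k] * w_dw[c])
    # Pointwise 1x1 conv mixes channels: (C_out, C_in) @ (C_in, H', W')
    return np.einsum("oc,chw->ohw", w_pw, dw)

# Parameter count for C_in=32, C_out=64, k=3:
#   standard conv:      64*32*3*3          = 18432
#   depthwise separable: 32*3*3 + 64*32    =  2336   (~7.9x fewer weights)
```

The same factorization also shrinks activation ranges per stage, which is part of why such backbones quantize more gracefully than ViT attention blocks.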
Problem

Research questions and friction points this paper is trying to address.

Optimizes Vision-Language Models for efficient NPU inference.
Reduces quantization brittleness and I/O bottlenecks in edge AI.
Enables real-time multimodal applications on resource-limited hardware.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Replaces ViT with MobileNetV5 for stable quantization
Combines SSM with Transformer for linear-time decoding
Eliminates KV cache overhead via hybrid architecture
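The "no KV cache" point follows from the SSM recurrence: decoding carries only a fixed-size hidden state per layer, so per-token memory is O(1) rather than growing with context length the way attention's key/value cache does. A toy diagonal linear SSM decode loop (illustrative only; real gated-convolution/SSM hybrids add input-dependent gating):

```python
import numpy as np

def ssm_decode(xs, A, B, C):
    """Decode a 1-D input stream with a diagonal linear SSM.

    A, B, C: (d_state,) vectors (diagonal state matrix, input and
    output projections). The state h is fixed-size regardless of how
    many tokens have been consumed -- the memory footprint does not
    grow with sequence length, unlike a KV cache.
    """
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in xs:
        h = A * h + B * x_t   # elementwise recurrence (diagonal A)
        ys.append(C @ h)      # project state to a scalar output
    return np.array(ys)
```

Unrolling the loop gives y_t = C @ (sum over s of A^(t-s) * B * x_s), i.e. a convolution with a decaying kernel, which is what enables linear-time training and constant-memory decoding.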
Wei Chen (Nexa AI)
Liangmin Wu (Geely Auto)
Yunhai Hu (NYU)
Zhiyuan Li (Nexa AI)
Zhiyuan Cheng (Purdue University)
Yicheng Qian (Nexa AI)
Lingyue Zhu (Nexa AI)
Zhipeng Hu (Nexa AI)
Luoyi Liang (Geely Auto)
Qiang Tang (Geely Auto)
Zhen Liu (Geely Auto)
Han Yang (Geely Auto)