🤖 AI Summary
To meet the urgent demand for efficient, robust, multimodal intelligence on edge devices, this work proposes a hardware-software co-designed on-device multimodal large language model (MLLM) architecture. We develop two lightweight models, Megrez-3B-Instruct and Megrez-3B-Omni, that jointly integrate language modeling, cross-modal alignment, hardware-aware training, and quantization-aware compression. Evaluated on image, text, and audio understanding tasks, the models achieve state-of-the-art accuracy among lightweight multimodal models. With only 3 billion parameters, Megrez-3B-Omni also delivers a 2.3× speedup in measured inference, enabling low-latency, real-time inference and seamless on-device deployment. Our approach significantly improves the generality, accuracy, and robustness of edge AI systems while operating within the stringent computational and memory budgets typical of resource-constrained edge platforms.
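The summary names quantization-aware compression as one ingredient of the hardware-software co-design. The Megrez training pipeline itself is not reproduced here; as a rough illustration of the general technique, below is a minimal quantization-aware training (QAT) sketch using PyTorch's `torch.ao.quantization` API on a toy module. The `TinyMLP` model, the `fbgemm` qconfig, and the dummy fine-tuning loop are illustrative assumptions, not details from the paper.

```python
# Minimal QAT sketch (illustrative only; not the Megrez implementation).
import torch
import torch.nn as nn
import torch.ao.quantization as tq

class TinyMLP(nn.Module):
    """Toy stand-in for a model block; real MLLM layers are far larger."""
    def __init__(self, dim: int = 64):
        super().__init__()
        # Quant/DeQuant stubs mark where tensors enter and leave the quantized region.
        self.quant = tq.QuantStub()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.dequant = tq.DeQuantStub()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.dequant(self.net(self.quant(x)))

model = TinyMLP().train()
# Attach fake-quantization observers so training "sees" int8 rounding error.
model.qconfig = tq.get_default_qat_qconfig("fbgemm")
tq.prepare_qat(model, inplace=True)

opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
for _ in range(10):  # stand-in fine-tuning loop with a dummy objective
    x = torch.randn(32, 64)
    loss = model(x).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Convert the fake-quantized modules into real int8 kernels for deployment.
int8_model = tq.convert(model.eval())
```

In a real on-device pipeline, fake quantization of this kind would be applied to the transformer blocks during fine-tuning so that the deployed low-precision weights match what the model saw at training time; the sketch only conveys the shape of that workflow, not the specific scheme used for Megrez.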
📝 Abstract
In this work, we present the Megrez models, comprising a language model (Megrez-3B-Instruct) and a multimodal model (Megrez-3B-Omni). These models are designed to deliver fast inference, compactness, and robust edge-side intelligence through a software-hardware co-design approach. Megrez-3B-Instruct offers several advantages, including high accuracy, high speed, ease of use, and a wide range of applications. Building on Megrez-3B-Instruct, Megrez-3B-Omni is an on-device multimodal understanding LLM that supports image, text, and audio analysis. It achieves state-of-the-art accuracy across all three modalities among models of comparable size and demonstrates strong versatility and robustness, setting a new benchmark for multimodal edge intelligence.