🤖 AI Summary
To meet the urgent demand for efficient, robust, multimodal intelligence on edge devices, this work proposes a hardware-software co-designed on-device multimodal large language model (MLLM) architecture. We develop two lightweight models, Megrez-3B-Instruct and Megrez-3B-Omni, that jointly integrate language modeling, cross-modal alignment, hardware-aware training, and quantization-aware compression. Evaluated on image, text, and audio understanding tasks, the models achieve state-of-the-art accuracy among lightweight multimodal models. With only 3 billion parameters, Megrez-3B-Omni also delivers a 2.3× speedup in measured inference, enabling low-latency, real-time inference and seamless on-device deployment. Our approach significantly improves the generality, accuracy, and robustness of edge AI systems while operating within the stringent computational and memory budgets typical of resource-constrained edge platforms.
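The summary names quantization-aware compression as one ingredient of the hardware-software co-design. The Megrez training pipeline itself is not reproduced here; as a rough illustration of the general technique, below is a minimal quantization-aware training (QAT) sketch using PyTorch's `torch.ao.quantization` API on a toy module. The `TinyMLP` model, the `fbgemm` qconfig, and the dummy fine-tuning loop are illustrative assumptions, not details from the paper.

```python
# Minimal QAT sketch (illustrative only; not the Megrez implementation).
import torch
import torch.nn as nn
import torch.ao.quantization as tq

class TinyMLP(nn.Module):
    """Toy stand-in for a model block; real MLLM layers are far larger."""
    def __init__(self, dim: int = 64):
        super().__init__()
        # Quant/DeQuant stubs mark where tensors enter and leave the quantized region.
        self.quant = tq.QuantStub()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.dequant = tq.DeQuantStub()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.dequant(self.net(self.quant(x)))

model = TinyMLP().train()
# Attach fake-quantization observers so training "sees" int8 rounding error.
model.qconfig = tq.get_default_qat_qconfig("fbgemm")
tq.prepare_qat(model, inplace=True)

opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
for _ in range(10):  # stand-in fine-tuning loop with a dummy objective
    x = torch.randn(32, 64)
    loss = model(x).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Convert the fake-quantized modules into real int8 kernels for deployment.
int8_model = tq.convert(model.eval())
```

In a real on-device pipeline, fake quantization of this kind would be applied to the transformer blocks during fine-tuning so that the deployed low-precision weights match what the model saw at training time; the sketch only conveys the shape of that workflow, not the specific scheme used for Megrez.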
📝 Abstract
In this work, we present the Megrez models, comprising a language model (Megrez-3B-Instruct) and a multimodal model (Megrez-3B-Omni). These models are designed to deliver fast inference, compactness, and robust edge-side intelligence through a software-hardware co-design approach. Megrez-3B-Instruct offers several advantages, including high accuracy, high speed, ease of use, and a wide range of applications. Building on Megrez-3B-Instruct, Megrez-3B-Omni is an on-device multimodal understanding LLM that supports image, text, and audio analysis. It achieves state-of-the-art accuracy across all three modalities among models of comparable size and demonstrates strong versatility and robustness, setting a new benchmark for multimodal edge intelligence.