Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

📅 2025-03-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address capability bottlenecks of small-scale multimodal language models in mathematical reasoning, code generation, multilingual support, and long-context modeling, this work introduces Phi-4-Mini (3.8B) and its unified multimodal extension, Phi-4-Multimodal. Methodologically, we propose: (1) a novel modality-specific LoRA router enabling interference-free joint inference over text, vision, and speech; (2) a 200K-token vocabulary integrated with grouped-query attention (GQA) to enhance multilingual coverage and long-sequence efficiency; and (3) a training pipeline combining high-quality synthetic data distillation, MoE-style LoRA adaptation, and multi-stage instruction tuning. Experiments demonstrate that Phi-4-Mini surpasses same-parameter open-source models on mathematical and coding benchmarks—matching the performance of 7B-class models. Phi-4-Multimodal achieves state-of-the-art cross-modal reasoning and outperforms larger models on multimodal benchmarks including OpenASR, establishing new efficiency–capability trade-offs for compact multimodal architectures.

📝 Abstract
We introduce Phi-4-Mini and Phi-4-Multimodal, compact yet highly capable language and multimodal models. Phi-4-Mini is a 3.8-billion-parameter language model trained on high-quality web and synthetic data, significantly outperforming recent open-source models of similar size and matching the performance of models twice its size on math and coding tasks requiring complex reasoning. This achievement is driven by a carefully curated synthetic data recipe emphasizing high-quality math and coding datasets. Compared to its predecessor, Phi-3.5-Mini, Phi-4-Mini features an expanded vocabulary of 200K tokens to better support multilingual applications, as well as grouped-query attention for more efficient long-sequence generation. Phi-4-Multimodal is a multimodal model that integrates text, vision, and speech/audio input modalities into a single model. Its novel modality extension approach leverages LoRA adapters and modality-specific routers to allow multiple inference modes combining various modalities without interference. For example, it currently ranks first on the OpenASR leaderboard, even though the LoRA component of the speech/audio modality has only 460 million parameters. Phi-4-Multimodal supports scenarios involving (vision + language), (vision + speech), and (speech/audio) inputs, outperforming larger vision-language and speech-language models on a wide range of tasks. Additionally, we further train an experimental version of Phi-4-Mini to enhance its reasoning capabilities. Despite its compact 3.8-billion-parameter size, this experimental version achieves reasoning performance on par with or surpassing significantly larger models, including DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-Llama-8B.
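The grouped-query attention mentioned in the abstract can be sketched as follows. This is an illustrative minimal sketch, not the Phi-4-Mini kernel; the head counts and dimensions are assumptions. The key idea is that query heads are partitioned into groups that share a single K/V head, shrinking the KV cache by the ratio of query heads to KV heads.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def grouped_query_attention(q, k, v):
    """Minimal GQA sketch (illustrative shapes, not the paper's implementation).
    q: (n_q_heads, seq, d); k, v: (n_kv_heads, seq, d), where n_kv_heads
    divides n_q_heads. Each group of query heads reads one shared K/V head,
    so the KV cache is n_q_heads / n_kv_heads times smaller than in MHA."""
    n_q_heads, seq, d = q.shape
    n_kv_heads = k.shape[0]
    group = n_q_heads // n_kv_heads
    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // group  # query head h uses its group's shared K/V head
        scores = softmax(q[h] @ k[kv].T / np.sqrt(d))
        out[h] = scores @ v[kv]
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 4, 16))  # 8 query heads
k = rng.standard_normal((2, 4, 16))  # only 2 KV heads: 4x smaller KV cache
v = rng.standard_normal((2, 4, 16))
out = grouped_query_attention(q, k, v)
assert out.shape == q.shape
```

With n_kv_heads equal to n_q_heads the routine degenerates to standard multi-head attention, which is why GQA can trade off cache size against quality by varying the group size alone.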
Problem

Research questions and friction points this paper is trying to address.

Develop compact, high-performance multimodal language models.
Enhance reasoning and multilingual support in small-scale models.
Integrate multiple input modalities efficiently without performance loss.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Mixture-of-LoRAs for multimodal integration
Expands vocabulary to 200K tokens for multilingual support
Enhances reasoning with curated synthetic data
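The Mixture-of-LoRAs idea above can be sketched as a frozen base layer with one low-rank adapter per modality, where a modality tag routes each input to its own adapter. This is a hypothetical minimal sketch under assumed shapes, not the paper's implementation; the class name and dimensions are illustrative.

```python
import numpy as np

class ModalityLoRALinear:
    """Hypothetical sketch of a Mixture-of-LoRAs linear layer: a single
    frozen base weight shared by all modalities, plus a low-rank (A, B)
    adapter pair per modality. Routing by modality tag means training or
    adding one modality's adapter never perturbs the others."""

    def __init__(self, d_in, d_out, rank=8,
                 modalities=("text", "vision", "speech")):
        rng = np.random.default_rng(0)
        self.W = rng.standard_normal((d_out, d_in)) * 0.02  # frozen base
        # Per-modality adapters; B is zero-initialized so each adapter
        # starts as an exact no-op (standard LoRA initialization).
        self.A = {m: rng.standard_normal((rank, d_in)) * 0.01
                  for m in modalities}
        self.B = {m: np.zeros((d_out, rank)) for m in modalities}

    def __call__(self, x, modality):
        # Route the input through the requested modality's adapter only.
        delta = self.B[modality] @ (self.A[modality] @ x)
        return self.W @ x + delta

layer = ModalityLoRALinear(d_in=16, d_out=16)
x = np.ones(16)
# With zero-initialized B, every route matches the frozen base output.
assert np.allclose(layer(x, "vision"), layer.W @ x)
```

Because the base weights stay frozen and each adapter's parameters are disjoint, the outputs for one modality are unchanged by updates to another adapter, which is the interference-free property the summary describes.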
👥 Authors

Abdelrahman Abouelenin (Microsoft)
Atabak Ashfaq (Microsoft): Text Summarization, NL-to-SQL, RAG, LLM Post-training
Adam Atkinson (Microsoft Turing + Microsoft Research GenAI): Machine Learning
H. Awadalla (Microsoft)
Nguyen Bach (Microsoft): Machine Translation, Natural Language Processing, Speech Translation, Speech Recognition, Summarization
Jianmin Bao (Microsoft Research): Computer Vision, AIGC, Deep Generative Models, Deep Learning
Alon Benhaim (Microsoft): Large Language Models, Natural Language Processing
Martin Cai (Microsoft)
Vishrav Chaudhary (Microsoft AI): Neural Machine Translation, Natural Language Processing, Machine Learning
Congcong Chen (Microsoft)
Dongdong Chen (Microsoft)
Junkun Chen (Microsoft): Natural Language Processing
Weizhu Chen (Microsoft, Technical Fellow): Deep Learning, Natural Language Processing, Machine Learning
Yen-Chun Chen (Researcher, Microsoft): Natural Language Processing, Computer Vision, Multimodal AI
Yi-ling Chen (Microsoft)
Qi Dai (Microsoft)
Xiyang Dai (Microsoft): Computer Vision, Deep Learning
Ruchao Fan (Microsoft)
Mei Gao (PhD, UCLA): Atmospheric and Oceanic Science, Statistics
Min Gao (Microsoft)
Amit Garg (Microsoft)
Abhishek Goswami (Microsoft)
Junheng Hao (Microsoft)
Amr Hendy (Microsoft)
Yuxuan Hu (Microsoft)
Xin Jin (Microsoft)
Mahmoud Khademi (Microsoft)
Dongwoo Kim (Microsoft)
Young Jin Kim (Microsoft; Georgia Tech, CSE): Large Language Models, AI, Machine Learning, Parallel Computing
Gina Lee (Microsoft)
Jinyu Li (Partner Applied Science Manager, Microsoft): Acoustic Modeling, Speech Recognition, Speech Translation
Yunsheng Li (Microsoft): Computer Vision
Chen Liang (Microsoft)
Xihui Lin (Microsoft)
Zeqi Lin (Microsoft): Code Generation, Machine Reasoning
Mengchen Liu (Meta): Visual Analytics, Machine Learning
Yang Liu (Microsoft)
Gilsinia Lopez (Microsoft): NLP
Chong Luo (Microsoft Research): Multimedia Communications, Computer Vision
Piyush Madan (Microsoft)
V. Mazalov (Microsoft)
Ali Mousavi (Microsoft)
Anh Nguyen (Microsoft)
Jing Pan (Microsoft)
D. Perez-Becker (Microsoft)
Jacob Platin (Microsoft)
Thomas Portet (Microsoft)
Kai Qiu (Microsoft)
Bo Ren (Microsoft)
Liliang Ren (Microsoft GenAI): Neural Networks, Long Sequence Modeling, Large Language Models
Sambuddha Roy (Microsoft)
Ning Shang (Microsoft)
Yelong Shen (Microsoft): NLP, Machine Learning
Saksham Singhal (Microsoft): NLP, Large Scale Modeling, Multimodal
Subhojit Som (Senior Applied Scientist, Microsoft Corporation): Machine Learning, Signal Processing
Xiaocheng Song (Microsoft)
Tetyana Sych (Microsoft)
Praneetha Vaddamanu (Applied Scientist, Microsoft Turing): Computer Science, Artificial Intelligence
Shuohang Wang (Principal Researcher, Microsoft GenAI)
Yiming Wang (Microsoft)
Zhenghao Wang (Microsoft)
Haibin Wu (Meta): Speech Processing, Multi-modal, Speech Synthesis, LLM
Haoran Xu (Microsoft)
Weijian Xu (Microsoft): Deep Learning, Computer Vision, Multi-modal Learning
Yifan Yang (Microsoft)
Ziyi Yang (Microsoft)
Donghan Yu (Apple): NLP, LLM
I. Zabir (Microsoft)
Jianwen Zhang (Vocaela AI): Small Language Models, Vision Language Models, GUI Agents, NL-to-SQL, Knowledge Graphs
Li Lyna Zhang (Microsoft Research Asia): Artificial Intelligence, Deep Learning, Reinforcement Learning, Long-Context
Yunan Zhang (Microsoft)
Xiren Zhou (Microsoft)