Phoenix-VL 1.5 Medium Technical Report

📅 2026-05-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses the challenge of achieving deep regional adaptation, specifically for Singapore, while preserving strong general-purpose multimodal and multilingual capabilities. Building on Mistral Medium 3.1, we develop a natively multimodal and multilingual 123B-parameter foundation model through continued pretraining on a localized trillion-token corpus, long-context extension, post-training on Singapore-specific image-text and textual data, and Online Direct Preference Optimization (Online DPO). Our work demonstrates that regional specialization for a sovereign AI asset can be achieved with minimal degradation to broad general intelligence. We further introduce an evaluation suite of localized knowledge benchmarks and an institutionally aligned model behavior and safety framework. The resulting model achieves state-of-the-art performance for its size on Singapore multimodal, legal, and government policy benchmarks, while remaining competitive on global multimodal, multilingual, and STEM benchmarks.
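The summary names Online DPO but gives no loss or hyperparameters. As a rough illustration only, the standard DPO objective for a single preference pair can be sketched as below. The function name, its arguments, and `beta=0.1` are assumptions made for this example, not values from the report. In an online variant the chosen and rejected responses are typically sampled from the current policy and ranked by a judge; that sampling step is not shown.

```python
import math


def dpo_pair_loss(policy_chosen_logp, policy_rejected_logp,
                  ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are summed sequence log-probabilities under the policy being
    trained and under a frozen reference model. beta is a hypothetical
    value; the report summary does not state one.
    """
    # How much more the policy favors each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) written as a numerically stable softplus(-x).
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


# When policy and reference agree, the loss equals log(2).
baseline = dpo_pair_loss(-2.0, -2.0, -2.0, -2.0)
# Raising the chosen response and lowering the rejected one reduces the loss.
improved = dpo_pair_loss(-1.0, -3.0, -2.0, -2.0)
print(baseline, improved)
```

The loss falls as the policy assigns relatively more probability to the chosen response than the reference model does, which is the behavior preference optimization relies on.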

📝 Abstract

We introduce Phoenix-VL 1.5 Medium, a 123B-parameter natively multimodal and multilingual foundation model adapted to regional languages and the Singapore context. Developed as a sovereign AI asset, it demonstrates that deep domain adaptation can be achieved with minimal degradation to broad-spectrum intelligence and alignment. Starting from Mistral Medium 3.1, continued pretraining was performed on a localized 1-trillion-token multimodal corpus, followed by a 250-billion-token long-context extension phase. Subsequent post-training incorporated a novel human-annotated Singapore multimodal dataset and a curated textual corpus on Singapore culture, knowledge, and legislation, totaling 22 billion tokens. Model alignment through Online Direct Preference Optimization used an additional 5 billion tokens. Phoenix-VL 1.5 Medium achieves state-of-the-art performance for its size on Singapore multimodal, legal, and government policy benchmarks while remaining globally competitive on general multimodal intelligence, multilingual, and STEM benchmarks. We also introduce a novel evaluation suite encompassing localized knowledge benchmarks and an institutionally aligned model behavior and safety framework. We report the data curation principles and training methodology, and highlight benchmark and inference performance.
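To put the stage sizes in proportion, the short sketch below adds up the token budgets quoted in the abstract and prints each stage's share. It assumes the four figures are disjoint budgets that can simply be summed, which the abstract implies but does not state outright. The stage names are labels chosen for this example.

```python
# Token budgets per training stage, in billions, as quoted in the abstract.
# Assumption: the four figures do not overlap.
STAGE_TOKENS_B = {
    "continued_pretraining": 1000,
    "long_context_extension": 250,
    "post_training": 22,
    "online_dpo": 5,
}


def stage_shares(budgets):
    """Return the total budget and each stage's fraction of it."""
    total = sum(budgets.values())
    return total, {name: tokens / total for name, tokens in budgets.items()}


total_b, shares = stage_shares(STAGE_TOKENS_B)
print(f"total: {total_b}B tokens")
for name, share in shares.items():
    print(f"{name}: {share:.2%}")
```

Under that assumption the stages total about 1.277 trillion tokens. Continued pretraining accounts for roughly 78% of it, while post-training and Online DPO together account for about 2%.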

Problem

Research questions and friction points this paper is trying to address.

multimodal foundation model

domain adaptation

multilingual AI

localized knowledge

sovereign AI

Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal foundation model

domain adaptation

sovereign AI