🤖 AI Summary
To address the limitations of existing models in Traditional Chinese multimodal understanding and the challenges of lightweight deployment, this paper introduces the Breeze 2 series (3B/8B), built upon the Llama 3 architecture and featuring the first Traditional Chinese–optimized multimodal design: (i) integration of a ViT-based visual encoder with a cross-modal bridging module for end-to-end joint image-text modeling; and (ii) incorporation of template-guided function calling and multi-task alignment fine-tuning to enhance instruction following and tool-use robustness. Evaluated on benchmarks spanning Taiwan-specific commonsense reasoning, long-context comprehension, visual question answering, and function calling, Breeze 2 achieves state-of-the-art performance across all tasks. The 3B variant supports efficient on-device deployment. All models are open-sourced under the Llama 3 Community License, establishing the first publicly available, production-ready foundation for Traditional Chinese multimodal research and applications.
📝 Abstract
Breeze 2 is a suite of advanced multi-modal language models, available in 3B and 8B parameter configurations, specifically designed to enhance Traditional Chinese language representation. Building upon Llama 3, Breeze 2 continues pretraining on an extensive corpus to deepen its coverage of the linguistic and cultural heritage of Traditional Chinese. It incorporates vision-aware capabilities through a visual encoder and a bridge module, and supports function-calling via prompt templates and post-training on function-calling data. The effectiveness of Breeze 2 is benchmarked across various tasks, including Taiwan general knowledge, instruction-following, long context, function calling, and vision understanding. Furthermore, we showcase the capabilities of its 3B model in a mobile application. We are publicly releasing all Breeze 2 models under the Llama 3 Community License.
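The abstract mentions that function calling is supported via prompt templates plus post-training on function-calling data. A minimal sketch of how such template-guided function calling typically works is shown below; the template wording, tool schema, and helper names here are illustrative assumptions, not Breeze 2's actual prompt format.

```python
import json

# Hypothetical tool schema (illustrative only; not from the Breeze 2 paper).
TOOL_SCHEMA = {
    "name": "get_weather",
    "description": "Query current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}


def build_function_calling_prompt(user_query: str, tools: list) -> str:
    """Embed tool schemas into the prompt via a fixed template,
    so the model can decide whether to emit a structured tool call."""
    tool_block = "\n".join(json.dumps(t, ensure_ascii=False) for t in tools)
    return (
        "You may call one of the following tools by replying with a JSON "
        'object of the form {"name": ..., "arguments": {...}}.\n'
        f"Tools:\n{tool_block}\n\nUser: {user_query}"
    )


def parse_tool_call(model_output: str):
    """Parse the model's reply; return the tool call dict if one was emitted,
    otherwise None (the reply is then treated as plain text)."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return None
    if isinstance(call, dict) and "name" in call and "arguments" in call:
        return call
    return None


prompt = build_function_calling_prompt("What's the weather in Taipei?", [TOOL_SCHEMA])
# A well-trained model would answer with a parseable tool call, e.g.:
call = parse_tool_call('{"name": "get_weather", "arguments": {"city": "Taipei"}}')
```

After the call is parsed, the application executes the named function and feeds the result back to the model for the final answer; post-training on function-calling data teaches the model to follow this template reliably.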