ZAYA1-VL-8B Technical Report

📅 2026-05-08

🤖 AI Summary

This work targets the limited performance of compact vision-language models on image understanding, reasoning, and counting tasks. It proposes a mixture-of-experts vision-language model built on the authors' in-house language model, ZAYA1-8B. The model combines vision-specific LoRA adapters, bidirectional attention among image tokens, sequence packing, and a customized attention masking scheme. Of its 9.2 billion total parameters, only 1.4 billion (including the vision encoder) are active per forward pass. The model surpasses Qwen2.5-VL-3B, PLM-3B, and MolmoE-1B across a range of benchmarks and is competitive with Molmo2-4B and InternVL3.5-4B.
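To make the interaction between sequence packing and bidirectional image attention concrete, here is a minimal sketch of one way such a mask could be built. It is not the authors' implementation: the function name `build_attention_mask`, the `segment_ids` and `is_image` inputs, and the choice to let all image tokens within the same packed sample attend to each other are assumptions made for illustration.

```python
def build_attention_mask(segment_ids, is_image):
    """Return mask[q][k] == True when query token q may attend to key token k.

    segment_ids: list of ints, the packed sample each token belongs to (hypothetical input).
    is_image:    list of bools, True for image tokens (hypothetical input).

    Assumed rules (not confirmed by the source):
      * tokens never attend across packed samples;
      * text tokens attend causally (only to earlier or same positions);
      * image tokens in the same sample attend to each other in both directions.
    """
    n = len(segment_ids)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            same_sample = segment_ids[q] == segment_ids[k]
            causal = k <= q
            both_image = is_image[q] and is_image[k]
            mask[q][k] = same_sample and (causal or both_image)
    return mask


# Two packed samples: [img, img, txt, txt] and [img, img, txt].
segment_ids = [0, 0, 0, 0, 1, 1, 1]
is_image = [True, True, False, False, True, True, False]
mask = build_attention_mask(segment_ids, is_image)
```

In this toy layout, the first image token can see the second one even though it comes later, text tokens remain strictly causal, and nothing in the second sample can see the first. A real implementation would typically build the same pattern as a tensor and pass it to the attention kernel.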

📝 Abstract

We present ZAYA1-VL-8B, a compact mixture-of-experts vision-language model built upon our in-house language model, ZAYA1-8B. Despite its compact size, ZAYA1-VL achieves performance competitive with leading base models such as Molmo2-4B and InternVL3.5-4B, while surpassing models including Qwen2.5-VL-3B, PLM-3B, and MolmoE-1B across a range of image understanding, reasoning, and counting benchmarks. The architecture incorporates two key innovations: (1) vision-specific LoRA adapters integrated into the LLM to increase modality-specific capacity without increasing the number of experts, and (2) bidirectional attention over image tokens within the LLM to enhance visual understanding. We detail the full training pipeline including data composition at each stage, sequence packing, and the attention masking scheme. The model comprises 9.2B total parameters, with 1.4B active parameters including the vision encoder, and is publicly available at https://huggingface.co/Zyphra/ZAYA1-VL.
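The idea of a vision-specific LoRA adapter can be sketched as a low-rank update that is added to a linear layer's output only for image tokens. The sketch below is a toy under stated assumptions, not the report's method: the per-token gating by an `is_image` flag, the `alpha / rank` scaling (the common LoRA convention), and all names (`vision_lora_linear`, `matvec`, `W`, `A`, `B`) are hypothetical.

```python
def matvec(v, M):
    """Multiply row vector v (length d_in) by matrix M (d_in x d_out)."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]


def vision_lora_linear(x, is_image, W, A, B, alpha=16.0):
    """Apply y = x W, plus a low-rank update (alpha / rank) * x A B on image tokens only.

    x:        list of token vectors, each of length d_in.
    is_image: list of bools, True for image tokens (assumed gating signal).
    W:        base weight, d_in x d_out (shared by all tokens).
    A, B:     LoRA factors, d_in x rank and rank x d_out.
    """
    rank = len(B)
    scale = alpha / rank
    out = []
    for token, img in zip(x, is_image):
        y = matvec(token, W)
        if img:
            # Low-rank, vision-only correction added on top of the shared weights.
            delta = matvec(matvec(token, A), B)
            y = [yi + scale * di for yi, di in zip(y, delta)]
        out.append(y)
    return out


# The same input vector, once tagged as an image token and once as text.
x = [[1.0, 2.0], [1.0, 2.0]]
is_image = [True, False]
W = [[1.0, 0.0], [0.0, 1.0]]   # identity base weight
A = [[1.0], [0.0]]             # rank-1 down-projection
B = [[0.0, 1.0]]               # rank-1 up-projection
y = vision_lora_linear(x, is_image, W, A, B, alpha=2.0)
```

Under this reading, text tokens pass through the unmodified base weights, so the extra capacity is spent only on the visual modality and the number of experts stays unchanged, which matches the motivation given in the abstract.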

Problem

Research questions and friction points this paper is trying to address.

vision-language model

mixture-of-experts

image understanding

compact model

multimodal reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts

Vision-Language Model

LoRA Adapters