🤖 AI Summary
To address architectural inefficiency and limited training scalability in multimodal large language models (MLLMs), this paper introduces the SPHINX-X series. Methodologically: (1) it employs a lightweight visual encoder and pioneers a single-stage, end-to-end full-modality training paradigm; (2) it proposes a skip-token mechanism that dynamically prunes redundant visual tokens from sub-images; and (3) it constructs a multi-domain hybrid dataset—enriched with OCR-intensive and Set-of-Mark–curated samples—and integrates multi-source data distillation. Experiments across base models—from TinyLlama-1.1B to Mixtral-8×7B—demonstrate a strong positive correlation between parameter count, data scale, and multimodal performance. The models are open-sourced, support multilingual and multi-scale deployment, and achieve significant improvements in cross-task generalization and inference efficiency.
📝 Abstract
We propose SPHINX-X, an extensive Multi-modality Large Language Model (MLLM) series developed upon SPHINX. To improve the architecture and training efficiency, we modify the SPHINX framework by removing redundant visual encoders, bypassing fully-padded sub-images with skip tokens, and simplifying the multi-stage training into a one-stage all-in-one paradigm. To fully unleash the potential of MLLMs, we assemble a comprehensive multi-domain and multimodal dataset covering publicly available resources in language, vision, and vision-language tasks. We further enrich this collection with our curated OCR-intensive and Set-of-Mark datasets, extending their diversity and generality. By training over different base LLMs, including TinyLlama-1.1B, InternLM2-7B, LLaMA2-13B, and Mixtral-8x7B, we obtain a spectrum of MLLMs that vary in parameter size and multilingual capabilities. Comprehensive benchmarking reveals a strong correlation between multi-modal performance and the data and parameter scales. Code and models are released at https://github.com/Alpha-VLLM/LLaMA2-Accessory.
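The skip-token idea described above can be illustrated with a minimal sketch: when a high-resolution image is split into fixed-size sub-images, some sub-images consist entirely of padding and carry no content, so they are represented by a single skip token instead of a full visual-token sequence. The names below (`SKIP_TOKEN`, `encode_subimages`, the per-sub-image token count) are hypothetical stand-ins, not the actual SPHINX-X implementation.

```python
import numpy as np

SKIP_TOKEN = -1           # hypothetical id for the learnable skip token
TOKENS_PER_SUBIMAGE = 4   # illustrative; real encoders emit hundreds of tokens

def encode_subimages(subimages, pad_value=0):
    """Replace fully-padded sub-images with one skip token each.

    A sub-image whose pixels are all `pad_value` contributes a single
    SKIP_TOKEN; any other sub-image contributes its full visual-token
    sequence (here a dummy stand-in for a real visual encoder).
    """
    tokens = []
    for img in subimages:
        if np.all(img == pad_value):
            tokens.append(SKIP_TOKEN)              # prune redundant tokens
        else:
            tokens.extend(range(TOKENS_PER_SUBIMAGE))
    return tokens

# A padded sub-image collapses to one token; a content sub-image keeps all four.
seq = encode_subimages([np.zeros((2, 2)), np.ones((2, 2))])
print(seq)  # [-1, 0, 1, 2, 3]
```

The point of the mechanism is that the token sequence fed to the LLM shrinks in proportion to how much of the padded canvas is empty, which reduces attention cost without discarding any real visual content.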