Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders

📅 2026-03-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes Penguin-VL, a compact vision-language model that departs from conventional reliance on large-scale contrastively pretrained visual encoders, which often suppress fine-grained visual details and hinder deployment on resource-constrained devices. Instead, Penguin-VL uniquely initializes its visual encoder directly from a pure text-based large language model (LLM), eliminating the need for contrastive pretraining. By integrating a lightweight architecture, multimodal alignment, and dense visual-semantic modeling, the model achieves substantially improved visual fidelity and data efficiency. Experimental results demonstrate that Penguin-VL matches or even surpasses state-of-the-art models such as Qwen3-VL across multiple image and video benchmarks, with particularly strong performance in document understanding, visual knowledge reasoning, and multi-view video comprehension, all while maintaining computational efficiency and a small footprint.

📝 Abstract
Vision Language Model (VLM) development has largely relied on scaling model size, which hinders deployment on compute-constrained mobile and edge devices such as smartphones and robots. In this work, we explore the performance limits of compact (e.g., 2B and 8B) VLMs. We challenge the prevailing practice that state-of-the-art VLMs must rely on vision encoders initialized via massive contrastive pretraining (e.g., CLIP/SigLIP). We identify an objective mismatch: contrastive learning, optimized for discrimination, enforces coarse and category-level invariances that suppress fine-grained visual cues needed for dense captioning and complex VLM reasoning. To address this issue, we present Penguin-VL, whose vision encoder is initialized from a text-only LLM. Our experiments reveal that Penguin-Encoder serves as a superior alternative to traditional contrastive pretraining, unlocking a higher degree of visual fidelity and data efficiency for multimodal understanding. Across various image and video benchmarks, Penguin-VL achieves performance comparable to leading VLMs (e.g., Qwen3-VL) in mathematical reasoning and surpasses them in tasks such as document understanding, visual knowledge, and multi-perspective video understanding. Notably, these gains are achieved with a lightweight architecture, demonstrating that improved visual representation rather than model scaling is the primary driver of performance. Our ablations show that Penguin-Encoder consistently outperforms contrastive-pretrained encoders, preserving fine-grained spatial and temporal cues that are critical for dense perception and complex reasoning. This makes it a strong drop-in alternative for compute-efficient VLMs and enables high performance in resource-constrained settings. Code: https://github.com/tencent-ailab/Penguin-VL
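The central idea in the abstract, reusing a text-only LLM's transformer blocks as the vision encoder's trunk instead of a contrastively pretrained encoder, can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's actual code; the class names, the toy weight layout, and the patch-embedding design are assumptions.

```python
# Sketch of initializing a vision encoder from a text-only LLM (assumed design,
# not the actual Penguin-VL implementation): the transformer trunk is copied
# from the LLM, and only a new patch-embedding layer replaces token embeddings.

import copy


class TransformerBlock:
    """Stand-in for one transformer layer (attention + MLP weights)."""
    def __init__(self, dim):
        self.dim = dim
        self.weights = [0.0] * dim  # placeholder for real parameters


class TextLLM:
    """A text-only LLM: token embeddings plus a stack of transformer blocks."""
    def __init__(self, dim=16, depth=4, vocab=100):
        self.token_embed = [[0.0] * dim for _ in range(vocab)]
        self.blocks = [TransformerBlock(dim) for _ in range(depth)]


class PatchEmbed:
    """Projects flattened image patches into the LLM's hidden dimension.
    This layer is new and would be trained from scratch in practice."""
    def __init__(self, patch_dim, dim):
        self.proj = [[0.0] * dim for _ in range(patch_dim)]


class VisionEncoderFromLLM:
    """Vision encoder whose transformer trunk is initialized from the LLM's
    blocks, so no contrastive (CLIP/SigLIP-style) pretraining is needed."""
    def __init__(self, llm, patch_dim=48):
        dim = llm.blocks[0].dim
        self.patch_embed = PatchEmbed(patch_dim, dim)   # new layer
        self.blocks = copy.deepcopy(llm.blocks)          # copied from the LLM


llm = TextLLM()
encoder = VisionEncoderFromLLM(llm)
assert len(encoder.blocks) == len(llm.blocks)  # same depth, LLM-initialized
```

The sketch only shows the initialization; the multimodal alignment and dense visual-semantic training described in the summary would happen on top of this starting point.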
Problem

Research questions and friction points this paper is trying to address.

Vision Language Model
model scaling
contrastive pretraining
fine-grained visual cues
compute-constrained deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-language model
LLM-based vision encoder
contrastive pretraining
fine-grained visual representation
compute-efficient VLM
Boqiang Zhang
Tencent AILab
Lei Ke
Researcher, Tencent AI (Seattle)
Computer Vision · Machine Learning · Multi-modal LLMs
Ruihan Yang
Tencent America
generative models · neural compression
Qi Gao
Penguin-VL team at Tencent AI Lab
Tianyuan Qu
Penguin-VL team at Tencent AI Lab
Rossell Chen
Penguin-VL team at Tencent AI Lab
Dong Yu
Penguin-VL team at Tencent AI Lab
Leoweiliang
Penguin-VL team at Tencent AI Lab