VEN-VL: A Visual Ensemble MoE Framework for Effective and Efficient Multi-Modal Understanding

📅 2026-05-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing efficient multimodal understanding methods suffer significant performance degradation because they compress individual visual cues too aggressively and rely on heuristic pruning, which limits both the information capacity and the information density of visual tokens. To address this, the work proposes VEN-VL, a framework built on an “enrich-then-compact” principle: it first integrates multi-view visual representations to increase information capacity, then progressively compresses visual tokens through adaptive Mixture-of-Experts (MoE) routing across specialized visual experts, while preserving critical semantics via explicit visual reconstruction supervision. According to the authors, this reduces the number of visual tokens while maintaining strong accuracy on complex multimodal tasks, narrowing the gap between performance and computational cost.

📝 Abstract

Despite the remarkable progress of recent efficient methods in accelerating multimodal understanding, they still suffer from noticeable performance degradation. Their emphasis on a high compression ratio for a single visual cue and their reliance on heuristic pruning with coarse attention alignment create a bottleneck in the information capacity and density of visual tokens. To address this limitation, we propose VEN-VL, a visual ensemble MoE framework for effective and efficient perception that follows an enrich-then-compact principle. Specifically, we first enrich the information capacity by unifying visual representations from different perspectives, and then progressively compact them with adaptive routers in specialized visual experts to enhance information density. Furthermore, we incorporate the reconstruction ability of the vanilla structure via explicit visual supervision, which helps preserve crucial information. Experimental results demonstrate the superiority of our method on complex visual tasks with only a few information-condensed tokens, effectively bridging the gap between performance and efficiency.
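The enrich-then-compact flow described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract gives no architectural details, so every function name, shape, and design choice here is an assumption made for illustration. These include concatenation as the fusion step, learned-query attention pooling as each expert, a router conditioned on the mean token, and a linear map for reconstruction.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def enrich(views):
    # Enrich (assumed fusion): concatenate per-token features from several
    # hypothetical visual encoders. Each view has shape (n_tokens, dim_i).
    return np.concatenate(views, axis=-1)

def moe_compress(tokens, router_w, expert_queries):
    # Compact (assumed design): each expert pools n tokens into k tokens with
    # its own query vectors; a router mixes the experts' outputs.
    # tokens: (n, d), router_w: (d, n_experts), expert_queries: (n_experts, k, d)
    d = tokens.shape[-1]
    gate = softmax(tokens.mean(axis=0) @ router_w)           # (n_experts,)
    attn = softmax(expert_queries @ tokens.T / np.sqrt(d))   # (n_experts, k, n)
    pooled = attn @ tokens                                   # (n_experts, k, d)
    return np.tensordot(gate, pooled, axes=1), gate          # (k, d), (n_experts,)

def reconstruction_loss(compressed, target, up_w):
    # Reconstruction supervision (assumed form): map the k compressed tokens
    # back to n token slots with up_w of shape (n, k), then take the MSE.
    return float(np.mean((up_w @ compressed - target) ** 2))

rng = np.random.default_rng(0)
views = [rng.normal(size=(16, 8)), rng.normal(size=(16, 4))]  # two toy "views"
tokens = enrich(views)                                        # (16, 12)

x = tokens
for k in [8, 4]:  # progressive compaction: 16 -> 8 -> 4 tokens
    router_w = rng.normal(size=(12, 3))      # random stand-ins for learned weights
    queries = rng.normal(size=(3, k, 12))
    x, gate = moe_compress(x, router_w, queries)

loss = reconstruction_loss(x, tokens, rng.normal(size=(16, 4)))
print(tokens.shape, x.shape, gate.round(3), loss)
```

In a trained model the router, expert queries, and reconstruction map would be learned jointly, with the reconstruction term added to the task loss. Random weights are used here only to make the shapes and data flow concrete.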

Problem

Research questions and friction points this paper is trying to address.

multimodal understanding

visual token compression

information capacity

attention alignment

performance degradation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual Ensemble

Mixture of Experts (MoE)

Information Density