🤖 AI Summary
Existing vision-language models (VLMs) often stack multiple visual encoders, incurring high computational overhead for diminishing marginal gains. Method: We propose SCOPE, a Mixture-of-Encoders framework with a lightweight router trained via auxiliary losses. It couples a shared encoder with a multi-expert pool, uses instance-level routing to dynamically select the most suitable visual encoder per input, applies text–vision cross-attention to optimize cross-modal perception, and adds a dual-entropy regularization strategy that jointly enforces load balancing and routing confidence. Contribution/Results: With only one shared plus one routed encoder, SCOPE surpasses baselines that use four additional encoders across mainstream VLM tasks, achieving significant performance gains while reducing inference FLOPs by 24%–49% and breaking the conventional paradigm of fixed multi-encoder aggregation.
📝 Abstract
Vision-language models (VLMs) benefit from multiple vision encoders, but naively stacking them yields diminishing returns while multiplying inference costs. We propose SCOPE, a Mixture-of-Encoders (MoEnc) framework that dynamically selects one specialized encoder per image–text pair via instance-level routing, unlike the token-level routing of traditional MoE. SCOPE maintains a shared encoder and a pool of routed encoders. A lightweight router uses cross-attention between text prompts and shared visual features to select the optimal encoder from the pool. To train this router, we introduce dual-entropy regularization with auxiliary losses that balance dataset-level load distribution against instance-level routing confidence. Remarkably, SCOPE with one shared plus one routed encoder outperforms models using all four extra encoders simultaneously, while reducing compute by 24–49%. This demonstrates that intelligent encoder selection beats brute-force aggregation, challenging the prevailing paradigm in multi-encoder VLMs.
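The routing mechanism described above can be sketched in a few lines. This is a minimal, illustrative NumPy version, not the paper's implementation: the weight names (`W_q`, `W_k`, `enc_proj`), the single-query attention pooling, the loss weights, and the exact form of the two entropy terms are assumptions. The load-balancing term pushes the *batch-averaged* routing distribution toward uniform (high entropy), while the confidence term pushes each *per-instance* distribution toward a single encoder (low entropy).

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def route(text_emb, vis_feats, W_q, W_k, enc_proj):
    """Instance-level router (sketch): text query attends over shared
    visual tokens, and the pooled feature scores each routed encoder.
    Shapes: text_emb (B, d), vis_feats (B, T, d), enc_proj (d, E)."""
    q = text_emb @ W_q                                   # (B, d)
    k = vis_feats @ W_k                                  # (B, T, d)
    scores = np.einsum('bd,btd->bt', q, k) / np.sqrt(q.shape[-1])
    attn = softmax(scores)                               # (B, T)
    pooled = np.einsum('bt,btd->bd', attn, k)            # (B, d)
    probs = softmax(pooled @ enc_proj)                   # (B, E)
    return probs, probs.argmax(-1)                       # argmax = selected encoder

def dual_entropy_loss(probs, lam_bal=1.0, lam_conf=1.0):
    """Dual-entropy auxiliary loss (assumed form): maximize entropy of the
    batch-mean routing distribution (load balance), minimize mean
    per-instance entropy (routing confidence)."""
    eps = 1e-9
    mean_p = probs.mean(0)                               # (E,) dataset-level load
    h_load = -(mean_p * np.log(mean_p + eps)).sum()      # want this HIGH
    h_inst = -(probs * np.log(probs + eps)).sum(-1).mean()  # want this LOW
    return -lam_bal * h_load + lam_conf * h_inst
```

At inference only the shared encoder, the tiny router, and the single selected routed encoder run, which is where the claimed 24–49% FLOP reduction would come from.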