Zoomer: Adaptive Image Focus Optimization for Black-box MLLM

📅 2025-04-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the loss of critical visual detail in black-box multimodal large language models (MLLMs) caused by token limits during image understanding, this paper proposes an adaptive image-focus optimization mechanism. The method integrates three core innovations: prompt-aware dynamic highlighting of relevant regions, a spatial-preserving visual layout, and budget-aware token allocation. Leveraging lightweight visual saliency modeling, differentiable region cropping, and fine-grained token budget control, it provides efficient visual guidance without access to model parameters or any fine-tuning. Evaluated across multiple benchmark datasets, the approach achieves up to a 26.9% absolute accuracy improvement over baselines while significantly reducing token consumption. It effectively balances global semantic comprehension with local detail perception, demonstrating strong practicality for resource-constrained, black-box MLLM deployment.
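The saliency-then-crop step of the pipeline can be sketched in a few lines. This is a minimal illustration only: the gradient-magnitude saliency and exhaustive window search below are simple stand-ins for the paper's lightweight saliency model and region selection, and the names `saliency_map` and `top_region` are hypothetical.

```python
import numpy as np

def saliency_map(img: np.ndarray) -> np.ndarray:
    """Gradient-magnitude saliency: a toy stand-in for a learned
    lightweight saliency model (illustrative assumption)."""
    gy, gx = np.gradient(img.astype(float))
    return np.hypot(gx, gy)

def top_region(sal: np.ndarray, size: int) -> tuple:
    """Return (row, col) of the size x size window with the highest
    total saliency, via brute-force search (fine for a sketch)."""
    best, best_rc = -1.0, (0, 0)
    h, w = sal.shape
    for r in range(h - size + 1):
        for c in range(w - size + 1):
            s = sal[r:r + size, c:c + size].sum()
            if s > best:
                best, best_rc = s, (r, c)
    return best_rc

# Toy 8x8 "image" with a bright 3x3 patch at rows 2-4, cols 3-5.
img = np.zeros((8, 8))
img[2:5, 3:6] = 1.0

r, c = top_region(saliency_map(img), 3)
crop = img[r:r + 3, c:c + 3]  # the highlighted region sent to the MLLM
```

In the full system this crop would be rendered back into the prompt while the layout schema preserves its original spatial position relative to the global view.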

📝 Abstract
Recent advancements in multimodal large language models (MLLMs) have broadened the scope of vision-language tasks, excelling in applications like image captioning and interactive question-answering. However, these models struggle with accurately processing visual data, particularly in tasks requiring precise object recognition and fine visual details. Stringent token limits often result in the omission of critical information, hampering performance. To address these limitations, we introduce SysName, a novel visual prompting mechanism designed to enhance MLLM performance while preserving essential visual details within token limits. SysName features three key innovations: a prompt-aware strategy that dynamically highlights relevant image regions, a spatial-preserving orchestration schema that maintains object integrity, and a budget-aware prompting method that balances global context with crucial visual details. Comprehensive evaluations across multiple datasets demonstrate that SysName consistently outperforms baseline methods, achieving up to a 26.9% improvement in accuracy while significantly reducing token consumption.
Problem

Research questions and friction points this paper is trying to address.

Enhance MLLM performance in vision-language tasks
Improve object recognition and visual detail accuracy
Reduce token consumption while preserving critical information
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic prompt-aware strategy for relevant image regions
Spatial-preserving schema maintaining object integrity
Budget-aware method balancing global and local details
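The budget-aware idea above, splitting a fixed token budget between one global view and the salient crops, can be sketched as follows. The 40/60 split, the `global_frac` knob, and the function name are illustrative assumptions, not values from the paper.

```python
def allocate_tokens(budget: int, num_regions: int,
                    global_frac: float = 0.4) -> tuple[int, list[int]]:
    """Split a visual-token budget between one global (downsampled) view
    and num_regions salient crops. global_frac is a hypothetical knob."""
    global_tokens = int(budget * global_frac)
    remaining = budget - global_tokens
    per_region = remaining // num_regions
    leftover = remaining - per_region * num_regions
    # Hand any rounding leftover to the first (most salient) region.
    regions = [per_region + (leftover if i == 0 else 0)
               for i in range(num_regions)]
    return global_tokens, regions

g, regs = allocate_tokens(100, 3)   # → 40 global, [20, 20, 20] local
```

Every token is accounted for, so the global context is never squeezed out entirely by local crops, which matches the paper's stated goal of balancing global semantics with local detail.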