🤖 AI Summary
This work addresses the challenge of enabling fine-grained visual reasoning over user-specified regions without retraining or fine-tuning multimodal large language models (MLLMs). The authors propose ControlMLLM++, a framework that injects learnable visual prompts into a frozen MLLM at inference time and optimizes a latent visual token modifier via task-specific energy functions to steer the model's attention toward target regions. Because no model parameters are updated, the method supports diverse visual prompts, including bounding boxes, masks, scribbles, and points, while retaining strong out-of-domain generalization and high interpretability. The approach further introduces an improved optimization strategy (Optim++) and a debiasing mechanism (PromptDebias), which together enhance inference stability and accuracy. Extensive experiments show consistent gains across the supported prompt modalities.
📝 Abstract
We propose ControlMLLM++, a novel test-time adaptation framework that injects learnable visual prompts into frozen multimodal large language models (MLLMs) to enable fine-grained region-based visual reasoning without any model retraining or fine-tuning. Leveraging the insight that cross-modal attention maps intrinsically encode semantic correspondences between textual tokens and visual regions, ControlMLLM++ optimizes a latent visual token modifier during inference via a task-specific energy function to steer model attention toward user-specified areas. To enhance optimization stability and mitigate language prompt biases, ControlMLLM++ incorporates an improved optimization strategy (Optim++) and a prompt debiasing mechanism (PromptDebias). Supporting diverse visual prompt types, including bounding boxes, masks, scribbles, and points, our method demonstrates strong out-of-domain generalization and interpretability. The code is available at https://github.com/mrwu-mac/ControlMLLM.
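The test-time loop described above, optimizing a latent modifier of the visual tokens so that cross-modal attention concentrates on a target region, can be sketched as follows. This is a minimal toy illustration, not the authors' implementation: the single-head attention, tensor sizes, energy definition, and optimizer settings are all assumptions made for clarity.

```python
import torch

torch.manual_seed(0)

# Toy stand-ins for a frozen MLLM (nothing below is trained except `delta`).
d = 16                              # hidden dimension
n_vis = 8                           # number of visual tokens
text_q = torch.randn(1, d)          # query for one text token (frozen)
vis_tokens = torch.randn(n_vis, d)  # visual tokens from the frozen encoder

# User-specified region: here we want attention mass on visual tokens 2-3.
target = torch.zeros(n_vis)
target[2:4] = 1.0

# Learnable latent visual token modifier (the only optimized quantity).
delta = torch.zeros(n_vis, d, requires_grad=True)
opt = torch.optim.Adam([delta], lr=0.1)

def attention(q, k):
    # Single-head scaled dot-product attention weights over visual tokens.
    return torch.softmax(q @ k.T / d ** 0.5, dim=-1)

for _ in range(200):
    attn = attention(text_q, vis_tokens + delta)   # shape (1, n_vis)
    # Energy: attention mass falling OUTSIDE the target region.
    energy = (attn * (1 - target)).sum()
    opt.zero_grad()
    energy.backward()
    opt.step()

final_attn = attention(text_q, vis_tokens + delta.detach())
inside = (final_attn * target).sum().item()
print(f"attention inside target region: {inside:.3f}")
```

After a few hundred steps the attention mass shifts almost entirely onto the target tokens. In the actual framework, the energy function is task-specific and the attention maps come from the frozen MLLM's cross-modal layers rather than this toy projection.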