🤖 AI Summary
This work addresses the challenge of enabling fine-grained visual reasoning over user-specified regions without retraining or fine-tuning multimodal large language models (MLLMs). The authors propose ControlMLLM++, a framework that injects learnable visual prompts into a frozen MLLM at inference time and optimizes a latent visual token modifier via task-specific energy functions to steer the model's attention toward target regions. Because no model parameters are updated, the method supports diverse visual prompts, including bounding boxes, masks, scribbles, and points, while retaining strong out-of-domain generalization and high interpretability. The approach further introduces an improved optimization strategy (Optim++) and a debiasing mechanism (PromptDebias), which together enhance inference stability and accuracy. Extensive experiments show consistent gains across the supported prompt modalities.
📝 Abstract
We propose ControlMLLM++, a novel test-time adaptation framework that injects learnable visual prompts into frozen multimodal large language models (MLLMs) to enable fine-grained region-based visual reasoning without any model retraining or fine-tuning. Leveraging the insight that cross-modal attention maps intrinsically encode semantic correspondences between textual tokens and visual regions, ControlMLLM++ optimizes a latent visual token modifier during inference via a task-specific energy function to steer model attention toward user-specified areas. To enhance optimization stability and mitigate language prompt biases, ControlMLLM++ incorporates an improved optimization strategy (Optim++) and a prompt debiasing mechanism (PromptDebias). Supporting diverse visual prompt types, including bounding boxes, masks, scribbles, and points, our method demonstrates strong out-of-domain generalization and interpretability. The code is available at https://github.com/mrwu-mac/ControlMLLM.
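The test-time loop described above, optimizing a latent modifier of the visual tokens so that cross-modal attention concentrates on a target region, can be sketched as follows. This is a minimal toy illustration, not the authors' implementation: the single-head attention, tensor sizes, energy definition, and optimizer settings are all assumptions made for clarity.

```python
import torch

torch.manual_seed(0)

# Toy stand-ins for a frozen MLLM (nothing below is trained except `delta`).
d = 16                              # hidden dimension
n_vis = 8                           # number of visual tokens
text_q = torch.randn(1, d)          # query for one text token (frozen)
vis_tokens = torch.randn(n_vis, d)  # visual tokens from the frozen encoder

# User-specified region: here we want attention mass on visual tokens 2-3.
target = torch.zeros(n_vis)
target[2:4] = 1.0

# Learnable latent visual token modifier (the only optimized quantity).
delta = torch.zeros(n_vis, d, requires_grad=True)
opt = torch.optim.Adam([delta], lr=0.1)

def attention(q, k):
    # Single-head scaled dot-product attention weights over visual tokens.
    return torch.softmax(q @ k.T / d ** 0.5, dim=-1)

for _ in range(200):
    attn = attention(text_q, vis_tokens + delta)   # shape (1, n_vis)
    # Energy: attention mass falling OUTSIDE the target region.
    energy = (attn * (1 - target)).sum()
    opt.zero_grad()
    energy.backward()
    opt.step()

final_attn = attention(text_q, vis_tokens + delta.detach())
inside = (final_attn * target).sum().item()
print(f"attention inside target region: {inside:.3f}")
```

After a few hundred steps the attention mass shifts almost entirely onto the target tokens. In the actual framework, the energy function is task-specific and the attention maps come from the frozen MLLM's cross-modal layers rather than this toy projection.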