Test-Time Computing for Referring Multimodal Large Language Models

📅 2026-02-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of enabling fine-grained visual reasoning over user-specified regions without retraining or fine-tuning multimodal large language models (MLLMs). The authors propose ControlMLLM++, a framework that injects learnable visual prompts into a frozen MLLM during inference and optimizes a latent visual token modifier via a task-specific energy function to steer the model's attention toward target regions. Supporting diverse visual prompts, including bounding boxes, masks, scribbles, and points, the method achieves strong out-of-domain generalization and high interpretability without any additional training. The framework further introduces an improved optimization strategy (Optim++) and a prompt debiasing mechanism (PromptDebias), which together enhance inference stability and accuracy. Extensive experiments demonstrate consistently superior performance across prompt modalities.

📝 Abstract
We propose ControlMLLM++, a novel test-time adaptation framework that injects learnable visual prompts into frozen multimodal large language models (MLLMs) to enable fine-grained region-based visual reasoning without any model retraining or fine-tuning. Leveraging the insight that cross-modal attention maps intrinsically encode semantic correspondences between textual tokens and visual regions, ControlMLLM++ optimizes a latent visual token modifier during inference via a task-specific energy function to steer model attention toward user-specified areas. To enhance optimization stability and mitigate language prompt biases, ControlMLLM++ incorporates an improved optimization strategy (Optim++) and a prompt debiasing mechanism (PromptDebias). Supporting diverse visual prompt types, including bounding boxes, masks, scribbles, and points, our method demonstrates strong out-of-domain generalization and interpretability. The code is available at https://github.com/mrwu-mac/ControlMLLM.
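To make the mechanism in the abstract concrete, below is a minimal PyTorch sketch of the test-time steering loop: a frozen MLLM, a learnable additive modifier on the visual tokens, and an attention-based energy minimized for a few steps before decoding. The model interface (`encode_image`, `output_attentions`, `cross_attentions`, `generate`), the tensor shapes, and the exact energy form are illustrative assumptions, not the authors' implementation; see the linked repository for the real code.

```python
# Minimal sketch of the test-time attention steering described above.
# The model interface and the energy form are assumptions for illustration,
# not the authors' actual API (see the ControlMLLM repository for that).

import torch


def attention_energy(attn: torch.Tensor, region_mask: torch.Tensor) -> torch.Tensor:
    """Low when attention mass concentrates inside the target region.

    attn:        (N,) attention of the referring text tokens over N visual
                 tokens, assumed normalized to sum to 1.
    region_mask: (N,) binary mask rendered from the user's visual prompt
                 (box, mask, scribble, or point), flattened to match attn.
    """
    inside = (attn * region_mask).sum()
    outside = (attn * (1.0 - region_mask)).sum()
    return outside - inside  # penalize attention leaking outside the region


def steer_at_test_time(model, image, text, region_mask, steps=10, lr=1e-2):
    """Optimize only a latent visual-token modifier; the MLLM stays frozen."""
    for p in model.parameters():
        p.requires_grad_(False)

    visual_tokens = model.encode_image(image)               # frozen features
    modifier = torch.zeros_like(visual_tokens, requires_grad=True)
    optimizer = torch.optim.AdamW([modifier], lr=lr)

    for _ in range(steps):
        out = model(visual_tokens + modifier, text, output_attentions=True)
        # (batch, heads, text_len, num_visual) -> (num_visual,):
        # average over heads and over the referring text tokens.
        attn = out.cross_attentions[-1].mean(dim=1)[0].mean(dim=0)
        loss = attention_energy(attn, region_mask)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Decode with the optimized modifier folded into the visual tokens.
    return model.generate(visual_tokens + modifier.detach(), text)
```

Because gradients flow only into `modifier`, the pretrained weights never change, which is what makes this kind of approach training-free.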
Problem

Research questions and friction points this paper is trying to address.

test-time adaptation
multimodal large language models
visual reasoning
region-based grounding
referring expression
Innovation

Methods, ideas, or system contributions that make the work stand out.

test-time adaptation
visual prompting
multimodal large language models
cross-modal attention
prompt debiasing
👥 Authors

Mingrui Wu
XMU
MLLM, T2I

Hao Chen
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China.

Jiayi Ji
Rutgers University

Xiaoshuai Sun
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China.

Zhiyuan Liu
Tsinghua University
autonomous driving, traffic simulation

Liujuan Cao
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China.

Ming-Ming Cheng
Professor of Computer Science, Nankai University
Computer Vision, Computer Graphics, Visual Attention, Saliency

Rongrong Ji
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China.