Don't Show Pixels, Show Cues: Unlocking Visual Tool Reasoning in Language Models via Perception Programs

📅 2026-04-14
📈 Citations: 0
Influential: 0

🤖 AI Summary
Multimodal language models struggle to use the dense, pixel-level outputs of vision tools (such as depth or optical flow), which leaves them with weak perception and an overreliance on language priors. This work proposes Perception Programs (P²), a training-free, model-agnostic approach that uses procedural rules to translate vision tool outputs into compact, structured, language-native summaries that language models can parse and reason over directly. Instead of feeding raw tool outputs to the model, P² supplies these summaries. With GPT-5 Mini as the base model, it yields a 22% average gain across six perception tasks in the BLINK benchmark, raising accuracy from 41.35% to 86.47% on multi-view reasoning and from 52.42% to 81.45% on relative depth, which sets new state-of-the-art results. Smaller models also see absolute gains of 15–40%.

📝 Abstract
Multimodal language models (MLLMs) are increasingly paired with vision tools (e.g., depth, flow, correspondence) to enhance visual reasoning. However, despite access to these tool-generated visual cues, MLLMs often fail to benefit from them. Existing approaches typically feed raw tool outputs into the model, but these dense, pixel-level representations are misaligned with the language-native reasoning strengths of LLMs, leading to weak perception and reliance on language priors. We argue that, in problems where vision tools can provide the necessary visual cues, the bottleneck is not the number of tool calls or the size of the MLLM but how tool outputs are represented. We introduce Perception Programs (P²), a training-free, model-agnostic method that rewrites tool outputs into compact, structured, language-native summaries that MLLMs can directly parse and reason over. Across six perception-centric tasks in BLINK, P² consistently yields large improvements over base models and raw tool-augmented baselines. With GPT-5 Mini as the base model, P² raises its accuracy from 41.35% to 86.47% on multi-view reasoning and from 52.42% to 81.45% on relative depth, and achieves a 22% average gain across tasks, setting new state-of-the-art results. Even on smaller MLLMs, e.g., InternVL3.5-4B and Qwen3VL-4B, we observe 15–40% absolute gains from P², surpassing prior agentic, supervised, and RL-based tool-use methods, without any training or model modifications.
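
To make the idea of a language-native summary concrete, here is a minimal sketch of how a dense depth map might be reduced to a few lines of text about the points a question asks about. It is an illustration only, not the paper's actual P² rules or output format: the function names (`patch_mean`, `depth_cue_summary`), the wording of the summary lines, and the convention that smaller depth values mean closer to the camera are all assumptions made for this example.

```python
from statistics import mean


def patch_mean(depth, row, col, radius=1):
    """Average depth in a small window around (row, col), clipped to the map bounds."""
    rows, cols = len(depth), len(depth[0])
    values = [
        depth[r][c]
        for r in range(max(0, row - radius), min(rows, row + radius + 1))
        for c in range(max(0, col - radius), min(cols, col + radius + 1))
    ]
    return mean(values)


def depth_cue_summary(depth, points, radius=1):
    """Render a dense depth map as a short text summary of the queried points.

    Assumption (hypothetical convention): smaller depth values are closer
    to the camera. The output format is invented for illustration.
    """
    # Score each labeled point by its local mean depth, nearest first.
    scored = sorted(
        ((label, patch_mean(depth, r, c, radius)) for label, (r, c) in points.items()),
        key=lambda item: item[1],
    )
    lines = [
        f"point {label} at (row={points[label][0]}, col={points[label][1]}): "
        f"mean depth {value:.2f}"
        for label, value in scored
    ]
    lines.append("nearest-to-farthest order: " + " < ".join(label for label, _ in scored))
    return "\n".join(lines)


# Toy depth map: the left half is near (1.0), the right half is far (5.0).
depth_map = [
    [1.0, 1.0, 5.0, 5.0],
    [1.0, 1.0, 5.0, 5.0],
    [1.0, 1.0, 5.0, 5.0],
]
summary = depth_cue_summary(depth_map, {"A": (1, 0), "B": (1, 3)})
print(summary)
```

In this sketch the model would receive three short lines of text in place of the full depth array, which is the kind of compact, structured representation the abstract argues language models can reason over more reliably than raw pixel-level outputs.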
Problem

Research questions and friction points this paper is trying to address.

multimodal language models
visual reasoning
vision tools
tool output representation
perception
Innovation

Methods, ideas, or system contributions that make the work stand out.

Perception Programs
multimodal language models
visual tool reasoning
language-native representation
training-free method